propublica / Capitol-Words

Scraping, parsing and indexing the daily Congressional Record to support phrase search over time, and by legislator and date
BSD 3-Clause "New" or "Revised" License

speaker_raw capitalization #59

Closed: haincha closed this issue 11 years ago

haincha commented 11 years ago

Nothing too crazy, just something I noticed. Pulling the ['speaker_raw'] field gives you a lowercase phrase, e.g. "mr. such and so". Is there any way this can be corrected, or would it have to be fixed manually in each article in the database?

drinks commented 11 years ago

@haincha, First and foremost, sorry! For some reason I didn't get an email about this issue. I wouldn't recommend using the speaker_raw field for any practical display use--it's not intended to be more than a paper trail of the actual text encountered in the record, more useful in debugging than anything worthwhile. In most cases, the original text is all uppercase--so there's not a great naïve source of data for correctly casing names to begin with, but you're correct in the observation that our Solr index stores it case-insensitively. I'd point you instead toward a combination of speaker_first and speaker_last, or even resolving the speaker_bioguide against something like https://github.com/unitedstates/congress-legislators to get names.
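For anyone landing here later, the suggestion above could look roughly like the sketch below. It is only an illustration, not code from this project: the field names `speaker_first`, `speaker_last`, `speaker_bioguide` and `speaker_raw` come from this thread, while the helper names, the sample records, and the placeholder ID `X000001` are made up. The legislator record shape (`id.bioguide`, `name.official_full`, `name.first`, `name.last`) is an assumption about the congress-legislators data after it has been loaded into Python dicts, so check it against the actual files.

```python
def build_bioguide_index(legislators):
    """Map bioguide ID -> display name.

    `legislators` is assumed to be a list of dicts shaped like
    {"id": {"bioguide": ...}, "name": {"first": ..., "last": ..., "official_full": ...}}.
    """
    index = {}
    for person in legislators:
        bioguide = person.get("id", {}).get("bioguide")
        if not bioguide:
            continue
        name = person.get("name", {})
        # Prefer the full official name when present, else join first + last.
        full = name.get("official_full") or " ".join(
            part for part in (name.get("first"), name.get("last")) if part
        )
        if full:
            index[bioguide] = full
    return index


def display_name(entry, bioguide_index=None):
    """Pick a display name for one result, avoiding speaker_raw where possible."""
    bioguide = entry.get("speaker_bioguide")
    if bioguide_index and bioguide in bioguide_index:
        return bioguide_index[bioguide]
    parts = (entry.get("speaker_first"), entry.get("speaker_last"))
    joined = " ".join(p.strip() for p in parts if p and p.strip())
    # Last resort: the lowercased paper-trail text.
    return joined or entry.get("speaker_raw", "")


# Hypothetical sample data, for illustration only.
legislators = [
    {"id": {"bioguide": "X000001"},
     "name": {"first": "Jane", "last": "Example", "official_full": "Jane Q. Example"}},
]
entry = {
    "speaker_raw": "ms. example",
    "speaker_first": "Jane",
    "speaker_last": "Example",
    "speaker_bioguide": "X000001",
}

index = build_bioguide_index(legislators)
print(display_name(entry, index))  # resolved through the bioguide index
print(display_name(entry))         # falls back to speaker_first + speaker_last
```

The bioguide lookup is tried first because it does not depend on how the name was cased in the Record; `speaker_raw` is used only when nothing else is available.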

haincha commented 11 years ago

Oh, it is quite okay. It was just something I had encountered while working through the Codecademy API course. I appreciate the email back.

Chase

(In reply to Dan Drinkard's comment of Thu, Feb 21, 2013 at 1:37 PM, quoted in full earlier in this thread.)