propublica / Capitol-Words

Scraping, parsing and indexing the daily Congressional Record to support phrase search over time, and by legislator and date
BSD 3-Clause "New" or "Revised" License

Paragraphs not getting broken up correctly #53

Open konklone opened 12 years ago

konklone commented 12 years ago

If you go to this page: http://capitolwords.org/date/2011/11/03/H7273-6_motion-to-instruct-conferees-on-hr-2112-agricultur

And Ctrl+F for "Despite Mayor", it jumps to a very long paragraph of text that is actually many paragraphs run together. Besides jumbling the paragraphs together in the HTML, this also means the API returns huge strings in the 'speaking' array.

drinks commented 12 years ago

For posterity: http://cl.ly/image/3p1f2V092O3z

Left pane: XML marked-up text
Center pane: Solr text
Right pane: Raw text

The whole block is treated as a quote, even though it appears to be submitted directly to the record. I'll figure out something to do with this...
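A rough sketch of one possible approach, not what the parser currently does: re-split an oversized block into paragraphs by indentation. It assumes that, in the raw text, the first line of each paragraph is indented further than its continuation lines, including inside an indented quote block. That assumption would need checking against the actual raw files. The function name `split_paragraphs` and the sample text are hypothetical.

```python
from typing import List


def split_paragraphs(block: str) -> List[str]:
    """Split a raw text block into paragraphs using indentation.

    Assumption (not verified against the real data): a line indented
    further than the block's minimum indent starts a new paragraph.
    Blank lines are also treated as paragraph breaks.
    """
    lines = [ln.rstrip() for ln in block.splitlines()]
    non_empty = [ln for ln in lines if ln.strip()]
    if not non_empty:
        return []

    def indent_of(ln: str) -> int:
        return len(ln) - len(ln.lstrip(" "))

    # The smallest indent in the block is taken as the continuation-line indent.
    base = min(indent_of(ln) for ln in non_empty)

    paragraphs: List[str] = []
    current: List[str] = []
    for ln in lines:
        if not ln.strip():
            # Blank line: close the paragraph in progress, if any.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if indent_of(ln) > base and current:
            # Deeper indent than the base: a new paragraph begins here.
            paragraphs.append(" ".join(current))
            current = []
        current.append(ln.strip())
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Hypothetical sample shaped like an indented quote block.
sample = (
    "    First paragraph starts here\n"
    "  and continues.\n"
    "    Second paragraph starts here\n"
    "  and continues too.\n"
)
paragraphs = split_paragraphs(sample)
```

Splitting this way would leave the quote detection itself untouched and only shorten the strings that end up in the 'speaking' array.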