propublica / Capitol-Words

Scraping, parsing and indexing the daily Congressional Record to support phrase search over time, and by legislator and date
BSD 3-Clause "New" or "Revised" License

Title changes sometimes are not correctly unattributed #61

Open drinks opened 11 years ago

drinks commented 11 years ago

Reference (search "By Mr. HARKIN (for himself,"):

http://www.gpo.gov/fdsys/pkg/CREC-2013-03-05/html/CREC-2013-03-05-pt1-PgS1129.htm vs. http://capitolwords.org/date/2013/03/05/S1129_statements-on-introduced-bills-and-joint-resolutio/

Suspect a whitespace issue with the parser.

drinks commented 11 years ago

Relevant line introduced in b091e8a7: https://github.com/sunlightlabs/Capitol-Words/blame/master/parser/parser.py#L163

Expects titles to be a single line.
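One possible way to relax the single-line assumption, as a rough sketch only: keep consuming lines until the parentheses opened by "(for himself," are closed. The function name `collect_title` and the sample legislator names below are hypothetical and are not taken from parser.py; the real parser may need a different stopping rule.

```python
def collect_title(lines, start):
    """Join lines from `start` until parentheses balance.

    Hypothetical helper, not part of the actual parser.
    Returns (joined_title, index_of_next_unconsumed_line).
    """
    parts = []
    depth = 0
    i = start
    while i < len(lines):
        text = lines[i].strip()
        parts.append(text)
        # Track unclosed "(" so a wrapped cosponsor list stays in one title.
        depth += text.count("(") - text.count(")")
        i += 1
        if depth <= 0:
            break
    return " ".join(parts), i


# Placeholder input: the second line's names are invented for illustration.
sample = [
    "      By Mr. HARKIN (for himself,",
    "        Mr. EXAMPLE, and Ms. SAMPLE):",
    "  S. 000. A bill ...",
]
title, next_index = collect_title(sample, 0)
```

A title that already fits on one line has no unbalanced parenthesis, so it is returned after a single iteration and existing behavior is unchanged.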

drinks commented 11 years ago

Encapsulating logic seems to be:

Should be attributed as 'recorder'. The previous line, with 8 centered underscores, is truncated to empty by clean_line().
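A minimal sketch of that failure mode, assuming a stand-in for clean_line() that drops underscore runs and surrounding whitespace (the real function in parser.py may behave differently). The names `clean_line_sketch` and `is_separator` are hypothetical. The idea is to test the raw line for the separator before cleaning, so the information is not lost.

```python
import re


def clean_line_sketch(line):
    """Hypothetical stand-in for clean_line(): strips underscore runs and whitespace."""
    return re.sub(r"_{2,}", "", line).strip()


def is_separator(raw_line):
    """True if the raw (uncleaned) line is an indented run of exactly 8 underscores."""
    return re.match(r"^\s+_{8}\s*$", raw_line) is not None


raw = " " * 32 + "_" * 8

# After cleaning, the separator is indistinguishable from a blank line.
cleaned = clean_line_sketch(raw)

# Checking the raw line first preserves the signal that a new title follows.
separator_seen = is_separator(raw)
```

Under these assumptions, a parser could remember `separator_seen` for the next non-empty line and attribute that line as 'recorder'.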