unitedstates / congressional-record

A parser for the Congressional Record.

Parser doesn't pick up on a new speaker in 2 specific cases #18

Closed by nclarkjudd 6 years ago

nclarkjudd commented 9 years ago

Consider 2013/CREC-2013-10-30/text/CREC-2013-10-30-pt1-PgH6909.txt and 2013/CREC-2013-10-16/text/CREC-2013-10-16-pt1-PgE1522-2.txt

When these files pass through the parser, Debbie Wasserman Schultz is not recognized as a new speaker, and the new "speaking" tags are not added in either case. (Case one: Speaker A recognizes or yields the floor to Speaker B. Case two: Speaker A is entering additional comments into the record.)

It's a little baffling as to why, though.

Some items that have this problem in congressional_record don't appear to have the same problem in capitol_words. Consider DWS (Debbie Wasserman Schultz) here: http://capitolwords.org/date/2009/07/15/H8117-4_energy-and-water-development-and-related-agencies-/

I don't think this has to do with DWS's name. Passing each line of the two 2013 files above through re.search(re_newspeaker, line), we see that the pattern matches and DWS's name is in the 'name' group both times. So nothing seems to be wrong with the regex.
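For anyone who wants to repeat that check, here is a minimal sketch of the line-by-line test. The pattern below is a hypothetical, much-simplified stand-in for the parser's real re_newspeaker (which is not reproduced here), and the sample lines are made up for illustration rather than quoted from the Record.

```python
import re

# ASSUMPTION: simplified stand-in for the parser's re_newspeaker.
# It expects an indented line starting with a title and an all-caps
# surname, followed by a period and a space.
RE_NEWSPEAKER_SKETCH = re.compile(
    r"^\s+(?P<name>(?:Mr|Mrs|Ms)\. [A-Z][A-Z' -]*[A-Z])\. "
)


def find_speaker_lines(lines):
    """Return (line_number, name) for every line the pattern matches."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = RE_NEWSPEAKER_SKETCH.search(line)
        if match:
            hits.append((number, match.group("name")))
    return hits


# Made-up sample lines in the general shape of Congressional Record text.
sample = [
    "  Ms. WASSERMAN SCHULTZ. Mr. Speaker, I rise today.",
    "  I thank the gentleman.",
    "  Mr. Speaker, I yield.",
]
print(find_speaker_lines(sample))
```

Only the first sample line should be reported, since "Mr. Speaker" is a form of address rather than an all-caps speaker name.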

We check for a new speaker only in very specific circumstances: when there is a new paragraph. If we add a rule in is_new_paragraph such that a line matching re_newspeaker triggers a new paragraph, the resulting file is marked up in a way that is totally wrong. (Try it!) So the issue seems to be in the machinery around building new speaking and speaker tags.
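To make the gating concrete, here is a toy sketch of how a line can match the speaker pattern and still go untagged when the speaker check only runs at a paragraph boundary. Everything in it is an assumption for illustration: the regex is a simplified stand-in for re_newspeaker, is_new_paragraph_sketch uses a made-up "previous line was blank" rule, and the sample lines are invented. It does not reproduce the parser's actual logic or claim to be the cause of this bug.

```python
import re

# ASSUMPTION: simplified stand-in for the parser's re_newspeaker.
RE_NEWSPEAKER_SKETCH = re.compile(
    r"^\s+(?P<name>(?:Mr|Mrs|Ms)\. [A-Z][A-Z' -]*[A-Z])\. "
)


def is_new_paragraph_sketch(previous_line):
    # ASSUMPTION: toy rule, a paragraph starts only after a blank line.
    return previous_line.strip() == ""


def tag_speakers(lines, gate_on_paragraph=True):
    """Collect speaker names, optionally only at paragraph boundaries."""
    found = []
    previous = ""  # treat the start of the file as a paragraph boundary
    for line in lines:
        match = RE_NEWSPEAKER_SKETCH.search(line)
        if match and (not gate_on_paragraph or is_new_paragraph_sketch(previous)):
            found.append(match.group("name"))
        previous = line
    return found


# Invented example: the second speaker follows the first with no blank line.
lines = [
    "  Mr. SMITH. I yield to the gentlewoman from Florida.",
    "  Ms. WASSERMAN SCHULTZ. I thank the gentleman.",
]
print(tag_speakers(lines))                           # gated check
print(tag_speakers(lines, gate_on_paragraph=False))  # ungated check
```

With the gate on, only the first speaker is found even though both lines match the pattern; with it off, both are found. As noted above, simply removing the gate in the real parser produces badly wrong markup, so this is a way to see the symptom, not a fix.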

nclarkjudd commented 6 years ago

Fixed with https://github.com/unitedstates/congressional-record/commit/39c70161673b0fa43e94565e8f0dc29116aae102