Consider 2013/CREC-2013-10-30/text/CREC-2013-10-30-pt1-PgH6909.txt and 2013/CREC-2013-10-16/text/CREC-2013-10-16-pt1-PgE1522-2.txt
When these files are passed through the parser, Debbie Wasserman Schultz is not recognized as a new speaker, and the new "speaking" tags are not added in either case. (Case one: Speaker A recognizes or yields the floor to Speaker B. Case two: Speaker A is entering additional comments into the record.)
It's a little baffling as to why, though.
Some items that have this problem in congressional_record don't appear to have the same problem in capitol_words. Consider DWS here: http://capitolwords.org/date/2009/07/15/H8117-4_energy-and-water-development-and-related-agencies-/
I don't think this has to do with DWS' name. Passing each line of the above two files from 2013 through `re.search(re_newspeaker, line)`, we see that the pattern matches and DWS' name is in the 'name' group both times. So nothing seems to be wrong with the regex.

We check for a new speaker only in very specific circumstances -- when there is a new paragraph. If we add a rule in `is_new_paragraph` such that a line matching `re_newspeaker` triggers a new paragraph, the resulting file is marked up in a way that is totally wrong. (Try it!) So the issue seems to be in the machinery around building new speaking and speaker tags.
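
For anyone who wants to repeat the regex check, here is a rough sketch of the line-by-line test described above. The pattern `RE_NEWSPEAKER_SKETCH` is a simplified stand-in I made up for illustration, not the parser's actual `re_newspeaker`, and the sample lines are invented rather than copied from the two files. The helper `find_speaker_lines` is also hypothetical.

```python
import re

# Hypothetical, simplified stand-in for the parser's re_newspeaker.
# The real pattern is more involved; this one only needs a 'name' group.
RE_NEWSPEAKER_SKETCH = re.compile(
    r"^\s+(?P<name>(?:Mr|Ms|Mrs)\. [A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)\."
)


def find_speaker_lines(lines, pattern=RE_NEWSPEAKER_SKETCH):
    """Return (line_number, name) for every line the pattern matches."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = pattern.search(line)
        if match:
            hits.append((number, match.group("name")))
    return hits


# Invented sample text, loosely shaped like Congressional Record lines.
sample = [
    "  Mr. SMITH. I yield to the gentlewoman from Florida.",
    "  Ms. WASSERMAN SCHULTZ. Mr. Speaker, I rise today to speak.",
    "and this line continues the same paragraph.",
]

for number, name in find_speaker_lines(sample):
    print(f"line {number}: {name}")
```

Swapping in the real `re_newspeaker` and the lines of an actual file should show whether the pattern matches where a new speaker starts, which separates a regex problem from a problem in the tag-building code.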