unitedstates / congressional-record

A parser for the Congressional Record.
Other
128 stars 40 forks source link

RegEx Updates #15

Closed Jace84 closed 9 years ago

Jace84 commented 9 years ago

Tweaked the RegEx a bit to generate more reliable data. For example in 1995 Mr. D'AMATO and Miss COLLINS were completely excluded from being parsed.

I also removed the lower case pattern in the RegEx as the record denotes all new speakers as such: Mr. NEWSPEAKER

Cspeisman commented 9 years ago

Hey @jprinc16 thank you for contributing. You make some great points regarding our regular expression and pointing out some cases we are missing.

It appears that your new code breaks on a few of the tests located in the test directory. Please be sure to run the tests locally and make sure your code passes on all the cases. Then, update your master branch and I'll be happy to merge in your changes!

Thanks, Corey

Jace84 commented 9 years ago

@Cspeisman I am working on this as we speak. Will hopefully have this fixed in short order.

I am passing locally.

python test/run

.....................

Ran 21 tests in 0.393s

OK jace@ubuntu:~/

I will work on making it cooperate with Travis CI.

Jace84 commented 9 years ago

@Cspeisman @LindsayYoung alright after going cross eyed for a few days I figured out the issue. Parser was not matching "of x" such as Ms. SEWELL of Alabama. This was because I changed the scope of the RegEx to where it would only match "of X" if it was a "Miss" speaking. All is well now, however, I am still miffed why the test were passing locally. If you have any ideas on this please let me know.

Short recap of this pull request:

Parser will now parse names with apostophes such as: Mr. D'AMATO Parser will now parse "Miss", such as: Miss Collins

Thanks for sharing your work and I am glad I could conbtribute.

Cspeisman commented 9 years ago

@jprinc16 Thanks so much - merging in now!

Not too sure why the local tests were passing for you but glad you could get to the bottom of it.