unitedstates / congressional-record

A parser for the Congressional Record.

Unfork the Congressional Record parser! #23

Closed nclarkjudd closed 7 years ago

nclarkjudd commented 7 years ago

This PR would replace the old Sunlight version of congressional-record with my fork of the repository.

For anyone who is using this parser and expects XML, this is a breaking change. The parser now gives you JSON or CSV output instead.

konklone commented 7 years ago

@nclarkjudd I recognize this is a huge set of changes over a mostly unmaintained thing, so it's not the biggest deal if there's a breaking test, but do you happen to know why one test is breaking?

It's from the Travis build:

======================================================================
FAIL: test_noTextInLineBreaks_Fresh (tests.test_suite.testJson)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/unitedstates/congressional-record/tests/test_suite.py", line 137, in test_noTextInLineBreaks_Fresh
    'Check {0}'.format(apath))
AssertionError: Check output/2014/CREC-2014-01-10/json/CREC-2014-01-10-pt1-PgH139-3.json

konklone commented 7 years ago

In any case, I'm a :+1: to merge and make you a maintainer of the repo.

@LindsayYoung @jcarbaugh @drinks Any thoughts on this PR and letting the repo move in the direction @nclarkjudd is taking it?

nclarkjudd commented 7 years ago

Yup, I know exactly why it fails. Right now the parser can identify only two or three different things, the most important of which is a speech. Because it can only really pick up on the beginning and end of speeches, it assigns a lot of text to elements labeled "line breaks." A complete parser would have nothing inside line breaks except whitespace, or at the very least no text.

Reasonable people might take issue with the way I have extended the test suite: right now it downloads a random day from the GPO and runs a number of tests on it, including this one. So, depending on the day it pulls, this test sometimes passes and sometimes fails. For what it's worth, I've tracked down a number of these failures to make sure the text in line breaks isn't speeches.
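
The check described above, that line-break elements should hold nothing but whitespace, can be sketched roughly as follows. This is an illustration only, not the code in `tests/test_suite.py`: the function name `linebreaks_with_text` is hypothetical, and the document layout (a top-level `content` list whose items carry `kind` and `text` keys) is an assumption about the JSON output.

```python
def linebreaks_with_text(doc):
    """Return the 'linebreak' items whose text is not purely whitespace.

    Assumes (hypothetically) that `doc` is a parsed JSON document with a
    top-level "content" list of items carrying "kind" and "text" keys.
    """
    return [
        item
        for item in doc.get("content", [])
        if item.get("kind") == "linebreak" and (item.get("text") or "").strip()
    ]


# Made-up sample data: one speech, one clean line break, one line break
# that has absorbed text the parser could not classify.
sample_doc = {
    "content": [
        {"kind": "speech", "text": "Mr. Speaker, I rise today..."},
        {"kind": "linebreak", "text": "\n  \n"},
        {"kind": "linebreak", "text": "  text the parser could not classify\n"},
    ]
}

offenders = linebreaks_with_text(sample_doc)
print(len(offenders))
```

A test built on a helper like this would fail whenever `offenders` is non-empty, which matches the behavior described: any day containing text the parser cannot classify trips the assertion.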

konklone commented 7 years ago

Good enough for me! Seems like eventually a deterministic test suite would be ideal (with maybe a non-deterministic one that can be run one-off, and not every build).

Merging in, and thank you for being up for moving your work into our happy home!
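
One way the split suggested above (a deterministic suite on every build, plus a random-day suite run only on request) might look is sketched below. Everything here is an assumption for illustration: the helper `pick_day`, the environment variable `CREC_RANDOM_TESTS`, and the date range are hypothetical, and the fixed date is simply the one that appears in the failing path from the Travis log.

```python
import datetime
import os
import random
import unittest

# Fixed day taken from the failing path in the Travis log above.
FIXED_DAY = datetime.date(2014, 1, 10)
# Hypothetical range to sample random days from.
RANGE_START = datetime.date(2014, 1, 1)
RANGE_DAYS = 365


def pick_day(randomize, rng=None):
    """Return the fixed day, or a random day in the range when randomize is True."""
    if not randomize:
        return FIXED_DAY
    rng = rng or random.Random()
    return RANGE_START + datetime.timedelta(days=rng.randrange(RANGE_DAYS))


class FixedDayTests(unittest.TestCase):
    """Runs on every build: always exercises the same day."""

    def test_day_is_fixed(self):
        self.assertEqual(pick_day(randomize=False), FIXED_DAY)


# Hypothetical opt-in switch: skipped unless the variable is set to "1".
@unittest.skipUnless(
    os.environ.get("CREC_RANDOM_TESTS") == "1",
    "random-day tests run only when CREC_RANDOM_TESTS=1",
)
class RandomDayTests(unittest.TestCase):
    """One-off run: exercises a randomly chosen day."""

    def test_day_is_in_range(self):
        day = pick_day(randomize=True)
        self.assertGreaterEqual(day, RANGE_START)
        self.assertLess(day, RANGE_START + datetime.timedelta(days=RANGE_DAYS))
```

With this arrangement a CI build would see the random-day class reported as skipped, while a maintainer could set the variable locally to probe other days.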