unitedstates / congressional-record

A parser for the Congressional Record.

Parser expects the raw text to be preceded by a blank line #4

Closed by drinks 10 years ago

drinks commented 10 years ago

Super brittle!

LindsayYoung commented 10 years ago

I tried adding an extra blank line locally where necessary. That fixed the chamber and page identification problem, but I still got the same "list index out of range" error on 2 Senate docs:

CREC-2014-01-21-pt1-PgS463-2.txt

CREC-2014-01-21-pt1-PgS463-3.txt

Dan, did adding a line solve this problem for you or is this a different problem?
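
For anyone following along, the local workaround described above (prepending a blank line when a file lacks one) might look roughly like the sketch below. The helper name `ensure_leading_blank_line` is hypothetical and not part of the parser. It only normalizes the raw text before it is handed to the parser, and it assumes the parser needs exactly one leading blank line.

```python
def ensure_leading_blank_line(text):
    """Return text that is guaranteed to start with a blank line.

    Hypothetical helper, not part of the congressional-record parser.
    A "blank line" here means a first line that is empty or whitespace-only
    and is terminated by a newline.
    """
    first_line = text.split("\n", 1)[0]
    if "\n" in text and first_line.strip() == "":
        return text  # already starts with a blank line; leave untouched
    return "\n" + text  # prepend the blank line the parser expects


if __name__ == "__main__":
    # Illustrative input only; real files are the CREC-*.txt documents.
    sample = "[Congressional Record]\nbody text\n"
    print(repr(ensure_leading_blank_line(sample)))
```

Normalizing the text in a separate step keeps the source files unmodified on disk, which makes it easier to tell input problems apart from parser bugs.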

drinks commented 10 years ago

Can you post your log output? I saw all files parse correctly after making that adjustment.

LindsayYoung commented 10 years ago

It was more docs than I thought: the same number of files still fail to parse. The log is pasted below. The files that do go through are parsed correctly this time, unlike before the blank line was added.

$ python2.7 parser.py -id ../crtest

flag status: False

Error processing file: ../crtest/CREC-2014-01-21-pt1-PgE109-2.txt: list index out of range

flag status: False

Error processing file: ../crtest/CREC-2014-01-21-pt1-PgE109-3.txt: list index out of range

flag status: False

Error processing file: ../crtest/CREC-2014-01-21-pt1-PgE109-4.txt: list index out of range

flag status: False

saved file /Users/lindsayyoung/Dropbox/Projects/crtest/__parsed/CREC-2014-01-21-pt1-PgH1247-10.xml to disk

flag status: False

saved file /Users/lindsayyoung/Dropbox/Projects/crtest/__parsed/CREC-2014-01-21-pt1-PgH1247-2.xml to disk

flag status: False

saved file /Users/lindsayyoung/Dropbox/Projects/crtest/__parsed/CREC-2014-01-21-pt1-PgH1247-3.xml to disk

flag status: False

Error processing file: ../crtest/CREC-2014-01-21-pt1-PgH1247-4.txt: list index out of range

flag status: False

saved file /Users/lindsayyoung/Dropbox/Projects/crtest/__parsed/CREC-2014-01-21-pt1-PgH1247-5.xml to disk

flag status: False

Error processing file: ../crtest/CREC-2014-01-21-pt1-PgH1247-6.txt: list index out of range

flag status: False

no match-- orphaned

Orphaned Tags:

('', 5, 16, 13) print

saved file /Users/lindsayyoung/Dropbox/Projects/crtest/__parsed/CREC-2014-01-21-pt1-PgH1247-7.xml to disk

flag status: False

saved file /Users/lindsayyoung/Dropbox/Projects/crtest/__parsed/CREC-2014-01-21-pt1-PgH1247-8.xml to disk

flag status: False

Error processing file: ../crtest/CREC-2014-01-21-pt1-PgH1247-9.txt: list index out of range

flag status: False

saved file /Users/lindsayyoung/Dropbox/Projects/crtest/__parsed/CREC-2014-01-21-pt1-PgH1247.xml to disk

flag status: False

saved file /Users/lindsayyoung/Dropbox/Projects/crtest/__parsed/CREC-2014-01-21-pt1-PgH1249-2.xml to disk

UNRECOGNIZED STATE (but that's ok): To the Senate:

flag status: False

Error processing file: ../crtest/CREC-2014-01-21-pt1-PgS463-2.txt: list index out of range

flag status: False

Error processing file: ../crtest/CREC-2014-01-21-pt1-PgS463-3.txt: list index out of range

flag status: False

saved file /Users/lindsayyoung/Dropbox/Projects/crtest/__parsed/CREC-2014-01-21-pt1-PgS463.xml to disk

drinks commented 10 years ago

Sorted offline; the source data was the culprit.