unitedstates / congressional-record

A parser for the Congressional Record.

Downloader not working right now #38

Closed BenQuigley closed 5 years ago

BenQuigley commented 5 years ago

This script appears not to be working right now. Example command:

$ python -m congressionalrecord.cli 1995-01-03 1995-01-10 pg --csvpath congressional-record\output
FileNotFoundError: [Errno 2] No such file or directory: 'output/1995/CREC-1995-01-03.zip'

The logs show that we're not finding the file:

INFO:root:No download for https://www.govinfo.gov/content/pkg/CREC-1995-01-03.zip and terminating with unexpected condition.

I just went there myself and yep, govinfo isn't serving that way anymore. I paged through their documentation but wasn't able to find a quick fix to get the CREC info - maybe we need an API key? Am I doing something wrong?

BenQuigley commented 5 years ago

Apparently python -m congressionalrecord.cli 2018-01-03 2018-01-03 pg (with the date changed from the link I tried in my first post) is working - maybe there was just no data served up on that day.

DawsonHoney commented 5 years ago

I have been having this issue too. It appears to happen only on weekends and holidays, when Congress is not in session. It's definitely not the code that's the problem. Maybe just modify the part that skips days that aren't in session?

nclarkjudd commented 5 years ago

It seems like we need to change the downloader internals to anticipate the new govinfo behavior.

I think you are correct that the program will throw an error and close if it is asked to download a day for which Congress is not in session.

The expected behavior is that the program generates a logging message and moves on to the next day in that case.
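
A minimal sketch of that expected behavior, not the project's actual code. `download_range`, `iter_days`, and the `fetch` callable are hypothetical names used only for illustration; `fetch` stands in for the real downloader and is assumed to return empty bytes when no issue exists for a day.

```python
import logging
from datetime import date, timedelta
from typing import Callable, Iterator, List

logger = logging.getLogger(__name__)


def iter_days(start: date, end: date) -> Iterator[date]:
    """Yield every calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def download_range(start: date, end: date,
                   fetch: Callable[[date], bytes]) -> List[date]:
    """Return the days that had data; log and skip the days that did not.

    `fetch` is a hypothetical stand-in for the real downloader. It is
    assumed to return empty bytes when nothing was published that day.
    """
    downloaded = []
    for day in iter_days(start, end):
        payload = fetch(day)
        if not payload:
            # Log and move on instead of raising and ending the whole run.
            logger.info("No Record published for %s, skipping.", day.isoformat())
            continue
        downloaded.append(day)
    return downloaded
```

With a `fetch` that only has data for 2017-01-03, calling `download_range(date(2017, 1, 2), date(2017, 1, 4), fetch)` would log two skipped days and return a list containing only that date.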

nclarkjudd commented 5 years ago

Try python -m congressionalrecord.cli 2017-01-02 2017-01-02 json

Look at the logfile in cr2.log:

INFO:root:Logging begins
DEBUG:root:Downloader object ready with params:
DEBUG:root:end=2017-01-02, do_mode=json
INFO:root:Sending request on 2018-12-27 11:26
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.govinfo.gov:443
DEBUG:urllib3.connectionpool:https://www.govinfo.gov:443 "GET /content/pkg/CREC-2017-01-02.zip HTTP/1.1" 200 0
DEBUG:root:Request headers received with code 200
INFO:root:Considering request successful.
INFO:root:No download for https://www.govinfo.gov/content/pkg/CREC-2017-01-02.zip and terminating with unexpected condition.

INFO:root:fdsysDL received expected condition True for downloader

Calling GovInfoDL from an IPython session, I see that when you ask GovInfo for a /pkg URL that doesn't exist, it returns a response with status code 200, but the body of the response is empty (note the "200 0" in the log above). As written, GovInfoDL expects to get a non-200 status code whenever something went wrong.

So, the way to solve this problem without altering the structure of the program is to fix GovInfoDL. We want to do it this way because the program should catch and handle errors as close as possible to where they happen, rather than downstream (compare to #40).

nclarkjudd commented 5 years ago

Turns out I put the error handling for the request itself in downloadRequest, which is even deeper in the internals.

I fixed this issue by requiring downloadRequest to get a status code of 200 and a non-empty response, otherwise it will return an error condition.
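
The check described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual downloadRequest code: `has_usable_payload` and `FakeResponse` are hypothetical names, and the response is assumed to expose `status_code` and `content` the way a requests-style response object does.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in exposing only the two attributes the check reads."""
    status_code: int
    content: bytes


def has_usable_payload(response) -> bool:
    """Treat a response as successful only if it is a 200 with a non-empty body.

    A 200 with an empty body is what GovInfo was observed to return for a
    package URL that does not exist, so the status code alone is not enough.
    """
    return response.status_code == 200 and len(response.content) > 0
```

Under this check, `FakeResponse(200, b"")` is rejected just like a 404 would be, so the caller can return its error condition instead of trying to open a zip file that was never written.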

Thanks @BenQuigley and @DawsonHoney for your help in finding, diagnosing, and patching this bug!