unitedstates / congressional-record

A parser for the Congressional Record.

Failing to Download Requested ZIP Files #48

Open blaschke opened 2 years ago

blaschke commented 2 years ago

The download of congressional records fails if the govinfo.gov server cannot deliver the requested ZIP file within two minutes (i.e., four retries at 30-second intervals). Here's an example from the cr2.log file:

INFO:root:Sending request on 2022-06-28 14:09
DEBUG:urllib3.connectionpool:https://www.govinfo.gov:443 "GET /content/pkg/CREC-2018-01-10.zip HTTP/1.1" 503 None
DEBUG:urllib3.util.retry:Incremented Retry for (url='/content/pkg/CREC-2018-01-10.zip'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /content/pkg/CREC-2018-01-10.zip
DEBUG:urllib3.connectionpool:https://www.govinfo.gov:443 "GET /content/pkg/CREC-2018-01-10.zip HTTP/1.1" 503 None
DEBUG:urllib3.util.retry:Incremented Retry for (url='/content/pkg/CREC-2018-01-10.zip'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /content/pkg/CREC-2018-01-10.zip
DEBUG:urllib3.connectionpool:https://www.govinfo.gov:443 "GET /content/pkg/CREC-2018-01-10.zip HTTP/1.1" 503 None
DEBUG:urllib3.util.retry:Incremented Retry for (url='/content/pkg/CREC-2018-01-10.zip'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Retry: /content/pkg/CREC-2018-01-10.zip
DEBUG:urllib3.connectionpool:https://www.govinfo.gov:443 "GET /content/pkg/CREC-2018-01-10.zip HTTP/1.1" 503 None
DEBUG:root:Request headers received with code 503
WARNING:root:Unexpected condition, not continuing: 503
WARNING:root:Failed to download file https://www.govinfo.gov/content/pkg/CREC-2018-01-10.zip
WARNING:root:fdsysDL received report that download for 2018-01-10 did not complete.
INFO:root:No record on this day, not trying to extract
WARNING:root:[Errno 2] No such file or directory: 'output/2018/CREC-2018-01-10/mods.xml', skipping.

The corresponding ZIP file in the above example does indeed exist. It simply takes more than two minutes for the govinfo.gov server to create it on the fly. Please note that once the ZIP file is created, it is available for download (for an unknown amount of time, though at least 24 hours). Running the download code again a couple of minutes after the first failure then gives you the file without trouble.

Perhaps a solution is simply to request all the congressional records in question first, so that the server starts creating the ZIP files, and then to come around again to collect them once the server has had enough time to create them.
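A minimal sketch of that two-pass idea, not the repository's actual code. It assumes that a failed request still causes the server to start building the ZIP, as described above. The names `fetch_zip` and `download_in_rounds`, the five-minute wait, and the round limit are all hypothetical; only the URL pattern comes from the log.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# URL pattern taken from the log above.
ZIP_URL = "https://www.govinfo.gov/content/pkg/CREC-{day}.zip"


def fetch_zip(day: str, timeout: float = 30.0) -> Optional[bytes]:
    """Try once to download the ZIP for one day (hypothetical helper).

    Returns None on any HTTP error (such as 503) or network failure.
    """
    try:
        with urllib.request.urlopen(ZIP_URL.format(day=day), timeout=timeout) as response:
            return response.read()
    except OSError:  # covers urllib.error.URLError, HTTPError and timeouts
        return None


def download_in_rounds(
    days: Iterable[str],
    fetch: Callable[[str], Optional[bytes]] = fetch_zip,
    wait_seconds: float = 300.0,  # assumed pause; tune to the server's behaviour
    max_rounds: int = 3,          # assumed limit on how often to come around again
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, bytes], List[str]]:
    """Request every day once, then revisit only the days that failed."""
    results: Dict[str, bytes] = {}
    pending = list(days)
    for round_number in range(max_rounds):
        still_pending = []
        for day in pending:
            data = fetch(day)
            if data is None:
                still_pending.append(day)  # the server may still be building the ZIP
            else:
                results[day] = data
        pending = still_pending
        if not pending:
            break
        if round_number < max_rounds - 1:
            sleep(wait_seconds)  # give the server time before the next round
    return results, pending
```

The second return value lists the days that never succeeded, so the caller can tell "no record on this day" apart from "the server was not ready" instead of silently skipping them.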

Moreover, I noticed that the code runs through every date within a given range to check whether files are available, which seems a waste of resources. I suggest using the govinfo API (https://api.govinfo.gov/docs/) to check first which congressional records are actually available within a given date range.
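A rough sketch of such a lookup, to be checked against the API docs before use. It assumes the API's `published` service accepts a start and end issue date with a `collection=CREC` filter, and that each returned package carries a `dateIssued` field and each page a `nextPage` link. The function names are hypothetical, and an API key is required.

```python
import json
import urllib.parse
import urllib.request
from typing import List

API_ROOT = "https://api.govinfo.gov"


def published_url(start: str, end: str, api_key: str, page_size: int = 100) -> str:
    """Build the first-page URL for CREC packages issued between two dates.

    The path and parameter names are assumptions based on the API docs.
    """
    query = urllib.parse.urlencode(
        {"collection": "CREC", "offsetMark": "*", "pageSize": page_size, "api_key": api_key}
    )
    return f"{API_ROOT}/published/{start}/{end}?{query}"


def issued_dates(page: dict) -> List[str]:
    """Extract the sorted, de-duplicated issue dates from one response page."""
    return sorted({pkg["dateIssued"] for pkg in page.get("packages", []) if "dateIssued" in pkg})


def list_available_days(start: str, end: str, api_key: str) -> List[str]:
    """Follow the pagination links and collect every day that has a record."""
    days = set()
    url = published_url(start, end, api_key)
    while url:
        with urllib.request.urlopen(url, timeout=30) as response:
            page = json.load(response)
        days.update(issued_dates(page))
        url = page.get("nextPage")
        # Assumption: the nextPage link may omit the key, so add it back if missing.
        if url and "api_key=" not in url:
            url += ("&" if "?" in url else "?") + urllib.parse.urlencode({"api_key": api_key})
    return sorted(days)
```

The downloader would then request ZIP files only for the dates returned by `list_available_days`, rather than probing every calendar day in the range.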

nclarkjudd commented 6 months ago

Hi @blaschke --- thanks, belatedly, for your report.

I wasn't in a position to handle anything other than critical bugs when you created this issue, although I should have responded anyway.

If you are still interested, you are welcome to send a MR that implements the changes you suggest. I simply had not noticed how badly this repo needed freshening up, and I am attending to that now. While I may not have the overhead to make this improvement myself, I have spruced things up here so that it should be easier to contribute fixes.