spulec closed 10 years ago
Well I think the USDA is very interesting! This scraper seems pretty straightforward. The one thing I'd request is adding support for their Investigation Bulletins, like this one.
They're not individual "reports", really, but they're very high value, have the date extractable from the URL, and whatever values need to be made up to make them fit seem worth it to me. They make terrific FOIA leads, too. Is it easy enough to add them in?
Also, I just noticed this report has a relative URL, while many others don't:
{
"agency": "aphis",
"agency_name": "Animal Plant Health Inspection Service",
"file_type": "pdf",
"inspector": "agriculture",
"inspector_url": "http://www.usda.gov/oig/",
"published_on": "2007-10-26",
"report_id": "33601-0009-CH_Redacted",
"title": "Controls Over Permits to Import Agricultural Products (PDF)",
"type": "report",
"url": "webdocs/33601-0009-CH_Redacted.pdf",
"year": 2007
}
This one too:
{
"agency": "rbeg",
"agency_name": "Rural Business Enterprise Grant",
"file_type": "pdf",
"inspector": "agriculture",
"inspector_url": "http://www.usda.gov/oig/",
"published_on": "2013-02-14",
"report_id": "34703-0001-31",
"title": "The Recovery Act - Rural Development's Rural Business Enterprise Grants Field Confirmations (PDF),",
"type": "report",
"url": "webdocs/34703-0001-31.pdf",
"year": 2013
}
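For reference, one way to make these relative links absolute is sketched below. This is only an illustration, not the scraper's actual code: `BASE_URL` and `absolutize` are hypothetical names, and it assumes relative links resolve against the `inspector_url` shown in the records.

```python
from urllib.parse import urljoin

# Assumption: relative links resolve against the inspector_url from the records above.
BASE_URL = "http://www.usda.gov/oig/"


def absolutize(url, base=BASE_URL):
    """Return an absolute URL; already-absolute URLs pass through unchanged."""
    return urljoin(base, url)


if __name__ == "__main__":
    print(absolutize("webdocs/33601-0009-CH_Redacted.pdf"))
    print(absolutize("webdocs/34703-0001-31.pdf"))
```

`urljoin` leaves a URL alone when it already carries a scheme and host, so the same call can be applied to every scraped link.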
Also, the url field should obviously be checked in the validator to make sure it starts with http:// or https://. I'll add that promptly.
OK, I added the validation in #77, and merged the fix into this branch.
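A rough sketch of what such a check might look like, for anyone following along. The function name `validate_url` and the convention of returning an error string are assumptions for illustration; the real validator in #77 may be structured differently.

```python
def validate_url(report):
    """Return an error message if the report's url is not absolute, else None.

    Hypothetical helper: `report` is assumed to be a dict like the records above.
    """
    url = report.get("url") or ""
    # str.startswith accepts a tuple, so both schemes are checked in one call.
    if not url.startswith(("http://", "https://")):
        return "url must start with http:// or https://, got: %r" % url
    return None


if __name__ == "__main__":
    print(validate_url({"url": "webdocs/34703-0001-31.pdf"}))
    print(validate_url({"url": "http://www.usda.gov/oig/webdocs/34703-0001-31.pdf"}))
```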
Both issues addressed with the two most recent commits.
A+, thank you!
Interesting - there were 3 404s overnight:
All of them resolve fine if the .PDF at the end is turned into .pdf. I'm not sure a blind .lower() is appropriate, though. Happy to resolve this when I get a sec.
Unfortunately, that seems to break some reports: http://www.usda.gov/oig/webdocs/04601-13-FM.PDF and http://www.usda.gov/oig/webdocs/FINALRPT.PDF.
There are about 20 reports that need to have PDF -> pdf. The most recent one is 1997 so I don't think this will necessarily be a problem going forward. Think we should just hardcode the report ids?
Yeah, if the most recent is 1997, and it's ~20, then I guess hardcoding is the best solution, ugly as it is. :/
Added with https://github.com/unitedstates/inspectors-general/commit/9d2c2ff403bc01f4c1173fb6ec151d4df4722fbb.
We could use a list comprehension to create that constant, but it felt a bit wrong for some reason. Feel free to change if you think that will be cleaner.
:+1: nah, looks great to me, thank you.
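For readers skimming the thread, the hardcoding approach could look roughly like the sketch below. It is not the code from the linked commit: the constant name, the function name, and the placeholder report id are all hypothetical, and the real list of ~20 ids is not reproduced here.

```python
# Hypothetical placeholder; the real constant would list the ~20 affected report ids.
LOWERCASE_PDF_REPORT_IDS = frozenset([
    "HYPOTHETICAL-ID-1",
])


def fix_pdf_extension(report_id, url, ids=LOWERCASE_PDF_REPORT_IDS):
    """Rewrite a trailing .PDF to .pdf, but only for explicitly listed reports.

    Everything else is left untouched, since some reports only resolve
    with the uppercase extension.
    """
    if report_id in ids and url.endswith(".PDF"):
        return url[:-len(".PDF")] + ".pdf"
    return url


if __name__ == "__main__":
    print(fix_pdf_extension("HYPOTHETICAL-ID-1", "http://example.com/x.PDF"))
    print(fix_pdf_extension("FINALRPT", "http://www.usda.gov/oig/webdocs/FINALRPT.PDF"))
```

Keying on the report id rather than lowercasing every URL keeps links such as FINALRPT.PDF working as they are.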
Nothing too interesting here.
Audits, Testimonies, and Semiannual reports working.