Closed by spulec 10 years ago
Wow - they have actual XML (not just RSS, but archival XML)! Did you find them linked/discussed anywhere on the site, or did you dig into their source code and see them being fetched?
The first report I checked out was this one:
{
"agency": "sigar",
"agency_name": "Special Inspector General for Afghanistan Reconstruction",
"file_type": "pdf",
"inspector": "sigar",
"inspector_url": "http://www.sigar.mil",
"published_on": "2014-03-18",
"report_id": "SIGAR-14-42-AL",
"title": "SIGAR 14-42-AL",
"type": "report",
"url": "http://www.sigar.mil/Audits/pdf/spotlight/SIGAR-14-42-AL.pdf",
"year": 2014
}
And that links to a 404 for the report PDF. Others seem fine -- is this one a fluke?
I dug into the source code and saw them being fetched.
The 404s appear not to be a fluke. It looks like some of the XML files use different relative URL formats than the others. I've added a fix that is a bit ugly, but it mimics the logic in their JavaScript.
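For anyone hitting the same thing, here is a minimal sketch of the general idea of normalizing mixed relative URL formats against a single base. This is not the actual fix and not the site's JavaScript logic: the resolve_report_url helper, the BASE_URL value, and the example paths are all assumptions made for illustration.

```python
from urllib.parse import urljoin

# Assumption: relative links are resolved against the site root.
BASE_URL = "http://www.sigar.mil/"


def resolve_report_url(link, base_url=BASE_URL):
    """Hypothetical helper: turn a link from the XML into an absolute URL.

    urljoin leaves absolute URLs untouched and resolves both
    root-relative ("/Audits/x.pdf") and bare relative ("Audits/x.pdf")
    paths against base_url.
    """
    return urljoin(base_url, link.strip())


if __name__ == "__main__":
    # Hypothetical link formats, not taken from the real XML files.
    for link in ("/Audits/pdf/example.pdf",
                 "Audits/pdf/example.pdf",
                 "http://www.sigar.mil/Audits/pdf/example.pdf"):
        print(resolve_report_url(link))
```

Funneling every link through one function keeps the format quirks in a single place, which makes it easier to adjust if the site changes its XML again.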
As a side note: for debugging these types of issues, I've often added something like the following to save_report in inspector.py:
res = scraper.request(method='HEAD', url=report['url'])
assert res.status_code == 200
It might be worth adding some sort of super --dry_run option that makes HEAD requests. This would allow people writing scrapers to do a better level of validation without being required to actually download all the reports. Thoughts?
+1 to just having dry-run do the HEAD requests. The main use of dry-run is to verify scrapers, and this would make the verification stronger. That said, I think the assertion should have a message saying what failed and for which URL.
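A rough sketch of what that check could look like, with the failure message included. The names head_status and verify_report_url are hypothetical, and this version uses the standard library's urllib rather than the project's own scraper object, so treat it as an illustration and not as the code that would go into inspector.py.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx responses; the code is what we want.
        return error.code


def verify_report_url(report, fetch_status=head_status):
    """Hypothetical dry-run check: fail loudly if the report URL is dead.

    fetch_status is injectable so the check can be exercised without
    touching the network.
    """
    status = fetch_status(report["url"])
    assert status == 200, (
        "HEAD request for report %s returned %s: %s"
        % (report["report_id"], status, report["url"])
    )
```

With a message like this, a failing dry run names the report ID, the status code, and the URL directly, so there is no need to dig through the scraper output to find which report broke.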
That's an interesting 404: it tried to download an mp3. Not sure where it came from.
Okay, that file is linked here. You will have to click to page 17 or 18. The title is "Acting IG Steven J Trent Discusses SIGAR and Reconstruction Issues on Federal News Radio (.MP3) | (PDF)", so it is unclear whether they were trying to link to the mp3 or the pdf.
OK, I've written to the webmaster about it. Thanks for identifying that.
This was great.