unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Add SIGAR #102

Closed spulec closed 10 years ago

spulec commented 10 years ago

This was great.

konklone commented 10 years ago

Wow - they have actual XML (not just RSS, but archival XML)! Did you find them linked/discussed anywhere on the site, or did you dig into their source code and see them being fetched?

konklone commented 10 years ago

The first report I checked out was this one:

{
  "agency": "sigar",
  "agency_name": "Special Inspector General for Afghanistan Reconstruction",
  "file_type": "pdf",
  "inspector": "sigar",
  "inspector_url": "http://www.sigar.mil",
  "published_on": "2014-03-18",
  "report_id": "SIGAR-14-42-AL",
  "title": "SIGAR 14-42-AL",
  "type": "report",
  "url": "http://www.sigar.mil/Audits/pdf/spotlight/SIGAR-14-42-AL.pdf",
  "year": 2014
}

And that links to a 404 for the report PDF. Others seem fine -- is this one a fluke?

spulec commented 10 years ago

I dug into the source code and saw them being fetched.

The 404s do not appear to be a fluke. It looks like some of the XML files use different relative URL formats than the others. I've added a fix that is a bit ugly, but it mimics the logic in their JavaScript.
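For readers unfamiliar with the problem, here is a minimal sketch of how differently shaped relative links can be resolved against the page they came from. It is not the fix that was committed for this scraper (that one mimics the site's own JavaScript); the function name `resolve_report_url`, the `example.gov` host, and the sample paths are all hypothetical and only illustrate the general technique using the standard library's `urljoin`.

```python
from urllib.parse import urljoin


def resolve_report_url(page_url, href):
    """Resolve a link found on page_url into an absolute URL.

    Hypothetical helper for illustration only; the real scraper's logic
    follows the site's JavaScript and may differ.
    """
    # urljoin handles root-relative ("/a/b.pdf"), document-relative
    # ("pdf/b.pdf"), and parent-relative ("../a/b.pdf") forms uniformly.
    return urljoin(page_url, href.strip())


# Hypothetical listing page and three link styles pointing at the same file.
page = "http://example.gov/audits/index.html"
for href in ("/audits/pdf/report.pdf", "pdf/report.pdf", "../audits/pdf/report.pdf"):
    print(resolve_report_url(page, href))
```

Resolving every link against the URL of the file it was found in, rather than against a fixed prefix, is usually what keeps mixed link styles from producing 404s.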

As a side note: for debugging these types of issues, I've often added something like the following to save_report in inspector.py:

res = scraper.request(method='HEAD', url=report['url'])
assert res.status_code == 200

It might be worth adding some sort of super --dry_run option that makes HEAD requests. This would let people writing scrapers validate their output more thoroughly without having to actually download all the reports. Thoughts?

audiodude commented 10 years ago

+1 to just having dry-run do the HEAD requests. The main use of dry-run is to verify scrapers, and this would make the verification stronger. That said, I think the assertion should include a message saying what failed and for which URL.
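The two suggestions above (a HEAD request per report, plus a failure message that names the URL) could be combined roughly as follows. This is a sketch under stated assumptions, not the project's implementation: `check_report_url` and `stdlib_head` are hypothetical names, and the standard library stands in for the `scraper.request` call quoted earlier so that the example is self-contained.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def stdlib_head(url: str) -> int:
    """Send a HEAD request and return the HTTP status code (hypothetical helper)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=30) as res:
            return res.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as exceptions; report their status code.
        return exc.code


def check_report_url(report: dict, head: Callable[[str], int] = stdlib_head) -> Optional[str]:
    """Return None if the report URL answers 200 to HEAD, else a descriptive message."""
    url = report["url"]
    status = head(url)
    if status == 200:
        return None
    return "HEAD %s returned %s for report %s" % (url, status, report.get("report_id"))
```

Passing the `head` callable in as a parameter keeps the check testable without network access; a dry-run mode could collect the returned messages and print them all at the end instead of stopping at the first failure.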

konklone commented 10 years ago

Very good call. Filed #108 for that.

But I think this one's ready to go! Thanks for tackling it, @spulec.


konklone commented 10 years ago

That's an interesting 404: it tried to download an MP3. Not sure where it came from.

spulec commented 10 years ago

Okay, that file is linked here. You will have to click to page 17 or 18. The title is "Acting IG Steven J Trent Discusses SIGAR and Reconstruction Issues on Federal News Radio (.MP3) | (PDF)", so it is unclear whether they were trying to link to the MP3 or the PDF.

konklone commented 10 years ago

OK, I've written to the webmaster about it. Thanks for identifying that.