Closed spulec closed 10 years ago
I meant to run this with --dry_run but I forgot, and I'm glad I did. When run without it, I get the following:
$ python inspectors/eeoc.py
[report][2014-03-31][oig-3-2014]
report: eeoc/2014/oig-3-2014/report.cfm
Unknown file type, don't know how to extract metadata!
Unknown file type, don't know how to extract text!
text: None
data: eeoc/2014/oig-3-2014/report.json
The problem seems pretty straightforward: the inspector utils don't know how to deal with .cfm files. We need to somehow let the utils know that it's an HTML file; the URL just has the wrong extension.
@audiodude, fixed the HTML heuristic in 1fa8f5d
Thanks @divergentdave, that's the right call. The scrapers should "manually" set the file_type field when it can't be auto-detected correctly. Very nice catch, @audiodude.
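For anyone following along, here is a rough sketch of what "manually" setting the field could look like. Only the file_type field name comes from this thread; the helper names (build_report, guess_file_type, effective_file_type), the other dictionary keys, and the example.com URL are hypothetical, and this is not the project's actual utils code.

```python
import os
from urllib.parse import urlparse


def guess_file_type(url):
    """Hypothetical helper: guess a file type from the URL's extension."""
    path = urlparse(url).path
    return os.path.splitext(path)[1].lstrip(".").lower()


def build_report(report_id, url, file_type=None):
    """Hypothetical helper: build a report dict, optionally overriding file_type."""
    report = {"report_id": report_id, "url": url}
    if file_type:
        # The scraper knows better than the URL extension, so it says so.
        report["file_type"] = file_type
    return report


def effective_file_type(report):
    """Prefer the scraper-supplied file_type, else fall back to the extension."""
    return report.get("file_type") or guess_file_type(report["url"])


# Assumed URL, for illustration only.
url = "https://example.com/oig/report.cfm"
auto = build_report("oig-3-2014", url)
manual = build_report("oig-3-2014", url, file_type="html")
print(effective_file_type(auto))    # extension-based guess
print(effective_file_type(manual))  # scraper override wins
```

The point of the override is that the scraper is the only piece that knows the page is really HTML, so it makes sense for it to carry that knowledge in the report record.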
In fact, it seems to me the validator should choke when it auto-detects an unknown file type, like .cfm, and the scraper should be forced to supply a file_type. I opened up #109 for that.
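A minimal sketch of that validation idea, assuming a small allowlist of known types. The KNOWN_FILE_TYPES set, the validate_file_type name, and the example.com URLs are all made up for illustration; what #109 ends up doing may look quite different.

```python
import os
from urllib.parse import urlparse

# Assumed allowlist; the real set of supported types may differ.
KNOWN_FILE_TYPES = {"pdf", "html", "htm", "doc", "docx"}


def validate_file_type(report):
    """Return the report's file type, or raise ValueError if it is unknown.

    An explicit file_type on the report takes precedence over the URL
    extension, so a scraper can fix cases like report.cfm by setting it.
    """
    explicit = report.get("file_type")
    if explicit:
        if explicit not in KNOWN_FILE_TYPES:
            raise ValueError("unsupported file_type: %r" % explicit)
        return explicit

    path = urlparse(report["url"]).path
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension not in KNOWN_FILE_TYPES:
        raise ValueError(
            "unknown file type %r for %s; the scraper must set file_type"
            % (extension, report["url"])
        )
    return extension


# Assumed URL, for illustration only.
try:
    validate_file_type({"url": "https://example.com/oig/report.cfm"})
except ValueError as exc:
    print("validation failed:", exc)
```

Failing loudly here would have surfaced the .cfm problem at scrape time instead of leaving "text: None" in the output.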
Annoyingly, a whole bunch of the old semiannual reports are 404s:
and a bunch more. I'll report that to the EEOC OIG.
Anyway, this works great. I'm not letting #109 hold this up, as having report.cfm vs report.html on disk won't actually affect downstream clients, who only use the extracted report.txt version, and text extraction works now after @divergentdave's fix.
Thanks @spulec for kicking this off, and to @audiodude and @divergentdave for helping out!
I forgot that the EEOC is the agency with the outstanding badge I found in May for my Sunlight blog post. It deserves a posting of its own:
Add the OIG for the U.S. Equal Employment Opportunity Commission.