unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Validation should fail on unrecognized file type #109

Open konklone opened 10 years ago

konklone commented 10 years ago

Right now, validation fails if no file_type was detected (the URL has no file extension), but it does not fail if the detected file_type is one we don't recognize.

Since we only have text processors for HTML and PDF files, the file_type should be either auto-detected, or set by a scraper, to html or pdf. If it's not, it should choke and force the scraper to pick one -- and if we come across a report format that isn't HTML or PDF, then it's time to extend the system to process text from that format.
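
A minimal sketch of the stricter check described above, not the project's actual validation code. The names `SUPPORTED_FILE_TYPES` and `validate_file_type`, and the assumption that a report is a plain dict with an optional `file_type` key, are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: these names are assumptions, not the repository's real API.
# Only formats that have a text processor are accepted.
SUPPORTED_FILE_TYPES = {"html", "pdf"}


def validate_file_type(report):
    """Return an error message if the report's file_type is unusable, else None."""
    file_type = report.get("file_type")

    # Case 1 (already a failure today): nothing was detected or set.
    if not file_type:
        return "file_type was not detected and was not set by the scraper"

    # Case 2 (the change proposed here): something was detected,
    # but it is not a format we can extract text from.
    if file_type not in SUPPORTED_FILE_TYPES:
        return "unrecognized file_type %r, expected one of %s" % (
            file_type,
            sorted(SUPPORTED_FILE_TYPES),
        )

    return None
```

With this shape, a scraper that yields `{"file_type": "aspx"}` gets an error back and has to pick html or pdf explicitly, while supporting a new format means adding it to the set alongside a text processor for it.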

konklone commented 10 years ago

This can build on @divergentdave's work in https://github.com/unitedstates/inspectors-general/commit/1fa8f5d14584e09d138009bf273e2bf21c3ddecb, but that only patches the problem -- for a report whose URL ends in .aspx, the file_type field should be html, and the saved file should be report.html.
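
One way this could look, as a rough sketch only: map URL extensions onto the two supported types, so .aspx resolves to html and the saved filename follows from the file_type. The mapping table and the helper names `detect_file_type` and `saved_filename` are assumptions for illustration and are not taken from the linked commit.

```python
import os
from urllib.parse import urlparse

# Hypothetical mapping: which URL extensions count as which supported type.
EXTENSION_TO_FILE_TYPE = {
    ".pdf": "pdf",
    ".html": "html",
    ".htm": "html",
    ".aspx": "html",  # server-rendered pages are treated as HTML
}


def detect_file_type(url):
    """Guess file_type from the URL path's extension; None if it can't be determined."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = os.path.splitext(path)[1].lower()
    return EXTENSION_TO_FILE_TYPE.get(extension)


def saved_filename(file_type):
    """Name the saved file after the normalized file_type, not the URL's extension."""
    return "report.%s" % file_type
```

Under these assumptions, `detect_file_type("https://example.gov/reports/audit.aspx?id=3")` gives `"html"` and `saved_filename("html")` gives `"report.html"`. A URL with no extension or an unmapped one gives `None`, which leaves it to the scraper to set the type or to validation to fail.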

audiodude commented 10 years ago

:+1: I agree this is the more correct way to do it.