Open konklone opened 10 years ago
This can build on @divergentdave's work in https://github.com/unitedstates/inspectors-general/commit/1fa8f5d14584e09d138009bf273e2bf21c3ddecb, but that only patches the problem -- the `file_type` field should be `html` for a report whose URL ends in `.aspx`, and the saved file should be `report.html`.
:+1: I agree this is the more correct way to do it.
Right now, validation will fail if the `file_type` wasn't detected (the URL has no file extension) but will not fail if the detected `file_type` is unknown.

Since we only have text processors for HTML and PDF files, the `file_type` should be either auto-detected, or set by a scraper, to `html` or `pdf`. If it's not, it should choke and force the scraper to pick one -- and if we come across a report format that isn't HTML or PDF, then it's time to extend the system to process text from that format.
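
To make the proposal concrete, here is a rough sketch of what stricter validation could look like. It is not the project's actual code: `detect_file_type`, `resolve_file_type`, `SUPPORTED_FILE_TYPES`, and `EXTENSION_ALIASES` are hypothetical names, and treating `.aspx` as `html` by default is an assumption (a scraper could just as well set `file_type` explicitly).

```python
import os
from urllib.parse import urlparse

# Assumption: only these two types have text processors today.
SUPPORTED_FILE_TYPES = {"html", "pdf"}

# Assumption: hypothetical table of extensions that really serve HTML.
EXTENSION_ALIASES = {"htm": "html", "aspx": "html"}


def detect_file_type(url):
    """Guess a file type from the extension in the URL path, or return None."""
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if not extension:
        return None
    return EXTENSION_ALIASES.get(extension, extension)


def resolve_file_type(report):
    """Return 'html' or 'pdf' for a report dict, or raise ValueError.

    A file_type set by the scraper wins over auto-detection. Both a
    missing and an unrecognized type are treated as validation failures.
    """
    file_type = report.get("file_type") or detect_file_type(report["url"])
    if file_type not in SUPPORTED_FILE_TYPES:
        raise ValueError(
            "file_type %r is not supported for %s; the scraper must set "
            "it to 'html' or 'pdf'" % (file_type, report["url"])
        )
    return file_type


def saved_filename(report):
    """Name the saved file after the resolved type, e.g. report.html."""
    return "report.%s" % resolve_file_type(report)
```

With this shape, a URL ending in `.aspx` is saved as `report.html`, while a `.doc` URL or an extensionless URL without an explicit `file_type` raises instead of passing silently.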