unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Some HTML-format reports are split into multiple web pages #112

Open divergentdave opened 10 years ago

divergentdave commented 10 years ago

Some reports are split across multiple web pages, and so far we only fetch the table of contents. For example, http://oig.federalreserve.gov/reports/board-full-report-20140312a.htm points to an executive summary and five sub-pages. Handling this will require multiple URLs and files per report, or perhaps crawling the pages we need and storing them in a WARC archive. Scrapers may have to provide a list of URLs for such reports, rather than a single URL.
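
As a rough illustration of the "list of URLs" idea, here is a minimal sketch of how a scraper might turn a table-of-contents page into an ordered list of page URLs for one report. It is not code from this project: the names `LinkCollector` and `sub_page_urls` are hypothetical, and the rule that sub-pages live on the same host and in the same directory as the table of contents is an assumption that would need checking per site.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def sub_page_urls(toc_url, toc_html):
    """Return the TOC URL followed by the sub-page URLs it links to.

    Assumption (hypothetical heuristic): a sub-page is any link on the
    same host whose path is under the same directory as the TOC page.
    """
    parser = LinkCollector()
    parser.feed(toc_html)

    base = urlparse(toc_url)
    base_dir = base.path.rsplit("/", 1)[0] + "/"

    urls = [toc_url]
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" anchors.
        absolute, _fragment = urldefrag(urljoin(toc_url, href))
        parsed = urlparse(absolute)
        if parsed.netloc != base.netloc:
            continue  # different host, not part of this report
        if not parsed.path.startswith(base_dir):
            continue  # outside the report's directory
        if absolute not in urls:
            urls.append(absolute)  # keep first occurrence, preserve order
    return urls
```

A scraper could then return this list in place of a single URL, and the downloader could fetch each entry in turn. Site navigation links that happen to sit in the same directory would also match, so a real scraper would probably need a tighter, site-specific filter.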

divergentdave commented 9 years ago

Here's a pathological corner case: a PDF file that links to more PDF files.

http://www.epa.gov/oig/reports/2002/Models.pdf