spulec closed 10 years ago
Looks great, @spulec, thanks again!
@konklone Now I'm going to expect images for every scraper :) I hope you have one prepared for some of the more interesting agencies like "Corporation for National And Community Service".
NASA OIG wrote back about a couple of 404 reports:
Thank you for notifying us about the broken links. But the reports that you are looking for do exist on the OIG’s Website.
“These two reports, I found links to, but I can't trace back where they were found:
http://oig.nasa.gov/audits/reports/memos/NSRS.pdf
http://oig.nasa.gov/audits/reports/memos/Monel-1.pdf”
Correct Link: Both of these reports are under FY03
http://oig.nasa.gov/memos/NSRS.pdf
http://oig.nasa.gov/memos/Monel-1.pdf
http://oig.nasa.gov/audits/reports/FY98/executive_summaries/ig-98-038es.htm
If you click on the “Report” link under the title it will pull the report up for you
http://oig.nasa.gov/audits/reports/FY98/pdfs/ig-98-038.pdf
In case anyone is interested in what happened:
(Pdb) result.select("td")[3].text
' ../../../memos/NSRS.pdf'
(Pdb) landing_url
'http://oig.nasa.gov/audits/reports/FY03/tableData.html'
(Pdb) urljoin(landing_url, result.select("td")[3].text)
'http://oig.nasa.gov/audits/reports/memos/NSRS.pdf'
(Pdb) urljoin(landing_url, result.select("td")[3].text.strip())
'http://oig.nasa.gov/memos/NSRS.pdf'
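If you pass urljoin a relative URL as its second argument and that URL starts with a space, urljoin behaves somewhat unexpectedly: the leading space seems to keep the first .. from being treated as a parent-directory step, so the result climbs one directory too few. Stripping the text before joining fixes it.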
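For anyone who wants to guard against this in other scrapers, here is a minimal sketch of the fix, assuming Python 3's `urllib.parse`. The helper name `join_report_url` is made up for illustration and is not something in this repo. Newer Python releases may already strip leading whitespace on their own, so the unstripped result can vary by version, but stripping explicitly is harmless either way.

```python
from urllib.parse import urljoin


def join_report_url(landing_url, href):
    """Resolve a scraped href against its landing page.

    Hypothetical helper: strips surrounding whitespace first, so a
    leading space cannot interfere with resolving the ".." segments.
    """
    return urljoin(landing_url, href.strip())


landing_url = "http://oig.nasa.gov/audits/reports/FY03/tableData.html"

# The cell text from the pdb session above, leading space included.
print(join_report_url(landing_url, " ../../../memos/NSRS.pdf"))
# prints http://oig.nasa.gov/memos/NSRS.pdf
```

Routing every scraped link through one small helper like this keeps the `.strip()` from being forgotten at individual call sites.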
This seems to remove all of the 404s. That's what I get for second-guessing NASA.
This one was pretty easy.
They list reports back to 1996 and 1997, but they don't give links to the actual reports. For now, I'm just skipping those, but they might be useful if someone wants to start issuing requests.