unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Add NASA #98

Closed spulec closed 10 years ago

spulec commented 10 years ago

This one was pretty easy.

They list reports back to 1996 and 1997, but they don't give links to the actual reports. For now, I'm just skipping those, but they might be useful if someone wants to start issuing requests.

konklone commented 10 years ago

Looks great, @spulec, thanks again!

[Image: outer-space_00399584]

spulec commented 10 years ago

@konklone Now I'm going to expect images for every scraper :) I hope you have one prepared for some of the more interesting agencies like "Corporation for National And Community Service".

konklone commented 10 years ago

NASA OIG wrote back about a couple of 404 reports:

Thank you for notifying us about the broken links.  But the reports that you are looking for do exist on the OIG’s Website.

“These two reports, I found links to, but I can't trace back where they were found:
http://oig.nasa.gov/audits/reports/memos/NSRS.pdf
http://oig.nasa.gov/audits/reports/memos/Monel-1.pdf “

Correct Link: Both of these reports are under FY03
http://oig.nasa.gov/memos/NSRS.pdf 
http://oig.nasa.gov/memos/Monel-1.pdf 

http://oig.nasa.gov/audits/reports/FY98/executive_summaries/ig-98-038es.htm
If you click on the “Report” link under the title it will pull the report up for you
http://oig.nasa.gov/audits/reports/FY98/pdfs/ig-98-038.pdf

spulec commented 10 years ago

In case anyone is interested in what happened:

(Pdb) result.select("td")[3].text
' ../../../memos/NSRS.pdf'
(Pdb) landing_url
'http://oig.nasa.gov/audits/reports/FY03/tableData.html'
(Pdb) urljoin(landing_url, result.select("td")[3].text)
'http://oig.nasa.gov/audits/reports/memos/NSRS.pdf'
(Pdb) urljoin(landing_url, result.select("td")[3].text.strip())
'http://oig.nasa.gov/memos/NSRS.pdf'

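If the relative URL you pass as the second argument to urljoin starts with a space, urljoin doesn't resolve the leading ".." the way you'd expect. Here the joined URL ended up under /audits/reports/ instead of at the site root, and stripping the whitespace first gives the correct link.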

This seems to remove all of the 404s. That's what I get for second-guessing NASA.
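
For anyone hitting the same thing in another scraper, here is a minimal sketch of the fix. It assumes Python 3's urllib.parse, and resolve_report_url is a hypothetical helper name for illustration, not necessarily what the NASA scraper uses. How the unstripped href resolves may vary by Python version, so the sketch only shows the stripped case.

```python
from urllib.parse import urljoin


def resolve_report_url(landing_url, href):
    """Join a scraped href onto its landing page, ignoring stray whitespace.

    Hypothetical helper for illustration only.
    """
    # Strip first: whitespace around the href can change how "../"
    # segments are resolved against the landing page URL.
    return urljoin(landing_url, href.strip())


landing_url = "http://oig.nasa.gov/audits/reports/FY03/tableData.html"
raw_href = " ../../../memos/NSRS.pdf"  # cell text as scraped, leading space included

print(resolve_report_url(landing_url, raw_href))
# http://oig.nasa.gov/memos/NSRS.pdf
```

Stripping at the point where the URL is joined, rather than wherever the cell text is read, keeps the cleanup in one place for every link the scraper resolves.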