Closed by spulec 9 years ago
This is running for me! :+1:
@spulec, I don't have the entire corpus of scraped reports for State, unfortunately. @konklone, any chance you have them on oversight.io? The thorough solution is probably to download and analyze the corpus from the Internet Archive, which is now possible! I'm glad to do that, but it's a big (34 GB!) download!
@konklone already sent me just the State archives the other day. I'm doing an analysis and writing an email to State OIG to notify them of the couple of reports that are now missing. I'm hoping to wrap it up in the next couple of days. Feel free to send me an email if you want the State archives.
Oh, awesome! If there's anything I can do to help, let me know.
Most of the missing ones were Congressional Testimony that I found available at http://oig.state.gov/testimony-news. I have updated the script to pull those too.
There were three additional reports missing:
1.) 228989 ("Inspection of the Office of Cuba Broadcasting"): this report seems to have been moved to report ID 228991
2.) 162347 ("Audit of Department of State Controls Over Bureau of Diplomatic Security Domestic Firearms and Optics (AUD/SI-11-25)"): this report is missing
3.) 211870 ("Audit of Department of State Compliance With Physical/Procedural Security Standards at Selected High Threat Level Posts (AUD-SI-13-32)"): this report is missing
I have sent an email to State OIG about the two missing reports.
State OIG responded that they are still in the migration process but the reports should be available within the next couple of days. I will keep an eye on it.
Thanks for doing the legwork, @spulec.
I'd like to merge this so that we can get the new reports from the site, but still keep an issue open for dealing with the missing old reports. I have my old cache of state reports from the old scraper, and I'll back them up in S3, so we can always use them for analysis/restoration later.
I followed up with State OIG to see if they have a timeframe for the rest of the migration.
See #177
The report IDs are the same as before. It appears that some more reports have been added and some old ones have been removed. Based on some old data I have, the old State scraper would get ~500 reports, while the new system gets over 1,000. The oldest report in the old system was from 1994, while the oldest in the new system is from 2004. If someone with a more complete dataset could run the new scraper and do a more thorough analysis, that would be great.
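For anyone picking up that analysis, here is a rough sketch of the comparison in Python. It assumes you have already reduced each scrape to a mapping of report ID to publication year; the function name `compare_report_sets` and the toy IDs below are made up for illustration and are not part of the scraper.

```python
from collections import Counter


def compare_report_sets(old_reports, new_reports):
    """Compare two scrapes, each given as a dict of report_id -> year.

    Hypothetical helper for illustration; not part of the scraper itself.
    """
    old_ids = set(old_reports)
    new_ids = set(new_reports)

    # Reports the old scraper found that the new one no longer returns.
    missing = sorted(old_ids - new_ids)
    # Reports that only show up in the new scrape.
    added = sorted(new_ids - old_ids)

    # Group the missing reports by year to see whether the gaps cluster.
    missing_by_year = Counter(old_reports[report_id] for report_id in missing)

    return {
        "missing": missing,
        "added": added,
        "missing_by_year": dict(missing_by_year),
        "oldest_old": min(old_reports.values(), default=None),
        "oldest_new": min(new_reports.values(), default=None),
    }


# Toy data only: these IDs and years are placeholders, not real reports.
old = {"r1": 1994, "r2": 2005, "r3": 2010}
new = {"r2": 2005, "r3": 2010, "r4": 2014}

summary = compare_report_sets(old, new)
print(summary["missing"], summary["added"])
```

Grouping the missing IDs by year should make it obvious whether the gap is mostly the pre-2004 reports or something more scattered.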