unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Dupe report IDs in 2014 and 2015 #197

konklone closed this issue 9 years ago

konklone commented 9 years ago

After re-archiving 2014 and 2015, the HHS scraper reports these duplicates:

[hhs] Duplicate report_id: 61200041 has been used twice this session
[hhs] Duplicate report_id: 21102017 has been used twice this session
[hhs] Duplicate report_id: 21202013 has been used twice this session
[hhs] Duplicate report_id: 21101039 has been used twice this session
[hhs] Duplicate report_id: 91201001 has been used twice this session
[hhs] Duplicate report_id: 41201016 has been used twice this session
[hhs] Duplicate report_id: oei-09-11-00380 has been used twice this session

There are a few dupes in other scrapers:

[arc] Duplicate report_id: report14-21-sc-17044 has been used twice this session
[energy] Duplicate report_id: DOE-IG-0919 is saved under 2015 and 2014
[pbgc] Duplicate report_id: - has been used twice this session
[pbgc] Duplicate report_id: - has been used twice this session
[usps] Duplicate report_id: rarc-wp-14-014 has been used twice this session

The pbgc scraper shouldn't use "-" as an ID. I think that merits an update to the validation check, maybe requiring that at least one alphanumeric character be present.
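
A minimal sketch of that rule, assuming the check only needs to see a string with at least one ASCII letter or digit. The function name `is_plausible_report_id` is hypothetical and is not the project's existing validation code:

```python
import re

# Hypothetical helper: the name and the exact rule are assumptions for
# illustration, not the repository's actual validation check.
_HAS_ALNUM = re.compile(r"[A-Za-z0-9]")


def is_plausible_report_id(report_id):
    """Return True if report_id is a string with at least one ASCII letter or digit."""
    return isinstance(report_id, str) and bool(_HAS_ALNUM.search(report_id))
```

Under this rule an ID of "-" would be rejected, while IDs like "oei-09-11-00380" or "61200041" would still pass.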

divergentdave commented 9 years ago

I just opened a PR to track work on this, #213.

When I re-ran the scrapers here, I didn't get anything for DOE, so that may just be a stale report on disk causing problems. It could be that the OIG website changed the date, or that we changed how we parse it.
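
One way to look for stale copies like that is to scan a scraper's output directory for report IDs that appear under more than one year. This is only a sketch: it assumes reports are stored as `<inspector_dir>/<year>/<report_id>/`, and the function name `find_ids_in_multiple_years` is hypothetical.

```python
import os
from collections import defaultdict


def find_ids_in_multiple_years(inspector_dir):
    """Return {report_id: [years]} for IDs found under two or more year folders.

    Assumes a layout of <inspector_dir>/<year>/<report_id>/ (an assumption
    made for this sketch, not confirmed by the thread).
    """
    years_by_id = defaultdict(set)
    for year in os.listdir(inspector_dir):
        year_path = os.path.join(inspector_dir, year)
        # Skip anything that is not a numeric year directory.
        if not (year.isdigit() and os.path.isdir(year_path)):
            continue
        for report_id in os.listdir(year_path):
            if os.path.isdir(os.path.join(year_path, report_id)):
                years_by_id[report_id].add(year)
    return {rid: sorted(years) for rid, years in years_by_id.items() if len(years) > 1}
```

Any ID it returns would be a candidate for deleting the older copy and re-running the scraper to see which year the current parse assigns.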