Good lord, HHS really did not make this easy.
What if we were to focus this primarily on recent data and data going forward? Do recent reports suffer the same inability to infer accurate dates as those from the '90s?
Yeah, it's pretty messy. I probably should have opened this before going as deep as I went.
It seems that relatively recent reports still suffer. See "Adverse Events in Hospitals: Medicare's Responses to Alleged Serious Events" linked from here. That report was published in October 2011. That was just the first page, but I can probably find more recent ones.
Sadly, the best option here is probably to use the PDF metadata tags. But the current pipeline really has no good way to handle that, and being able to validate a report without downloading the full PDF is a very important method for debugging and validating scrapers.
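Pulling a date out of the PDF metadata would look something like this sketch (pypdf here is an assumption, not something our pipeline uses), and it still means downloading every file first, which is exactly the problem:

```python
# Sketch: read a creation date from a downloaded PDF's metadata with pypdf.
from pypdf import PdfReader

def pdf_creation_date(path):
    info = PdfReader(path).metadata  # may be None if no info dictionary
    return info.creation_date if info else None  # datetime or None

print(pdf_creation_date("oei-01-08-00590.pdf"))
```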
Wild idea: what about a `HEAD` request that looks for `Last-Modified`? Dangerous, but if HHS is consistent about it, we could go with it.
```
$ curl --head https://oig.hhs.gov/oei/reports/oei-01-08-00590.pdf
HTTP/1.1 200 OK
Content-Length: 1068716
Content-Type: application/pdf
Last-Modified: Fri, 28 Oct 2011 16:19:43 GMT
Accept-Ranges: bytes
ETag: "958f16678d95cc1:29c9"
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Date: Wed, 02 Jul 2014 02:00:03 GMT
```
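In scraper terms, that check is only a couple of lines (a sketch using requests; our utils layer may wrap HTTP differently):

```python
# Sketch: HEAD request, then parse the Last-Modified header into a date.
import requests
from email.utils import parsedate_to_datetime

url = "https://oig.hhs.gov/oei/reports/oei-01-08-00590.pdf"
last_modified = requests.head(url).headers.get("Last-Modified")
if last_modified:
    # For the report above, this prints 2011-10-28.
    print(parsedate_to_datetime(last_modified).date())
```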
Hmm, that's definitely an interesting idea. I tested about a dozen recent reports and it seemed accurate for all of them (within a few days). It looks like a lot of the old reports were uploaded sometime in 2002, so we would either need to exclude those or use the current hack system for them.
As you mentioned, this could obviously be dangerous if a report is updated or replaced with a different copy for some reason. I actually feel pretty good about it, though, given my small test. The worst-case scenario I can think of is a bunch of old reports suddenly reappearing with new published dates. This is obviously suboptimal, but I think it would be spotted fairly quickly.
I'd be fine with this method for reports from 2002 onwards, and the current hack system for older ones (at least the hack will be bounded). That worst case, where older reports suddenly appear again, is not really that bad.
And this provides some good fodder to bring up with HHS. If you wouldn't mind adding some comments about all this to the top of the scraper, I'll contact HHS to bring those comments, and this discussion, to their attention. I don't have high hopes, but you never know; adding dates next to reports is not the heaviest lift in the world.
With the last commit, we use the Last-Modified header for reports published after 2002, and fall back to the report ID hack for reports before that.
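The gist of it is something like this (a simplified sketch; the helper names and the ID pattern are stand-ins, not the actual code from the commit):

```python
# Sketch of the date-selection logic: trust Last-Modified for recent
# uploads, fall back to the report-ID hack otherwise.
import re
import requests
from email.utils import parsedate_to_datetime

def date_from_report_id(report_id):
    # Stand-in for the existing hack: IDs like "oei-01-08-00590" seem to
    # embed a two-digit year ("08" -> 2008).
    match = re.match(r"[a-z]+-\d{2}-(\d{2})-", report_id)
    if not match:
        return None
    yy = int(match.group(1))
    return 1900 + yy if yy >= 90 else 2000 + yy

def published_year(report_url, report_id):
    header = requests.head(report_url).headers.get("Last-Modified")
    if header:
        modified = parsedate_to_datetime(header)
        # Old reports were bulk-uploaded around 2002, so only trust the
        # header for years after that.
        if modified.year > 2002:
            return modified.year
    return date_from_report_id(report_id)
```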
All right, this seems worth testing for use. One big downside is that this results in a lot of extra requests to HHS during the script's run, even with `--dry-run`. But whatever, it's HHS' fault.
I just did some testing and this looks great to me!
I just commented out the `pdb` calls left in there, and am going to try downloading HHS' work to my servers. Do please go in and tinker further (on master, since I'm merging) to make the script more stable, if some of that debugging work wasn't actually finished.
I did some cleanup on master and it looks a bit better now.
I'm planning on taking a look at the archives this week and evaluating how much uglier I would need to make the script to add them.
I added the archives for a few topics with https://github.com/unitedstates/inspectors-general/commit/c2e748d20db4c0c7d9a3c2c50f55e8311dfd365c, but the rest look like they are going to be a pain. I've added appropriate comments for the ones that I didn't add.
How easy would it be to grab the date for a report from the listing, to decide whether it's worth scraping its landing page?
Right now, getting even just the current year (the default) takes a long time with this scraper, because it needs to fetch the landing page for every single report before deciding whether to skip it as out of range. So it actually requests HHS' entire archive, no matter what the filters are.
I see the dates on the listing pages, but I know there's a lot of entropy overall with this scraper, so I'm not sure whether it's easy or not.
Those values aren't always there so it won't work all the time, but I added a quick check with https://github.com/unitedstates/inspectors-general/commit/3969e9b899b34470d1c9eaa8e8e18402cd9182c3 that should speed things up a lot.
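The idea, roughly (a sketch; the `.date` selector and the date format are assumptions about the markup, not the actual commit):

```python
# Sketch: read the date shown on the listing page, when present, and skip
# the landing-page fetch for reports that are clearly out of range.
from datetime import datetime

def listing_date(result):
    # `result` is a parsed node for one row of the listing page.
    cell = result.select_one(".date")
    if cell is None:
        return None
    try:
        return datetime.strptime(cell.text.strip(), "%m-%d-%Y")
    except ValueError:
        return None

def should_fetch_landing_page(result, year_range):
    # Only skip a report when the listing positively shows it out of range.
    date = listing_date(result)
    return date is None or date.year in year_range
```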
This scraper still needs some work, but I want to get the discussion started per #59. There are still a lot of `pdb` calls and ugly code that I'm using for debugging.
All of the uncommented topics seem to be working.
No work has been done on the archive pages yet.
Lines 305-311 are pretty hacky. Some pages just don't say when they were published; see "Implementation of the Core Medical Services Requirement in the Ryan White Program" on http://oig.hhs.gov/reports-and-publications/oei/r.asp. I did notice that the report IDs seem to specify month/year combinations that are about 1-2 years before the report is published. My guess is that this date is assigned when the report ID is requested. I'm using it as a fallback for some reports. This is obviously suboptimal, but the alternative is hardcoding a couple hundred reports. Maybe add an additional field noting that this is not the real published date?
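Something like this, maybe (the field names are hypothetical, just to illustrate the suggestion):

```python
# Sketch of the suggested extra field: prefer the date scraped from the
# page, and flag the record whenever we had to fall back to the date
# implied by the report ID.
def report_record(report_id, scraped_date, id_derived_date):
    if scraped_date:
        return {"report_id": report_id,
                "published_on": scraped_date,
                "published_on_estimated": False}
    return {"report_id": report_id,
            "published_on": id_derived_date,  # from the ID's month/year
            "published_on_estimated": True}
```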