spulec closed this issue 10 years ago
Barring a refactor to support downloading PDFs before validating metadata, the first 3 definitely come before the last 2. But the order of those first 3 seems completely determined by the quality of the IG's data. For some IGs, Last-Modified is wrong too often to be acceptable, and some IGs have too many reports that would need hard-coding.
So I think there are two classes of rules, with instructions to analyze each IG to decide which rule in the first class should be tried first.
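To make the "pick an order per IG" idea concrete, here is a rough sketch of a fallback chain. It is only an illustration: the report dict shape, the `HARD_CODED_DATES` table, and all function names are hypothetical, not anything that exists in the scrapers today. It covers just two of the approaches mentioned here (hard-coded dates and the Last-Modified header).

```python
from datetime import date
from email.utils import parsedate_to_datetime
from typing import Callable, List, Optional

# Hypothetical report shape: a dict with "report_id" and optional "headers".
Report = dict
Strategy = Callable[[Report], Optional[date]]

# Hypothetical table of dates looked up by hand for reports with no usable date.
HARD_CODED_DATES = {"example-report-1": date(2014, 5, 1)}


def from_hard_coded(report: Report) -> Optional[date]:
    """Return a manually recorded date, if one exists for this report."""
    return HARD_CODED_DATES.get(report.get("report_id"))


def from_last_modified(report: Report) -> Optional[date]:
    """Parse the HTTP Last-Modified header, returning None if absent or invalid."""
    value = (report.get("headers") or {}).get("Last-Modified")
    if not value:
        return None
    try:
        return parsedate_to_datetime(value).date()
    except (TypeError, ValueError):
        # Unparseable header values raise TypeError or ValueError,
        # depending on the Python version.
        return None


def resolve_published_on(report: Report, strategies: List[Strategy]) -> Optional[date]:
    """Try each strategy in order and return the first date found."""
    for strategy in strategies:
        found = strategy(report)
        if found is not None:
            return found
    return None
```

A scraper for an IG with trustworthy headers could pass `[from_last_modified, from_hard_coded]`, while one with unreliable headers could pass `[from_hard_coded]` alone, so the ordering decision lives in one place per IG.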
Okay, so generally:
I think that actually covers all of the agencies I've seen so far, so let's just ignore the last two for the time being. I'll make a small modification to the README and then close this for now.
Added with 0f2f40b95d0737f8240eddb8bdc1ee6d1a9e8a17
:+1:
As we get down to some of the last agencies, more and more seem to have incomplete information around published dates. It would be nice to come up with an "order of operations" for ways to try to deal with these websites.
Some possible ideas in no particular order:
Am I missing any? Thoughts on an order?
If we can come to some agreement, I'm happy to write up some additional docs.