unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

FEC Issues #97

Closed: spulec closed this issue 10 years ago

spulec commented 10 years ago

http://www.fec.gov/fecig/fecig.shtml

I've taken a look at this one a few times, but have never been able to come up with a good strategy. The majority of the reports do not list when they were published. I've also looked at things like the Last-Modified header, but it does not seem to be a reliable indicator of the publish date.

Any thoughts?

audiodude commented 10 years ago

I looked at it over a month ago and came up with a similar conclusion: no publish dates.

I think the best strategy is to reach out to the OIG? But another strategy is to pre-parse the PDF and try to extract the date from there, which is what I thought of earlier but was too lazy to implement.

Ideally there would be a mechanism to avoid downloading or parsing the PDF more than once, in both the scraper and the utils pipeline.

Thoughts?

konklone commented 10 years ago

I wrote the following to @audiodude over email back in early June:


So on dates, I think assuming 1st of the month is fine, as long as it's noted in the scraper.

In the example you linked to, you can actually find the date, "November 14, 2012", in there. It's also the Created date in the PDF metadata.

Let's see if we can get there without using the PDF text, though. It sounds like a recipe for verrrry brittle, long-tail sort of regex tweaking, and requires some system contorting besides.

How about some combination of:

If we did PDF metadata, it'd have to be done by the scraper setting a 'trust_pdf' flag for a report to True, which would let a None published_on and year slide through validation. The metadata extractor (which would probably be pdftk, introducing a new dependency) would then get the date -- and if it's not there, should throw an error.
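A minimal sketch of that flow, for illustration only. The 'trust_pdf', published_on and year names come from the paragraph above; the function names, the report dict shape, and the assumption that the metadata arrives as text shaped like `pdftk report.pdf dump_data` output (an `InfoKey: CreationDate` line followed by an `InfoValue: D:YYYYMMDD...` line) are hypothetical and not taken from the project's code.

```python
import re
from datetime import date

# PDF date strings look like D:YYYYMMDDHHmmSS plus an optional timezone;
# month and day may be omitted, in which case we default them to 1.
PDF_DATE = re.compile(r"D:(\d{4})(\d{2})?(\d{2})?")


def parse_pdf_date(value):
    """Turn a PDF date string into a datetime.date, or raise ValueError."""
    match = PDF_DATE.match(value.strip())
    if not match:
        raise ValueError("unrecognized PDF date: %r" % value)
    year = int(match.group(1))
    month = int(match.group(2) or 1)
    day = int(match.group(3) or 1)
    return date(year, month, day)


def creation_date_from_dump(dump_text):
    """Find CreationDate in pdftk-style dump text (assumed format)."""
    lines = dump_text.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.strip() == "InfoKey: CreationDate":
            following = lines[i + 1].strip()
            if following.startswith("InfoValue:"):
                return parse_pdf_date(following[len("InfoValue:"):])
    # Per the comment above: no date in the metadata should be an error.
    raise ValueError("no CreationDate found in PDF metadata")


def validate_report(report):
    """Hypothetical validation: a missing date passes only with trust_pdf."""
    missing = report.get("published_on") is None or report.get("year") is None
    if missing and not report.get("trust_pdf"):
        raise ValueError("published_on and year are required unless trust_pdf is set")
    return True
```

With this shape, the scraper would emit the report with trust_pdf set to True and no date, and a later metadata step would call creation_date_from_dump and fail loudly if nothing is found.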

I'm unsure whether the PDF metadata pipeline is worthwhile when it's so unreliable for other IGs and introduces other complexity.

Do you think an analysis of report URLs and a table of hardcoded IDs is sufficient for the FEC? I'd be happy to reach out to the FEC IG's web team to try to get them to add more dates to their reports, too.
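As a rough sketch of what "URL analysis plus a table of hardcoded IDs" could look like: the table contents, the report ID, the example.com URLs, and the function name below are all hypothetical. Falling back to January 1 when only a year is found in the URL is an assumption made here, in the spirit of the "1st of the month" idea above, and the second return value marks such dates as estimated so the scraper can note it.

```python
import re
from datetime import date

# Hypothetical hand-maintained table: report ID -> known publish date.
HARDCODED_DATES = {
    "example-report-id": date(2012, 11, 14),
}

# A standalone four-digit year (19xx or 20xx) anywhere in the URL.
YEAR_IN_URL = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def guess_published_on(report_id, url):
    """Return (date_or_None, estimated) for a report.

    Hardcoded dates win. Otherwise a year found in the URL yields
    January 1 of that year, flagged as an estimate. If neither source
    helps, the date is None.
    """
    if report_id in HARDCODED_DATES:
        return HARDCODED_DATES[report_id], False
    match = YEAR_IN_URL.search(url)
    if match:
        return date(int(match.group(0)), 1, 1), True
    return None, False
```

A report that comes back with None would still need a hardcoded entry or some other source before it could pass validation.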

spulec commented 10 years ago

Closing in favor of resolving #101

spulec commented 10 years ago

Thanks!