FEC Issues - Githubissues

spulec commented 10 years ago

http://www.fec.gov/fecig/fecig.shtml

I've taken a look at this one a few times, but have never been able to come up with a good strategy. The majority of the reports do not list when they were published. I've also looked at things like the Last-Modified header, but it does not seem to be a reliable indicator of the publication date.

Any thoughts?

audiodude commented 10 years ago

I looked at it over a month ago and came up with a similar conclusion: no publish dates.

I think the best strategy is to reach out to the OIG. Another option is to pre-parse the PDF and try to extract the date from there, which is what I thought of earlier but was too lazy to implement.

Hopefully you could come up with a mechanism to avoid having to download or parse the PDF multiple times, in the scraper and in the utils pipeline.
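One way to avoid repeated downloads might be a small on-disk cache keyed by the report URL, so that the scraper and the later pipeline steps share a single copy of each PDF. This is only a sketch: `cached_fetch`, the `fetch` callable, and the cache layout are hypothetical names, not anything that exists in the project.

```python
import hashlib
import os


def cached_fetch(url, fetch, cache_dir):
    """Return a local path for `url`, calling `fetch(url)` only on a cache miss.

    `fetch` is any callable that takes a URL and returns the body as bytes
    (hypothetical; the real downloader would be plugged in here).
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL so the cache file name is filesystem-safe and stable.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".pdf"
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        data = fetch(url)
        with open(path, "wb") as handle:
            handle.write(data)
    return path
```

Any later step that needs the PDF (date extraction, text extraction) could call the same function and would get the cached file instead of triggering a second download.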

Thoughts?

konklone commented 10 years ago

I wrote the following to @audiodude over email back in early June:

So on dates, I think assuming 1st of the month is fine, as long as it's noted in the scraper.

In the example you linked to, you can actually find the date, "November 14, 2012", in there. It's also the Created date in the PDF metadata.

Let's see if we can get there without using the PDF text, though. It sounds like a recipe for verrrry brittle, long-tail sort of regex tweaking, and requires some system contorting besides.

How about some combination of:

- Using the URL and the type of the report to infer what's going on. So a FY12 audit report can be assumed to be November of that year, and Nov 1 is a fine fake date.
- Hardcoding exceptions: This Westlaw report is July 2001 (and really doesn't mention a specific date).
- Maybe: PDF metadata as the final fallback for this one. Anecdotally, haven't found a mistake yet. You can even get a specific date for the Westlaw one that way.
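
A minimal sketch of the first two ideas, in Python: check a table of hardcoded exceptions first, then infer a date from a fiscal-year marker in the URL. The `fy12`-style URL pattern, the `westlaw-report` ID, and all function and variable names are hypothetical; they are not taken from the FEC site or the existing scraper.

```python
import datetime
import re

# Hypothetical table of hand-checked dates, keyed by report ID.
# The 1st of the month is a placeholder day, as discussed above.
HARDCODED_DATES = {
    "westlaw-report": datetime.date(2001, 7, 1),
}

# Matches "fy12", "FY2012", "fy_12", etc. (an assumed URL convention).
FISCAL_YEAR_RE = re.compile(r"fy[-_]?(\d{4}|\d{2})(?!\d)", re.IGNORECASE)


def infer_published_on(report_id, url, report_type):
    """Return a best-guess publication date, or None if nothing applies."""
    if report_id in HARDCODED_DATES:
        return HARDCODED_DATES[report_id]

    match = FISCAL_YEAR_RE.search(url)
    if match and report_type == "audit":
        year = int(match.group(1))
        if year < 100:
            # Expand two-digit years; the pivot at 50 is an arbitrary choice.
            year += 2000 if year < 50 else 1900
        # Assume fiscal-year audits come out in November; Nov 1 is a fake day.
        return datetime.date(year, 11, 1)

    return None
```

Returning `None` instead of guessing keeps unmatched reports visible, so they can either be added to the hardcoded table or handed to a later fallback.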

If we did PDF metadata, it'd have to be done by the scraper setting a 'trust_pdf' flag for a report to True, which would let a None published_on and year slide through validation. The metadata extractor (which would probably be pdftk, introducing a new dependency) would then get the date -- and if it's not there, should throw an error.
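
Roughly, that flow could look like the sketch below. It assumes reports are plain dicts and that `pdftk <file> dump_data` prints the creation date as an `InfoKey: CreationDate` line followed by an `InfoValue: D:YYYYMMDD...` line; the function names and the dict shape are hypothetical.

```python
import datetime
import re
import subprocess

# PDF date strings look like D:YYYYMMDDHHmmSS...; month and day may be absent.
PDF_DATE_RE = re.compile(r"D:(\d{4})(\d{2})?(\d{2})?")


def parse_pdf_date(value):
    """Turn a PDF date string into a date, defaulting missing parts to 1."""
    match = PDF_DATE_RE.match(value.strip())
    if not match:
        raise ValueError("unrecognized PDF date: %r" % value)
    year, month, day = match.groups()
    return datetime.date(int(year), int(month or 1), int(day or 1))


def creation_date_from_dump(dump_text):
    """Find CreationDate in pdftk dump_data output; raise if it is missing."""
    lines = dump_text.splitlines()
    for index, line in enumerate(lines[:-1]):
        if line.strip() == "InfoKey: CreationDate":
            value_line = lines[index + 1].strip()
            if value_line.startswith("InfoValue:"):
                return parse_pdf_date(value_line[len("InfoValue:"):])
    raise ValueError("no CreationDate in PDF metadata")


def pdf_creation_date(path):
    """Run pdftk on a local file and return its creation date."""
    result = subprocess.run(
        ["pdftk", path, "dump_data"],
        capture_output=True, text=True, check=True,
    )
    return creation_date_from_dump(result.stdout)


def validate_report(report):
    """Allow a missing published_on only when the scraper set trust_pdf."""
    if report.get("published_on") is None and not report.get("trust_pdf"):
        raise ValueError("report has no published_on and trust_pdf is not set")
```

Keeping the parsing separate from the `pdftk` call means the date logic can be checked against sample output without having pdftk installed.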

I'm unsure whether the PDF metadata pipeline is worthwhile, when it's so unreliable for other IGs and introduces other complexity.

Do you think an analysis of report URLs and a table of hardcoded IDs is sufficient for the FEC? I'd be happy to reach out to the FEC IG's web team to try to get them to add more dates to their reports, too.

spulec commented 10 years ago

Closing in favor of resolving #101

spulec commented 10 years ago

Thanks!

unitedstates / inspectors-general

FEC Issues #97