unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Extract more metadata using pdfinfo #88

Closed: divergentdave closed this 10 years ago

divergentdave commented 10 years ago

WIP, still testing this. Addresses #76. There will probably be a merge conflict with #84.

divergentdave commented 10 years ago

This is ready to go now, let me know what you think.

konklone commented 10 years ago

Fixed a merge conflict caused by #87, and this is good to go. Thanks, @divergentdave!

konklone commented 10 years ago

This exciting new CreationDate format from a Dept of Education IG report caused a crash:

CreationDate:   D:00000101000000Z
ModDate:        Tue Sep 25 07:26:39 2001

The crash happens because, when none of the time formats match, my_datetime is never assigned before it gets checked:

  File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 171, in parse_pdf_datetime
    if my_datetime:

UnboundLocalError: local variable 'my_datetime' referenced before assignment

I fixed the crash in https://github.com/unitedstates/inspectors-general/commit/f98044d62d3b0472aa7d99a41a895d05d3e9e73d, but am re-opening to see what you make of this new format -- I have no idea how to parse it, or if it's invalid.
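
For readers following along, here is a minimal sketch of the failure mode described above and one common way to avoid it. This is not the code from the linked commit: the function name `parse_pdf_datetime_sketch` and the `CANDIDATE_FORMATS` list are hypothetical, and the formats actually tried in `utils.py` may differ. The idea is only to bind `my_datetime` before the parse attempts so the later check cannot raise `UnboundLocalError`.

```python
from datetime import datetime

# Hypothetical list of formats; the real ones in utils.py may differ.
CANDIDATE_FORMATS = [
    "%a %b %d %H:%M:%S %Y",  # e.g. "Tue Sep 25 07:26:39 2001"
    "%m/%d/%y %H:%M:%S",
]


def parse_pdf_datetime_sketch(value):
    """Return a datetime, or None when no known format matches."""
    # Bind the name up front so the check below is always safe.
    my_datetime = None
    for fmt in CANDIDATE_FORMATS:
        try:
            my_datetime = datetime.strptime(value, fmt)
            break
        except ValueError:
            # This format did not match; try the next one.
            continue
    if my_datetime:
        return my_datetime
    return None
```

With this shape, an unrecognized string such as `D:00000101000000Z` falls through every format and yields `None` rather than crashing.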

konklone commented 10 years ago

Er, well, can't re-open a PR, but I'm going to assume it's a bug in the metadata, and should just be ignored.

divergentdave commented 10 years ago

Whoops, good catch. Dates in PDFs are apparently supposed to follow an ASN.1-style format, but of course tons of them don't. The D: prefix is normal, but the rest of it specifies midnight UTC on January 1 of year 0000, so I'm going to assume it's garbage. pdfinfo already reformats dates that are in the valid format, so those are human-readable by the time they get to us.
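
To make the format concrete, here is a rough sketch of parsing a raw `D:`-prefixed PDF date and rejecting impossible values such as year 0000. It is an illustration only, not something the scraper does: `parse_raw_pdf_date` and `PDF_DATE_RE` are hypothetical names, and the sketch assumes the common `D:YYYYMMDDHHmmSS` layout with an optional `Z` or `+HH'mm'` style offset.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: D:YYYY[MM[DD[HH[mm[SS]]]]] plus an optional zone suffix.
PDF_DATE_RE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<tz>Z|[+-]\d{2}'?(?:\d{2}'?)?)?$"
)


def parse_raw_pdf_date(value):
    """Return a datetime for a raw PDF date string, or None if unusable."""
    match = PDF_DATE_RE.match(value.strip())
    if not match:
        return None
    parts = match.groupdict()
    try:
        tz = parts["tz"]
        if tz is None:
            tzinfo = None  # no zone given; leave the result naive
        elif tz == "Z":
            tzinfo = timezone.utc
        else:
            digits = tz[1:].replace("'", "")
            minutes = int(digits[:2]) * 60 + int(digits[2:4] or 0)
            sign = -1 if tz[0] == "-" else 1
            tzinfo = timezone(sign * timedelta(minutes=minutes))
        # datetime() raises ValueError for year 0, month 13, and so on.
        return datetime(
            int(parts["year"]),
            int(parts["month"] or 1),
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        return None
```

Python's `datetime` has a minimum year of 1, so `D:00000101000000Z` matches the pattern but fails construction and comes back as `None`, which lines up with treating it as garbage metadata.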
