unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Extract text and metadata from .doc files #141

Closed by divergentdave 10 years ago

divergentdave commented 10 years ago

This uses abiword and file to extract text and metadata from Microsoft Word documents in the .doc format (Word 2003 and earlier). See also #113.
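
For readers who have not seen the diff, here is a rough sketch of what driving the two tools from Python might look like. It is not the code from this pull request. The function names (`abiword_command`, `parse_file_description`, `extract_doc`) are hypothetical, and the comma-splitting parser is a naive assumption about how `file --brief` describes a .doc file; it will mis-split any metadata value that itself contains ", ".

```python
import subprocess


def abiword_command(doc_path, txt_path):
    """Build the argv list that asks abiword to convert a .doc to plain text."""
    # --to and --to-name are abiword's command-line conversion options.
    return ["abiword", "--to=txt", "--to-name=" + txt_path, doc_path]


def parse_file_description(description):
    """Naively turn a `file --brief` description into a metadata dict.

    Assumes comma-separated fields such as "Title: ..., Author: ...".
    Fields without a "key: value" shape are skipped.
    """
    metadata = {}
    for field in description.strip().split(", "):
        key, separator, value = field.partition(": ")
        if separator:
            metadata[key.strip()] = value.strip()
    return metadata


def extract_doc(doc_path, txt_path):
    """Write the document text to txt_path and return the parsed metadata.

    Requires the abiword and file executables to be installed.
    """
    subprocess.check_call(abiword_command(doc_path, txt_path))
    raw = subprocess.check_output(["file", "--brief", doc_path])
    return parse_file_description(raw.decode("utf-8", "replace"))
```

Calling `extract_doc("report.doc", "report.txt")` would leave the text in `report.txt` and return whatever key/value pairs `file` reported, assuming both tools are on the PATH.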

konklone commented 10 years ago

Nice! What should I test this on?

divergentdave commented 10 years ago

There are a few docs in the education IG, years 2002 and 2003.

konklone commented 10 years ago

I fixed up the file and abiword commands to run with shell=False under Linux in d15d10d and 6c81f13. It works fantastically, thanks @divergentdave!
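
The referenced commits are not shown in this thread, so the following only illustrates the general pattern of running a command with shell=False; it is not a copy of the fix. With shell=False, Python's subprocess module expects the command as a list of arguments, and each list item reaches the program as one argument with no shell interpretation. The helper names (`run_without_shell`, `file_argv`, `echo_argument`) are hypothetical, and the Python interpreter stands in for `file` and `abiword` so the sketch runs without either tool installed.

```python
import subprocess
import sys


def run_without_shell(argv):
    """Run argv (a list of strings) with shell=False and return decoded stdout."""
    return subprocess.check_output(argv, shell=False).decode("utf-8")


def file_argv(doc_path):
    """Build the argv list for `file`; the path stays a single list item."""
    return ["file", "--brief", doc_path]


def echo_argument(value):
    """Pass value to a child process as one argument and return what it received."""
    script = "import sys; sys.stdout.write(sys.argv[1])"
    return run_without_shell([sys.executable, "-c", script, value])
```

A path containing spaces or shell metacharacters, for example `echo_argument("2003 report; echo oops.doc")`, comes back unchanged, because no shell ever parses it.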

divergentdave commented 10 years ago

Oh right, good catch. Thanks!