unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Extract text from Microsoft Word documents (pre-Word 2007) #113

divergentdave closed this issue 10 years ago

divergentdave commented 10 years ago

I've noticed some .doc reports from IGs, and it would be good to extract text from them. This could use AbiWord's command line interface, or unoconv, which drives LibreOffice. As a bonus, unoconv could also turn the report into a PDF, if we'd like to standardize on that.

Edit: It also looks like we can extract metadata, including the last-modified time, using the venerable file command.
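
For illustration, here is a rough Python sketch of what the options above might look like in a scraper. It is not the project's actual code: the function names (build_convert_command, convert, parse_last_saved, read_last_saved) are hypothetical. It assumes unoconv accepts -f and -o, AbiWord accepts --to and --to-name, and file reports a "Last Saved Time/Date" field for old Word documents, as their documentation describes.

```python
import re
import shutil
import subprocess
from datetime import datetime
from typing import List, Optional


def build_convert_command(doc_path: str, out_path: str,
                          fmt: str = "txt", tool: str = "unoconv") -> List[str]:
    """Build the argv list that converts doc_path to out_path (hypothetical helper)."""
    if tool == "unoconv":
        # -f picks the output format (e.g. txt or pdf), -o the output file.
        return ["unoconv", "-f", fmt, "-o", out_path, doc_path]
    if tool == "abiword":
        # --to picks the output format, --to-name the output file.
        return ["abiword", "--to=" + fmt, "--to-name=" + out_path, doc_path]
    raise ValueError("unsupported tool: " + tool)


def convert(doc_path: str, out_path: str,
            fmt: str = "txt", tool: str = "unoconv") -> None:
    """Run the converter, failing loudly if it is missing or exits non-zero."""
    if shutil.which(tool) is None:
        raise RuntimeError(tool + " is not installed or not on PATH")
    subprocess.run(build_convert_command(doc_path, out_path, fmt, tool), check=True)


def parse_last_saved(file_output: str) -> Optional[datetime]:
    """Pull the last-saved timestamp out of the text printed by file, if present."""
    match = re.search(r"Last Saved Time/Date: ([^,]+)", file_output)
    if match is None:
        return None
    # Assumes the C locale's date layout, e.g. "Tue Jan 14 15:25:00 2014".
    return datetime.strptime(match.group(1).strip(), "%a %b %d %H:%M:%S %Y")


def read_last_saved(doc_path: str) -> Optional[datetime]:
    """Run file in brief mode (-b) on doc_path and parse its output."""
    result = subprocess.run(["file", "-b", doc_path],
                            capture_output=True, text=True, check=True)
    return parse_last_saved(result.stdout)
```

Calling convert("report.doc", "report.txt") would produce the text version, and convert("report.doc", "report.pdf", fmt="pdf") the PDF. Keeping the command construction separate from the subprocess call makes it possible to test the logic on machines where neither converter is installed.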

audiodude commented 10 years ago

Just saw https://github.com/deanmalmgren/textract

Maybe that's worth looking into?

konklone commented 10 years ago

Beat you by 5 minutes on this in #91, @audiodude.

konklone commented 10 years ago

Closing in favor of resolving #141.