uscensusbureau / fismatic

https://github.com/uscensusbureau/fismatic/projects/1

try Apache Tika #25

Closed afeld closed 5 years ago

afeld commented 5 years ago

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

https://tika.apache.org/

https://blog.ouseful.info/2015/02/09/getting-text-of-anything-docs-pdfs-images-using-apache-tika/

afeld commented 5 years ago

Given that it's Java, I'm not sure it's worth the complexity of bridging between two languages. Putting this aside for now.
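
For anyone revisiting this: one way to avoid an in-process Java bridge might be to run Tika Server as a separate process and talk to it over HTTP. Below is a rough sketch under that assumption, not something tried in this repo. It assumes a Tika Server is already listening on its default port (9998) on localhost. The names `build_tika_request` and `extract_text` are hypothetical helpers made up for illustration.

```python
import urllib.request

# Assumption: a Tika Server instance is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika Server to return extracted plain text."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send the file at `path` to Tika Server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("some-document.docx")` would then return the document text as a string, provided the server is up. This still needs a JVM to run the server, so it moves the complexity into deployment rather than removing it.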