mcs07 / ChemDataExtractor

Automatically extract chemical information from scientific documents
http://chemdataextractor.org
MIT License

Consider tika-python for text extraction? #26

Open chrismattmann opened 4 years ago

chrismattmann commented 4 years ago

Hi,

Not sure how you are doing text extraction, but I just saw an article in IEEE ComputingEdge that cited your tool. If you have any interest in Apache Tika, we provide a functional Python library (tika-python) that you could leverage. Does pdfminer also do the text extraction part?

The benefit of Tika is that it supports text extraction from 1400+ formats.

Cheers, Chris
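
For reference, here is a minimal sketch of what text extraction with tika-python could look like. It assumes the `tika` package is installed and Java is available, since tika-python talks to a local Tika server that it starts on first use. The helper name `extract_text` and its `parse` parameter are hypothetical, added only for illustration; `parser.from_file` and the `"content"` key of its result are the tika-python pieces. Nothing here reflects how ChemDataExtractor currently reads documents.

```python
from typing import Callable, Optional


def extract_text(path: str, parse: Optional[Callable[[str], dict]] = None) -> str:
    """Return the plain text of a document, or '' if nothing was extracted.

    `parse` is a hypothetical hook for swapping in another backend;
    by default tika-python's parser.from_file is used.
    """
    if parse is None:
        # Imported lazily: tika-python needs Java and a local Tika server.
        from tika import parser
        parse = parser.from_file

    parsed = parse(path)
    # The result is a dict with "metadata" and "content" keys;
    # "content" can be None when no text could be extracted.
    content = parsed.get("content")
    return content.strip() if content else ""
```

Calling `extract_text("paper.pdf")` (the file name is a placeholder) would return the document text as one string, which could then be handed to whatever tokenization or parsing step comes next. The same call works for other formats Tika supports, which is the main appeal over a PDF-only extractor.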