soprasteria / cybersecurity-dfm

Data Feed Manager (news watch orchestrator to predict topic with deepdetect and store cleaned text in elasticsearch)
GNU General Public License v3.0

Support file text extraction #2

Closed acabrol closed 6 years ago

acabrol commented 6 years ago

Currently, DFM only extracts content from web pages.

The to-do list also includes extracting text from downloaded files, such as pdf, doc, docx, ppt, pptx, odt, and odp.

The link below suggests a way to detect the file format: https://stackoverflow.com/questions/38710238/python-download-file-over-http-and-detect-filetype-automatically
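As a rough illustration of the detection step, here is a minimal sketch that maps the HTTP `Content-Type` header to a file extension and falls back to the leading bytes for PDF. The function name `guess_extension` and the `MIME_TO_EXT` table are hypothetical and not part of DFM. The sketch assumes the caller already has the header value and the first bytes of the download.

```python
# Hypothetical helper, not part of DFM: the names and the table are illustrative.
MIME_TO_EXT = {
    "application/pdf": ".pdf",
    "application/msword": ".doc",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
    "application/vnd.ms-powerpoint": ".ppt",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
    "application/vnd.oasis.opendocument.text": ".odt",
    "application/vnd.oasis.opendocument.presentation": ".odp",
}


def guess_extension(content_type, head=b""):
    """Return an extension such as '.pdf', or None when the type is unknown."""
    # Drop parameters such as '; charset=utf-8' and normalise case.
    mime = (content_type or "").split(";")[0].strip().lower()
    if mime in MIME_TO_EXT:
        return MIME_TO_EXT[mime]
    # Fall back to the leading bytes when the header is missing or generic.
    # Only PDF is checked here: the Office and OpenDocument formats share
    # container signatures (OLE2 or ZIP), so their first bytes are ambiguous.
    if head.startswith(b"%PDF-"):
        return ".pdf"
    return None
```

Servers sometimes send a generic `application/octet-stream`, which is why the byte check is kept as a fallback. A content-sniffing library may be a better fit for the ZIP-based formats.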

Text extraction from several types of documents: http://textract.readthedocs.io/en/stable/

Textract requires a file on disk, which can be created with: https://docs.python.org/2/library/tempfile.html
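A minimal sketch of how the two pieces could fit together, assuming the downloaded bytes and a file extension are already known. The function `extract_text` and its `extractor` parameter are hypothetical names introduced for illustration. By default the sketch calls `textract.process`, which takes a file path and uses the extension to choose a parser, so the temporary file is given a matching suffix.

```python
import os
import tempfile


def extract_text(data, extension, extractor=None):
    """Write `data` to a temporary file and run a text extractor on its path.

    Hypothetical helper, not part of DFM. `extractor` is any callable that
    takes a file path; it defaults to textract.process.
    """
    if extractor is None:
        import textract  # third-party, imported lazily so it stays optional
        extractor = textract.process
    # delete=False lets us close the file before the extractor opens it,
    # which matters on platforms that forbid opening a file twice.
    tmp = tempfile.NamedTemporaryFile(suffix=extension, delete=False)
    try:
        tmp.write(data)
        tmp.close()
        return extractor(tmp.name)
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)
```

Passing the extractor as a parameter is a design choice for this sketch only: it makes it possible to swap in another tool or a stub without changing the temporary-file handling.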