jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License

Explain how to read text data from PDF and PowerPoint and use it with Texthero #24

Open selimelawwa opened 4 years ago

selimelawwa commented 4 years ago

PDFs, PowerPoint presentations, and other unstructured text sources contain valuable data that can be used for analysis. Many tools can extract text from them. It would be nice if we could provide a single method to read such files, so that users don't have to bother with this themselves.

There is a Python library, textract, that provides this functionality; unfortunately, it is not maintained.

We could provide a method, loadData or similar, with a different implementation depending on the file type.

jbesomi commented 4 years ago

Very interesting comment. Completely agree that we should do something related.

Regarding textract: there are also other Python tools for PDF extraction, such as PyPDF2 and PDFminer.

Regarding dataLoader: since the use cases differ quite a lot from task to task, and since this feature is a bit too far from the core idea of Texthero, an alternative would be to add a detailed tutorial on the blog, with code snippets (which could also be added somewhere in the GitHub repo), that explains how to extract text data from different sources such as PDF and PowerPoint. What do you think about this? Also, building a universal dataLoader might be quite hard, which is why there is generally a dedicated Python package that does only that.

As a final comment, it's important to define precisely what the goals and objectives of Texthero are: better to do one thing great than five things average. We can discuss that as well at some point.

selimelawwa commented 4 years ago

Completely agree with your final comment! Even though this is not one of the core goals of Texthero, I think it could be a cool feature to have. I just wanted to write it down so it can be built later on, once the core is up and running. I think having ideas written down and shared is good for the project.

My idea for a universal data loader is that it appears "universal" to the user, while having multiple implementations and using different packages under the hood depending on the file type or data source.

For now yeah we can just have a tutorial on the blog!

igponce commented 4 years ago

There's a good library, TIKA-Python (https://github.com/chrismattmann/tika-python), that handles PDFs, emails, and other formats as well. It is based on Apache Tika (http://tika.apache.org/) and the maintainer is on the Apache Tika board.

The only downside I see is that it needs a JVM to run Tika behind the scenes, but it's very easy to start using:


import tika
from tika import parser

tika.initVM()  # Gets the Apache Tika jar file (if not present) and launches Tika from the JVM

filename = 'path/to/your/file(ppt|doc|docx|pdf)'
thedoc = parser.from_file(filename)

print(thedoc['metadata'])  # dict with information about the file itself
print(thedoc['content'])   # UTF-8 text extracted from the file

# Dump attachments if the file has any (like .msg, .eml, etc.).
if thedoc.get('attachments', False):
    print(thedoc['attachments'])

jbesomi commented 4 years ago

Hi @igponce! Thank you for your comment!

Adding native PDF support might be a bit out of Texthero's purposes.

What would definitely be useful is a tutorial on the Texthero blog that explains how to start hero-analyzing a collection of documents, starting from raw PDFs and other formats.

There are different solutions for doing that. Another valid alternative is, for instance, pdfminer.six, as it's very simple to use and is based only on Python (no need for the JVM).

For example, to go from raw PDF data to a Pandas DataFrame, these few lines of code do the job:

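import glob

import pandas as pd
from pdfminer.high_level import extract_text

# Collect every PDF in the folder and extract its text.
all_pdf = glob.glob("filepath_to_pdf_collection/*.pdf")
text = [extract_text(p) for p in all_pdf]

# One row per document, with the extracted text in a 'text' column.
df = pd.DataFrame(text, columns=['text'])

# ... do hero analysis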
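On a real collection, some files may fail to parse, and it usually helps to keep the filename next to the text. A minimal sketch under those assumptions follows; `collect_texts` is a hypothetical helper, not part of Texthero or pdfminer.six, and it takes the extraction function as an argument so any of the tools mentioned in this thread could be plugged in.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def collect_texts(paths: Iterable, extract: Callable[[str], str]) -> List[Dict[str, str]]:
    """Extract text from each path, skipping files the extractor cannot read."""
    records = []
    for path in sorted(str(p) for p in paths):
        try:
            text = extract(path)
        except Exception as error:  # extraction libraries raise many different error types
            print(f"Skipping {path}: {error}")
            continue
        records.append({"filename": Path(path).name, "text": text})
    return records


# Example usage (assumes pdfminer.six and pandas are installed):
#   import glob
#   import pandas as pd
#   from pdfminer.high_level import extract_text
#   records = collect_texts(glob.glob("filepath_to_pdf_collection/*.pdf"), extract_text)
#   df = pd.DataFrame(records)  # columns: filename, text
```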

Would you be interested in writing such a blog post? It would be great to show how to go from raw data to Pandas/Hero using different tools, including Apache Tika and Pdfminer, Textract, ...

regards,

igponce commented 4 years ago

Good point on keeping PDF etc. out of scope: it's very tempting to add stuff, but hard to leave it out. I'll send you a draft right after I run some experiments myself. Maybe next week.

jbesomi commented 4 years ago

Sounds amazing! Looking forward to that!