Open pseudotensor opened 10 months ago
Windows:
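import os

# tabula-py needs to find the Java executable, so add the JRE's bin directory to PATH first.
os.environ['PATH'] = os.environ['PATH'] + os.pathsep + r'C:\Program Files\Java\jre-1.8\bin'

import tabula

pdf = "c:/users/pseud/Downloads/whisper_test.pdf"
# stream=True uses stream-mode extraction (tables without ruling lines); pages="all" scans every page.
dfs = tabula.read_pdf(pdf, stream=True, pages="all")
print(len(dfs))  # number of tables found
print(dfs[0])    # first table, as a pandas DataFrame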
Gives nice tables back, e.g.:
Improving table extraction from PDFs would indeed be a significant enhancement. Integrating such functionality, especially one that can work seamlessly with OCR for image-based tables, would be amazing!
A workable solution for now is the 'maximum quality parsing' option, which runs every parser, including the OCR-based unstructured one. This helps.
https://github.com/camelot-dev/camelot
https://pypi.org/project/tabula-py/
tabula on Windows 10:
https://tabula-py.readthedocs.io/en/latest/getting_started.html#get-tabula-py-working-windows-10
https://javadl.oracle.com/webapps/download/AutoDL?BundleId=248774_8c876547113c4e4aab3c868e9e0ec572
https://github.com/chezou/tabula-py/issues/195
PyMuPDF and similar libraries all do poorly on these tables.
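For anyone who wants to compare the two linked libraries on the same file, here is a rough sketch. It assumes tabula-py (with Java available) and camelot are both installed; the helper names add_to_path, tables_with_tabula, tables_with_camelot and compare are made up for illustration and are not part of either library. The imports are done lazily so that a missing library is reported as a failure for that library only.

```python
import os


def add_to_path(directory):
    """Append a directory to PATH once (hypothetical helper, not part of tabula-py)."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    if directory not in entries:
        entries.append(directory)
        os.environ["PATH"] = os.pathsep.join(e for e in entries if e)


def tables_with_tabula(pdf_path):
    """Return a list of DataFrames using tabula-py in stream mode."""
    import tabula  # lazy import: requires tabula-py and a Java runtime
    return tabula.read_pdf(pdf_path, stream=True, pages="all")


def tables_with_camelot(pdf_path):
    """Return a list of DataFrames using camelot's stream flavor."""
    import camelot  # lazy import: requires camelot to be installed
    return [table.df for table in camelot.read_pdf(pdf_path, pages="all", flavor="stream")]


def compare(pdf_path):
    """Report how many tables each library finds, or why it failed."""
    results = {}
    extractors = (("tabula", tables_with_tabula), ("camelot", tables_with_camelot))
    for name, extractor in extractors:
        try:
            results[name] = len(extractor(pdf_path))
        except Exception as exc:  # missing library, missing Java, unreadable PDF, ...
            results[name] = f"failed: {exc}"
    return results
```

On Windows this could be used as add_to_path(r'C:\Program Files\Java\jre-1.8\bin') followed by print(compare("c:/users/pseud/Downloads/whisper_test.pdf")). The table count alone says nothing about quality, so the returned DataFrames still need to be looked at by eye.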