improve table extraction from PDFs

h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/

http://h2o.ai

Apache License 2.0


improve table extraction from PDFs #702

Open pseudotensor opened 10 months ago

pseudotensor commented 10 months ago

https://github.com/camelot-dev/camelot

pip install tabula-py

https://pypi.org/project/tabula-py/

tabula on Windows 10:
https://tabula-py.readthedocs.io/en/latest/getting_started.html#get-tabula-py-working-windows-10
https://javadl.oracle.com/webapps/download/AutoDL?BundleId=248774_8c876547113c4e4aab3c868e9e0ec572
https://github.com/chezou/tabula-py/issues/195

PyMuPDF and the other parsers all do poorly on tables.

pseudotensor commented 10 months ago

Windows:

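import os

# Append the Java runtime's bin directory to PATH so tabula-py can find java.
os.environ['PATH'] += os.pathsep + r'C:\Program Files\Java\jre-1.8\bin'

import tabula

pdf = "c:/users/pseud/Downloads/whisper_test.pdf"
# stream=True forces stream-mode extraction (for tables without ruling lines).
dfs = tabula.read_pdf(pdf, stream=True, pages="all")
print(len(dfs))  # number of tables found
print(dfs[0])  # first table, as a pandas DataFrame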

This gives nice tables back.
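The PATH tweak in the Windows snippet can be made a bit more defensive. Below is a minimal sketch, not part of tabula-py or h2oGPT: `add_java_to_path` is a hypothetical helper name, and it assumes the only requirement is that a `java` executable be discoverable on PATH before `tabula` is called.

```python
import os
import shutil


def add_java_to_path(java_bin_dir: str) -> bool:
    """Append java_bin_dir to PATH if it exists and is not already listed.

    Hypothetical helper for illustration only. Returns True if a `java`
    executable can be found on PATH afterwards.
    """
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if os.path.isdir(java_bin_dir) and java_bin_dir not in entries:
        entries.append(java_bin_dir)
        os.environ["PATH"] = os.pathsep.join(entries)
    # shutil.which searches the current PATH for the executable.
    return shutil.which("java") is not None
```

Calling `add_java_to_path(r'C:\Program Files\Java\jre-1.8\bin')` before `import tabula` and checking the return value gives an early, readable failure if Java is still missing, instead of an error from inside `tabula.read_pdf`.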

ffalkenberg commented 10 months ago

Improving table extraction from PDFs would indeed be a significant enhancement. Integrating such functionality, especially one that can work seamlessly with OCR for image-based tables, would be amazing!

pseudotensor commented 10 months ago

An OK workaround for now is the 'maximum quality parsing' option, which runs all of the parsers, including the OCR-based unstructured one. This helps.
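The "run every parser" idea can be sketched roughly as follows. This is an illustration only and not h2oGPT's actual implementation: `parse_with_all` and the parser callables are hypothetical stand-ins, each assumed to take a PDF path and return a list of text chunks.

```python
from typing import Callable, Dict, List

# Assumed interface: a parser takes a PDF path and returns text chunks.
Parser = Callable[[str], List[str]]


def parse_with_all(pdf_path: str, parsers: Dict[str, Parser]) -> Dict[str, List[str]]:
    """Run every parser on the same PDF and keep each non-empty result."""
    results: Dict[str, List[str]] = {}
    for name, parser in parsers.items():
        try:
            # Drop chunks that are empty or whitespace-only.
            chunks = [c for c in parser(pdf_path) if c.strip()]
        except Exception as exc:
            # One failing parser should not stop the others.
            print(f"{name} failed: {exc}")
            continue
        if chunks:
            results[name] = chunks
    return results
```

Keeping the output of every parser costs extra time and produces overlapping text, but it means a table missed by one parser can still be picked up by another.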