rmusser01 / tldw

tl/dw (Too Long, Didn't Watch): Your Personal Research Multi-Tool - a naive attempt at 'A Young Lady's Illustrated Primer'
Apache License 2.0
330 stars, 11 forks

Improvement: Add ability to ingest PDF Documents #46

Closed: rmusser01 closed this issue 4 months ago

rmusser01 commented 5 months ago

As a user, I would like to be able to select/upload a PDF document and have its text content extracted, chunked (if necessary), and then summarized appropriately. (The document should also be ingested into the DB, with the option of adding keywords to it.)
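A rough sketch of the "chunked (if necessary)" step, assuming the PDF text has already been extracted into a plain string. The function name `chunk_text`, the word-based splitting, and the default sizes are all illustrative assumptions, not what tldw actually does:

```python
def chunk_text(text: str, max_words: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks (hypothetical helper).

    Whitespace is normalized to single spaces as a side effect of splitting
    on whitespace and re-joining.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")

    words = text.split()
    if not words:
        return []
    # Short documents need no chunking: return them as a single chunk.
    if len(words) <= max_words:
        return [" ".join(words)]

    step = max_words - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk reaches the end of the document.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk could then be summarized on its own and the partial summaries combined. The overlap is there so that a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks.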

Tool I used: https://github.com/VikParuchuri/marker

Document Layout Analysis https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis

OCR-related:
https://github.com/VikParuchuri/surya
https://github.com/nlmatics/llmsherpa
https://github.com/tesseract-ocr/tesseract
https://github.com/UglyToad/PdfPig
https://github.com/Filimoa/open-parse
https://pypi.org/project/pymupdf4llm/ & https://pymupdf4llm.readthedocs.io/en/latest/
https://github.com/pymupdf/PyMuPDF

PDF Tools https://www.pdftool.org/en https://github.com/Stirling-Tools/Stirling-PDF

Langchain https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/

Others that seemed interesting but that I didn't test: https://github.com/apache/tika

rmusser01 commented 4 months ago

Can't just use marker-pdf directly: its deps conflict with the project's. Will have to do a dirty hack of keeping it in a separate venv and calling out to it when needed.
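A minimal sketch of what that separate-venv callout might look like, using only the standard library. The helper names `venv_executable` and `run_in_venv` are hypothetical, and the entry-point name and arguments for marker are deliberately left to the caller, since they depend on the installed marker-pdf version:

```python
import subprocess
import sys
from pathlib import Path


def venv_executable(venv_dir: Path, name: str) -> Path:
    """Return the path of a program installed inside a virtual environment."""
    # Windows venvs put entry points in Scripts\, POSIX venvs in bin/.
    if sys.platform == "win32":
        return venv_dir / "Scripts" / f"{name}.exe"
    return venv_dir / "bin" / name


def run_in_venv(venv_dir: Path, name: str, args: list[str],
                timeout: float = 600.0) -> str:
    """Run a venv-installed program in a child process and return its stdout.

    The child uses the venv's own interpreter and packages, so its
    dependencies never have to be importable from the main application.
    """
    exe = venv_executable(venv_dir, name)
    result = subprocess.run(
        [str(exe), *args],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Usage would be something like `run_in_venv(Path("marker-venv"), "marker_single", [...])`, where both the venv location and the `marker_single` entry-point name are assumptions to check against the marker README for the version actually installed.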

rmusser01 commented 4 months ago

Closing since marker should work, and I've confirmed text file ingestion.