As a user, I would like to be able to select or upload a PDF document, have its text content extracted, chunked (if necessary), and then summarized appropriately. The result should be ingested into the DB, with the option of adding keywords to the document.
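A minimal sketch of the chunk, summarize, and ingest steps from the user story above, using only the Python standard library. It assumes the text has already been extracted from the PDF by one of the tools listed below. The function names (`chunk_text`, `summarize_document`, `ingest`), the character-based chunk sizes, the `summarize` callable, and the two-table SQLite schema are all hypothetical choices for illustration, not an existing implementation.

```python
import sqlite3
from typing import Callable, Iterable


def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows; short text stays whole."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def summarize_document(
    text: str,
    summarize: Callable[[str], str],
    max_chars: int = 2000,
    overlap: int = 200,
) -> str:
    """Summarize each chunk, then summarize the joined partial summaries."""
    chunks = chunk_text(text, max_chars, overlap)
    if not chunks:
        return ""
    partial = [summarize(chunk) for chunk in chunks]
    if len(partial) == 1:
        return partial[0]
    return summarize("\n".join(partial))


def ingest(
    conn: sqlite3.Connection,
    title: str,
    content: str,
    summary: str,
    keywords: Iterable[str] = (),
) -> int:
    """Store the document and its optional keywords; return the new row id."""
    # Hypothetical schema: one row per document, one row per keyword.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, title TEXT, content TEXT, summary TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keywords ("
        "doc_id INTEGER REFERENCES documents(id), keyword TEXT)"
    )
    cur = conn.execute(
        "INSERT INTO documents (title, content, summary) VALUES (?, ?, ?)",
        (title, content, summary),
    )
    doc_id = cur.lastrowid
    # Normalize keywords and skip blank entries.
    conn.executemany(
        "INSERT INTO keywords (doc_id, keyword) VALUES (?, ?)",
        [(doc_id, k.strip().lower()) for k in keywords if k.strip()],
    )
    conn.commit()
    return doc_id
```

The `summarize` argument is whatever summarization call ends up being used (for example an LLM request). It is passed in as a plain callable so the chunking and ingestion logic can be tested without it.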
Tool I used: https://github.com/VikParuchuri/marker
Document Layout Analysis: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis
OCR-related: https://github.com/VikParuchuri/surya https://github.com/nlmatics/llmsherpa https://github.com/tesseract-ocr/tesseract https://github.com/UglyToad/PdfPig https://github.com/Filimoa/open-parse https://pypi.org/project/pymupdf4llm/ & https://pymupdf4llm.readthedocs.io/en/latest/ https://github.com/pymupdf/PyMuPDF
PDF Tools: https://www.pdftool.org/en https://github.com/Stirling-Tools/Stirling-PDF
LangChain: https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/
Others that seemed interesting but that I didn't test: https://github.com/apache/tika