rmusser01 / tldw

Too Long, Didn't Watch (TL/DW): Your Personal Research Multi-Tool - Open Source NotebookLM
Apache License 2.0

Improvement: Add ability to ingest PDF Documents #46

Open rmusser01 opened 1 month ago

rmusser01 commented 1 month ago

As a user, I would like to be able to select or upload a PDF document, have its text content extracted, chunked (if necessary), and then summarized appropriately. The document should also be ingested into the DB, with the option of adding keywords to it.
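A rough sketch of the chunk-and-ingest half of that flow, assuming the text has already been pulled out of the PDF by one of the tools listed in this thread. Everything here is hypothetical and not taken from the tldw codebase: the `chunk_text` and `ingest_document` names, the word-based chunk sizes, and the three-table SQLite schema are placeholders for illustration. The `summarize` argument stands in for whatever LLM call would produce the summary.

```python
import sqlite3


def chunk_text(text, max_words=200, overlap=20):
    """Split text into word-based chunks that overlap slightly with their neighbours."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def ingest_document(conn, title, text, keywords=(), summarize=None):
    """Store a document, its chunks, and optional keywords; return the new document id."""
    # Hypothetical schema, created on first use.
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, title TEXT, summary TEXT);
        CREATE TABLE IF NOT EXISTS chunks (doc_id INTEGER, position INTEGER, body TEXT);
        CREATE TABLE IF NOT EXISTS keywords (doc_id INTEGER, keyword TEXT);
        """
    )
    chunks = chunk_text(text)
    # `summarize` is a caller-supplied callable (e.g. an LLM wrapper); None skips summarization.
    summary = summarize(chunks) if summarize else None
    cur = conn.execute(
        "INSERT INTO documents (title, summary) VALUES (?, ?)", (title, summary)
    )
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO chunks VALUES (?, ?, ?)",
        [(doc_id, position, body) for position, body in enumerate(chunks)],
    )
    # Normalize keywords and drop empty entries before storing them.
    conn.executemany(
        "INSERT INTO keywords VALUES (?, ?)",
        [(doc_id, kw.strip().lower()) for kw in keywords if kw.strip()],
    )
    conn.commit()
    return doc_id
```

Calling `ingest_document(sqlite3.connect("media.db"), "Some paper", extracted_text, keywords=["pdf", "rag"])` would store the document with its chunks and keywords; the `media.db` filename is also just a placeholder. Splitting on words rather than tokens keeps the sketch dependency-free, though a real implementation would probably chunk by tokens or by document layout.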

PDF Tools:
https://github.com/VikParuchuri/surya
https://github.com/nlmatics/llmsherpa
https://github.com/Stirling-Tools/Stirling-PDF
https://www.pdftool.org/en
https://github.com/VikParuchuri/marker
https://blog.dagworks.io/p/containerized-pdf-summarizer-with
https://ai.gopubby.com/demystifying-pdf-parsing-02-pipeline-based-method-82619dbcbddf?gi=5de928644ec4
https://github.com/tesseract-ocr/tesseract
https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis
https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/
https://github.com/apache/tika

http://mccormickml.com/2024/01/30/summarizing-long-pdfs-with-chatgpt/
https://github.com/nlmatics/nlm-ingestor
https://github.com/Filimoa/open-parse - extract tables

https://archive.is/p0cLQ

rmusser01 commented 1 week ago

https://artifex.com/blog/rag-llm-and-pdf-conversion-to-markdown-text-with-pymupdf
https://pypi.org/project/pymupdf4llm/
https://github.com/pymupdf/PyMuPDF
https://pymupdf4llm.readthedocs.io/en/latest/

rmusser01 commented 1 week ago

https://unstract.com/blog/comparing-approaches-for-using-llms-for-structured-data-extraction-from-pdfs/
https://unstract.com/blog/pdf-hell-and-practical-rag-applications/
https://neuml.github.io/txtai/usecases/#retrieval-augmented-generation
https://github.com/Zipstack/unstract
https://github.com/Filimoa/open-parse