rmusser01 / tldw

Too Long, Didn't Watch (TL/DW): Your Personal Research Multi-Tool - Open Source NotebookLM (eventually)
Apache License 2.0

Continuing Improvement: Functionality of PDF Ingestion / Chunking / Analysis #93

Open rmusser01 opened 2 months ago

rmusser01 commented 2 months ago

This issue tracks efforts to parse PDFs, along with any articles or documents related to that work.

Currently, 'marker' (https://github.com/VikParuchuri/marker) is used. It requires a separate venv, and I have not benchmarked its effectiveness. Marker is also itself a pipeline rather than a self-contained solution. Further tuning of marker should be investigated.
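
Since the converter lives in its own venv, one way to call it from the main application is to shell out to that venv's interpreter. The following is only a minimal sketch of that idea, not how tldw actually invokes marker: the `marker_venv` directory and the `convert_pdf.py` wrapper script named in the usage comment are hypothetical placeholders, and only the Python standard library is used.

```python
import subprocess
import sys
from pathlib import Path


def venv_python(venv_dir: Path) -> Path:
    """Return the path of the interpreter inside a virtual environment."""
    # Windows venvs put the interpreter in Scripts\, POSIX venvs in bin/.
    if sys.platform == "win32":
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def run_in_venv(python_exe, args, timeout=600):
    """Run `python_exe` with `args` and return whatever it wrote to stdout.

    Raises subprocess.CalledProcessError on a non-zero exit code and
    subprocess.TimeoutExpired if the process runs longer than `timeout` seconds.
    """
    result = subprocess.run(
        [str(python_exe), *args],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


# Hypothetical usage (both names below are assumptions, not part of the repo):
#   markdown = run_in_venv(
#       venv_python(Path("marker_venv")),
#       ["convert_pdf.py", "paper.pdf"],
#   )
```

Keeping the call behind a small function like this would also make it easier to swap in one of the alternatives listed below and compare their output on the same PDFs.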

Articles

Enhancements to the pipeline:

Alternatives to look at:

Models

Unsorted
https://github.com/pymupdf/RAG
https://github.com/pdf2htmlEX/pdf2htmlEX
https://news.ycombinator.com/item?id=41072632
https://camelot-py.readthedocs.io/en/master/
https://pymupdf4llm.readthedocs.io/en/latest/
https://arxiv.org/abs/2407.08488

Setting up a pipeline for research papers:
https://github.com/kermitt2/grobid

Setting up a pipeline for TTRPG books:

Setting up a pipeline for ebook PDFs

Unsorted list
https://pypi.org/project/pypdf/

rmusser01 commented 1 month ago

Table Extraction
https://ai.gopubby.com/advanced-rag-retrieval-strategy-embedded-tables-fdb3e44003a5?gi=3a05c418031c
https://pypi.org/project/pypdf/
https://github.com/pdfminer/pdfminer.six
https://github.com/camelot-dev/camelot
https://github.com/aurelio-labs/semantic-chunkers/tree/main
https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools
https://medium.com/intel-tech/tabular-data-rag-llms-improve-results-through-data-table-prompting-bcb42678914b
https://opea.dev/
https://github.com/opendatalab/MinerU

https://unstract.com/blog/extract-table-from-pdf/
https://replicate.com/cuuupid/glm-4v-9b
https://www.idrsolutions.com/online-pdf-to-html5-converter
https://www.datature.io/blog/introducing-florence-2-microsofts-latest-multi-modal-compact-visual-language-model
https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

https://www.reddit.com/r/LocalLLaMA/comments/17p31mc/rag_flexible_context_retrieval_around_a_matching/
https://www.reddit.com/r/Rag/comments/1f0q2b7/rethinking_markdown_splitting_for_rag_context/
https://github.com/docprompt/Docprompt
https://github.com/emcf/thepipe