Continuing Improvement: Functionality of PDF Ingestion / Chunking / Analysis

rmusser01 / tldw

Too Long, Didn't Watch(TL/DW): Your Personal Research Multi-Tool - Open Source NotebookLM (eventually)

Apache License 2.0


This issue tracks efforts to improve PDF parsing, along with any related articles and documents.

Currently, 'marker' (https://github.com/VikParuchuri/marker) is used. It requires a separate venv, and I have not benchmarked its effectiveness. Marker is also a pipeline in its own right rather than a self-contained solution, so further tuning of it should be investigated.
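Since no benchmarking has been done yet, a small harness could make it easier to compare marker against the alternatives listed in this issue. The following is a minimal sketch, not part of tldw: `benchmark_extractors` and the `Extractor` type are hypothetical names, and each extractor is assumed to be wrapped as a plain callable that takes a PDF path and returns text. It only records wall time and output length, which are rough proxies and say nothing about layout or table quality.

```python
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Assumption: every tool under test is wrapped as "path in, extracted text out".
Extractor = Callable[[Path], str]


def benchmark_extractors(
    extractors: Dict[str, Extractor],
    pdf_paths: Iterable[Path],
) -> List[dict]:
    """Run every extractor on every PDF and record wall time and output size."""
    results = []
    for pdf_path in pdf_paths:
        for name, extract in extractors.items():
            start = time.perf_counter()
            try:
                text = extract(pdf_path)
                error = None
            except Exception as exc:  # keep going if one tool fails on a file
                text = ""
                error = repr(exc)
            elapsed = time.perf_counter() - start
            results.append(
                {
                    "pdf": str(pdf_path),
                    "extractor": name,
                    "seconds": elapsed,
                    "chars": len(text),
                    "error": error,
                }
            )
    return results
```

A wrapper around marker, or around any of the libraries linked below, would be passed in the `extractors` dict under a name of your choosing. Comparing output against hand-checked reference text would be a more meaningful measure of effectiveness than character counts.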

Articles

Enhancements to the pipeline:

https://huggingface.co/yifeihu/TFT-ID-1.0

Alternatives to look at:

https://github.com/opendatalab/PDF-Extract-Kit

Models

https://huggingface.co/Aryn/deformable-detr-DocLayNet

Unsorted:

https://github.com/pymupdf/RAG

https://github.com/pdf2htmlEX/pdf2htmlEX

https://news.ycombinator.com/item?id=41072632

https://camelot-py.readthedocs.io/en/master/

https://pymupdf4llm.readthedocs.io/en/latest/

https://arxiv.org/abs/2407.08488

Setting up a pipeline for research papers: https://github.com/kermitt2/grobid

Setting up a pipeline for TTRPG books:

Setting up a pipeline for ebook PDFs:
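One way the three per-document-type pipelines above (research papers, TTRPG books, ebook PDFs) might be organised is a small registry that routes each file to its pipeline. This is only a sketch under assumptions: `PIPELINES`, `register`, `ingest`, and the document-type keys are hypothetical names, and the pipeline bodies are placeholders rather than real parsing code.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumption: a pipeline takes a PDF path and returns processed text.
Pipeline = Callable[[Path], str]

# Hypothetical registry mapping a document type to its pipeline.
PIPELINES: Dict[str, Pipeline] = {}


def register(doc_type: str):
    """Decorator that adds a pipeline function to the registry."""
    def decorator(func: Pipeline) -> Pipeline:
        PIPELINES[doc_type] = func
        return func
    return decorator


@register("research_paper")
def research_paper_pipeline(path: Path) -> str:
    # Placeholder: a real version might hand the file to GROBID.
    return f"[research_paper] {path.name}"


@register("ttrpg_book")
def ttrpg_book_pipeline(path: Path) -> str:
    # Placeholder: no tooling has been chosen for this type yet.
    return f"[ttrpg_book] {path.name}"


@register("ebook")
def ebook_pipeline(path: Path) -> str:
    # Placeholder: no tooling has been chosen for this type yet.
    return f"[ebook] {path.name}"


def ingest(path: Path, doc_type: str) -> str:
    """Route a PDF to the pipeline registered for its document type."""
    try:
        pipeline = PIPELINES[doc_type]
    except KeyError:
        raise ValueError(f"unknown document type: {doc_type!r}") from None
    return pipeline(path)
```

Keeping the routing separate from the pipelines means each document type can swap its underlying tool without touching the others.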

Unsorted list:

https://pypi.org/project/pypdf/

Continuing Improvement: Functionality of PDF Ingestion / Chunking / Analysis #93