Improvement: Add ability to ingest PDF Documents

rmusser01 commented 5 months ago

As a user, I would like to be able to select / upload a PDF document, have the text content of the document extracted, chunked (if necessary), and then summarized appropriately. The result should then be ingested into the DB, with the option of adding keywords to the document.
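To make the "chunked (if necessary)" step concrete, here is a rough sketch of one way it could work: splitting the extracted text into overlapping word windows so that each piece fits a summarizer's input limit. The function name `chunk_text`, its parameters, and the default sizes are all hypothetical and not taken from the tldw codebase.

```python
def chunk_text(text, max_words=500, overlap=50):
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of the two neighbouring chunks.
    Illustrative helper only; names and defaults are assumptions.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

A short document comes back as a single chunk, so the caller can always summarize per chunk and only needs a second "summary of summaries" pass when more than one chunk is returned.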

Tool I used: https://github.com/VikParuchuri/marker

Document Layout Analysis https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis

PDF Tools https://www.pdftool.org/en https://github.com/Stirling-Tools/Stirling-PDF

Langchain https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/

Another tool that seemed interesting, but that I didn't test: https://github.com/apache/tika

rmusser01 commented 4 months ago

Can't just use marker-pdf directly: its deps conflict with the project's. Will have to do a dirty hack of installing it in a separate venv and calling out to it when needed.
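A minimal sketch of what the separate-venv callout might look like, assuming the tool is installed into its own virtual environment and exposes a console script there. The helper names `venv_executable` and `run_in_venv` are hypothetical, and the script name and arguments are left to the caller rather than hard-coded.

```python
import os
import subprocess
from pathlib import Path


def venv_executable(venv_dir, name):
    """Return the path of a console script inside a virtual environment."""
    venv_dir = Path(venv_dir)
    if os.name == "nt":
        # Windows venvs keep their scripts under Scripts\ with an .exe suffix.
        return venv_dir / "Scripts" / f"{name}.exe"
    return venv_dir / "bin" / name


def run_in_venv(venv_dir, name, args, timeout=None):
    """Run a console script from another venv as a child process.

    The child uses that venv's interpreter and packages, so its
    dependencies never have to be importable from the main application.
    """
    exe = venv_executable(venv_dir, name)
    if not exe.exists():
        raise FileNotFoundError(f"{exe} not found; is the venv set up?")
    return subprocess.run(
        [str(exe), *map(str, args)],
        check=True,            # raise CalledProcessError on a non-zero exit
        capture_output=True,   # collect stdout/stderr for logging
        text=True,
        timeout=timeout,
    )
```

For marker this would be called with the console script that the marker-pdf package installs (assumed here to be `marker_single`) plus the input PDF path. The exact arguments differ between marker releases, so check the script's `--help` output for the installed version.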

rmusser01 commented 4 months ago

Closing since marker should work, and I've confirmed text file ingestion.

rmusser01 / tldw

Improvement: Add ability to ingest PDF Documents #46