rmusser01 / tldw

tl/dw (Too Long, Didn't Watch): Your Personal Research Multi-Tool - a naive attempt at 'A Young Lady's Illustrated Primer'
Apache License 2.0

Improvement: Add ability to ingest .epub format books #47

Closed rmusser01 closed 4 months ago

rmusser01 commented 5 months ago

As a user, I would like to be able to select / upload an epub document/book, have the text content of the document extracted, and ingested into the database with keyword tagging.
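The ingestion step described above could be sketched roughly as follows. This is only an illustration using Python's built-in `sqlite3`; the table names (`documents`, `keywords`, `document_keywords`) and the functions `init_db` and `ingest_text` are hypothetical and are not taken from the tldw codebase.

```python
import sqlite3


def init_db(conn):
    """Create a minimal, hypothetical schema: documents, keywords, and a link table."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            content TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS keywords (
            id INTEGER PRIMARY KEY,
            keyword TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS document_keywords (
            document_id INTEGER NOT NULL REFERENCES documents(id),
            keyword_id INTEGER NOT NULL REFERENCES keywords(id),
            PRIMARY KEY (document_id, keyword_id)
        );
        """
    )


def ingest_text(conn, title, content, keywords):
    """Store extracted book text and tag it with normalized, de-duplicated keywords."""
    cur = conn.execute(
        "INSERT INTO documents (title, content) VALUES (?, ?)", (title, content)
    )
    doc_id = cur.lastrowid
    # Normalize keywords so "Fiction" and "fiction " map to the same tag.
    for kw in {k.strip().lower() for k in keywords if k.strip()}:
        conn.execute("INSERT OR IGNORE INTO keywords (keyword) VALUES (?)", (kw,))
        kw_id = conn.execute(
            "SELECT id FROM keywords WHERE keyword = ?", (kw,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO document_keywords (document_id, keyword_id) "
            "VALUES (?, ?)",
            (doc_id, kw_id),
        )
    conn.commit()
    return doc_id
```

The link table keeps keywords reusable across books, which should make a later search-by-tag query a simple join.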

I would also like the ability to search across it and, in the future, use it for some sort of Q/A via RAG. (I will split that off into its own issue once epub ingestion is supported.)

Calibre CLI (Calibre CLI -> convert X book format to raw text -> ingest into DB):
https://manual.calibre-ebook.com/generated/en/ebook-convert.html
https://manual.calibre-ebook.com/conversion.html#conversion
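A rough sketch of the Calibre route, assuming Calibre is installed and its `ebook-convert` tool is on the PATH. `ebook-convert` infers the output format from the output file's extension, so a `.txt` target yields plain text. The function names here are hypothetical, and UTF-8 output is an assumption about the converter's settings.

```python
import shutil
import subprocess
from pathlib import Path


def build_calibre_command(epub_path, txt_path):
    """Build the ebook-convert invocation: input file first, output file second."""
    return ["ebook-convert", str(epub_path), str(txt_path)]


def epub_to_text_with_calibre(epub_path, txt_path):
    """Convert an epub to plain text with Calibre and return the resulting text."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is Calibre installed?")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_calibre_command(epub_path, txt_path), check=True)
    # Assumes the converter wrote UTF-8; undecodable bytes are replaced, not fatal.
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

The returned string would then be handed to whatever ingestion function writes to the database.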

Existing projects: https://github.com/Medusa-ML/Book-Summarizer/tree/main

https://unix.stackexchange.com/questions/647686/convert-epub-to-txt-and-preserve-original-formatting

rmusser01 commented 4 months ago

Will use pandoc and have the user supply the converted txt file: https://pandoc.org/MANUAL.html#epubs

    pandoc -f epub -t plain -o filename.txt filename.epub
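On the application side, this approach might look something like the sketch below: accept only an already-converted `.txt` file, and point the user at the pandoc command if they supply an `.epub` instead. The function name `load_converted_text` and the error messages are hypothetical, not part of tldw.

```python
from pathlib import Path

# Same pandoc invocation as in the comment above, with the file names filled in.
PANDOC_HINT = "pandoc -f epub -t plain -o {stem}.txt {name}"


def load_converted_text(path):
    """Read a user-supplied .txt file; reject .epub with a conversion hint."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".epub":
        hint = PANDOC_HINT.format(stem=path.stem, name=path.name)
        raise ValueError(f"Convert the book first, e.g.: {hint}")
    if suffix != ".txt":
        raise ValueError(f"Expected a .txt file, got: {path.name}")
    # Assumes UTF-8 text; undecodable bytes are replaced rather than raising.
    return path.read_text(encoding="utf-8", errors="replace")
```

This keeps pandoc out of the application's dependencies while still telling the user exactly what to run.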

rmusser01 commented 4 months ago

Closing this as I don't want to integrate pandoc right now. I don't feel it's a big ask to have users convert the file themselves, considering everything else necessary to get this working...