rmusser01 / tldw

Too Long, Didn't Watch(TL/DW): Your Personal Research Multi-Tool - Open Source NotebookLM
Apache License 2.0

Improvement: Add support for extracting text from different sources #25

Open rmusser01 opened 1 month ago

rmusser01 commented 1 month ago

As a user, I would like to be able to select / upload a document, have the text content of the document extracted, chunked, and then summarized appropriately.
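A minimal sketch of what the chunking step could look like, assuming simple word-based chunks with a small overlap so that context carries across chunk boundaries. The function name `chunk_text` and its parameters `max_words` and `overlap` are hypothetical and not taken from the project.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words. Hypothetical helper,
    not the project's actual chunker.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk could then be summarized on its own, with the per-chunk summaries combined afterwards. Splitting on whitespace is the simplest choice; sentence-aware or token-aware chunking would be a natural refinement.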

I would like to be able to do this with multiple document types, including Word document formats, PDF, and EPUB.

As a user of the GUI, I should be able to open the application's UI and select a file for upload. The file is then uploaded and parsed, confirmed to be an appropriate/matching file type, and its text is extracted, chunked (if the arg is passed), and finally summarized.
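One way the "appropriate/matching file type" check could work is to compare the file's extension against what its contents look like. The sketch below uses only the standard library and relies on two facts: PDF files begin with `%PDF-`, and both DOCX and EPUB are ZIP containers (EPUB has a `mimetype` entry reading `application/epub+zip`, DOCX has a `word/document.xml` entry). The names `detect_document_type`, `matches_extension`, and `EXPECTED_TYPES` are hypothetical.

```python
import io
import zipfile
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file extension to detected type label.
EXPECTED_TYPES = {".pdf": "pdf", ".epub": "epub", ".docx": "docx"}


def detect_document_type(data: bytes) -> Optional[str]:
    """Guess the document type from the file contents, or return None."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):  # ZIP local file header
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
                if ("mimetype" in names
                        and archive.read("mimetype").strip() == b"application/epub+zip"):
                    return "epub"
                if "word/document.xml" in names:
                    return "docx"
        except zipfile.BadZipFile:
            return None
    return None


def matches_extension(filename: str, data: bytes) -> bool:
    """True if the extension is supported and agrees with the contents."""
    expected = EXPECTED_TYPES.get(Path(filename).suffix.lower())
    return expected is not None and expected == detect_document_type(data)
```

This does not cover legacy `.doc` files, which use a different (non-ZIP) container, and it only validates the type; the actual text extraction would still need a format-specific parser.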

As a user of the CLI, I should be able to pass a command-line argument that specifies either a single file or a collection of files listed in a text file as input for summarization (with the option for chunking if the arg is passed).
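A rough sketch of how the CLI side might be wired up with `argparse`. The flag names `--file`, `--file-list`, and `--chunk` are assumptions made for illustration, as are the helper names; the list file is assumed to hold one path per line, with blank lines and `#` comments ignored.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with hypothetical flags for document input."""
    parser = argparse.ArgumentParser(description="Summarize documents.")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--file", type=Path,
                        help="a single document to summarize")
    source.add_argument("--file-list", type=Path,
                        help="text file listing one document path per line")
    parser.add_argument("--chunk", action="store_true",
                        help="chunk the extracted text before summarizing")
    return parser


def parse_file_list(text: str) -> list[Path]:
    """Turn the contents of a list file into paths, skipping blanks and comments."""
    paths = []
    for line in text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            paths.append(Path(entry))
    return paths


def collect_inputs(args: argparse.Namespace) -> list[Path]:
    """Return every document path selected on the command line."""
    if args.file is not None:
        return [args.file]
    return parse_file_list(args.file_list.read_text(encoding="utf-8"))
```

With this layout, an invocation might look like `--file paper.pdf --chunk` or `--file-list sources.txt`. Using a separate flag for the list avoids guessing whether a `.txt` argument is itself a document or a list of documents.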

I figure that since we're 'just' shuffling text back and forth, why not throw in some other text formats as well, to make this an even handier tool for research and study.

~Tracking integration of Website text: Issue #43~ Solved.

Tracking integration of PDF documents: Issue #46

Tracking integration of EPUB files: Issue #47

Tracking integration of Office doc files: Issue #44

rmusser01 commented 1 month ago

https://realpython.com/natural-language-processing-spacy-python/