rmusser01 / tldw

Too Long, Didn't Watch(TL/DW): Your Personal Research Multi-Tool - Open Source NotebookLM
Apache License 2.0

Enhancement: Fix Text Ingestion pipeline #51

Closed rmusser01 closed 1 month ago

rmusser01 commented 1 month ago

This issue tracks the working status of the text ingestion pipeline.

Issues:

  1. Title is not properly grabbed from articles when ingesting. -> DONE
  2. Add the option to manually enter a title for articles. -> Implemented but not working
  3. Add the option to manually enter a title for unstructured text. -> Implemented but not working
  4. Add bot protection bypass/mitigation. -> DONE (simple fix, not a long-term one, but it works)
  5. Potentially look at a headless browser for article scraping, maybe using the current user's cookies/session tokens if applicable? -> DONE (ArchiveBox has a full, multiplatform setup and can be relied on for the client side of things; look at Firecrawl or similar for the 'server' version)
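
For reference, a minimal sketch of how items 1 to 3 could fit together: a manually entered title wins when one is supplied, otherwise the title is taken from the page's `<title>` tag, otherwise a placeholder is used. `resolve_title`, `_TitleParser`, and the `"Untitled"` default are hypothetical names made up for this illustration, not functions from the tldw codebase, and the actual fix may work differently.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is considered.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> str:
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._parts).split())


def resolve_title(content: str,
                  manual_title: Optional[str] = None,
                  default: str = "Untitled") -> str:
    """Pick a title: manual entry first, then the HTML <title>, then a default."""
    if manual_title and manual_title.strip():
        return manual_title.strip()
    parser = _TitleParser()
    parser.feed(content)
    parser.close()
    return parser.title or default


if __name__ == "__main__":
    print(resolve_title("<html><head><title>Some Article</title></head></html>"))
    print(resolve_title("just some pasted notes", manual_title="My Notes"))
```

Putting the manual title first means the same function covers both scraped articles and unstructured text, where there is no `<title>` tag to fall back on.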

rmusser01 commented 1 month ago

All Done.