rmusser01 / tldw

tl/dw (Too Long, Didn't Watch): Your Personal Research Multi-Tool - a naive attempt at 'A Young Lady's Illustrated Primer'
Apache License 2.0

Enhancement: Fix Text Ingestion pipeline #51

Closed rmusser01 closed 5 months ago

rmusser01 commented 5 months ago

This issue tracks the working status of the text ingestion pipeline.

Issues:

  1. Title is not properly grabbed from articles when ingesting. -> DONE
  2. Add an option to manually enter a title for articles. -> Implemented, but not working
  3. Add an option to manually enter a title for unstructured text. -> Implemented, but not working
  4. Add bot-protection bypass/mitigation. -> DONE (a simple fix, not a long-term one, but it works)
  5. Potentially look at a headless browser for article scraping, maybe using the current user's cookies/session tokens where applicable? -> DONE (ArchiveBox has a full, multi-platform setup and can be relied on for the client side; look at Firecrawl or similar for the 'server' version)
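
For items 1 through 3, here is a rough sketch of how title resolution could be ordered: a manually entered title wins, then the page's `<title>` element, then a placeholder. This is not the project's actual code. `resolve_title`, `_TitleParser`, and the `"Untitled"` fallback are hypothetical names used only for illustration, and it relies solely on Python's standard-library `html.parser`.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleParser(HTMLParser):
    """Hypothetical helper: collects the text of the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is considered.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> str:
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._parts).split())


def resolve_title(
    manual_title: Optional[str],
    html: Optional[str] = None,
    fallback: str = "Untitled",  # assumed placeholder, not from the project
) -> str:
    """Pick a title: manual entry first, then <title>, then the fallback."""
    if manual_title and manual_title.strip():
        return manual_title.strip()
    if html:
        parser = _TitleParser()
        parser.feed(html)
        parser.close()
        if parser.title:
            return parser.title
    return fallback
```

For unstructured text there is no markup to parse, so passing `html=None` (or plain text) means only the manual title or the fallback applies. If the manual option is "implemented but not working", one thing worth checking is whether the user-supplied value actually reaches a step like this or is overwritten later by the scraped title.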

rmusser01 commented 5 months ago

All Done.