rmusser01 / tldw

tl/dw (Too Long, Didn't Watch): Your Personal Research Multi-Tool - a naive attempt at 'A Young Lady's Illustrated Primer'
Apache License 2.0

Improvement: Add ability to ingest website articles and free form text input as an 'Article' form of media for ingestion #43

Closed by rmusser01 5 months ago

rmusser01 commented 5 months ago

As a user, I would like to be able to ingest an article in either of two ways.

From a URL: I input a URL and the corresponding web page is scraped for an article. The article is displayed to me for confirmation, with an option to tag it and an option to have it summarized. If summarization is selected, the summary is displayed to me. The article text, the tags, and the summary (if one was generated) are stored in the DB.

From free-form text: I input a block of unstructured text into a text input area, and the following operations are performed on it:

  1. Summarization of said text;
  2. Ingestion of said text as an 'Article' type of media into the DB, with a user-provided Name/Author; if none is provided, the 'Name' value is the ingestion time and the 'Author' value is 'None';
  3. Adding of keywords to said raw text for tagging upon ingestion.
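
A minimal sketch of the ingestion and tagging steps listed above, to make the defaulting rules concrete. This is not the project's actual code: the function `ingest_article`, the `media` and `media_keywords` tables, and their columns are all hypothetical names chosen for illustration, and keyword normalization (trimming, lowercasing, de-duplicating) is an assumption the request does not specify. Summarization is left to the caller and passed in as an optional string.

```python
import sqlite3
from datetime import datetime, timezone


def ingest_article(conn, text, name=None, author=None, keywords=(), summary=None, url=None):
    """Store `text` as an 'Article' media row and return its row id.

    Hypothetical schema; only the Name/Author defaults come from the request.
    """
    if not text or not text.strip():
        raise ValueError("article text must not be empty")

    ingested_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    name = name or ingested_at   # default 'Name' is the ingestion time
    author = author or "None"    # default 'Author' is the literal string 'None'

    # Assumed normalization: trim, lowercase, drop blanks and duplicates (order kept).
    tags = list(dict.fromkeys(k.strip().lower() for k in keywords if k.strip()))

    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            """CREATE TABLE IF NOT EXISTS media (
                   id INTEGER PRIMARY KEY,
                   type TEXT NOT NULL,
                   name TEXT NOT NULL,
                   author TEXT NOT NULL,
                   url TEXT,
                   content TEXT NOT NULL,
                   summary TEXT,
                   ingested_at TEXT NOT NULL
               )"""
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS media_keywords (media_id INTEGER, keyword TEXT)"
        )
        cur = conn.execute(
            "INSERT INTO media (type, name, author, url, content, summary, ingested_at) "
            "VALUES ('Article', ?, ?, ?, ?, ?, ?)",
            (name, author, url, text, summary, ingested_at),
        )
        media_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO media_keywords (media_id, keyword) VALUES (?, ?)",
            [(media_id, tag) for tag in tags],
        )
    return media_id
```

Calling `ingest_article(sqlite3.connect(":memory:"), "pasted text", keywords=["AI", "web"])` would store the text with the ingestion timestamp as its name. The same function could serve the URL path by passing the scraped article text and the source `url`.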

This option should be present in both the GUI and CLI.
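
On the CLI side, one possible shape for this option is sketched below with `argparse`. The program name `tldw-article` and every flag (`--url`, `--text`, `--name`, `--author`, `--keywords`, `--summarize`) are assumptions made for illustration, not the project's actual interface. The sketch only covers argument parsing; scraping, summarization, and storage are out of scope here.

```python
import argparse


def build_parser():
    """Build a hypothetical CLI for article ingestion (all flag names are assumed)."""
    parser = argparse.ArgumentParser(
        prog="tldw-article",
        description="Ingest a web article or free-form text as an 'Article'.",
    )
    # Exactly one input source: a URL to scrape, or raw text.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--url", help="URL of the web page to scrape")
    source.add_argument("--text", help="block of unstructured text to ingest")

    parser.add_argument("--name", help="article name (defaults to ingestion time)")
    parser.add_argument("--author", help="article author (defaults to 'None')")
    parser.add_argument("--keywords", default="", help="comma-separated tags")
    parser.add_argument("--summarize", action="store_true", help="also summarize the article")
    return parser


def parse_keywords(raw):
    """Split a comma-separated keyword string into a clean list."""
    return [k.strip() for k in raw.split(",") if k.strip()]
```

An invocation such as `tldw-article --url https://example.com/post --keywords "ai, web" --summarize` would then yield a namespace with `url` set, `text` left as `None`, and `summarize` true. Making the two sources mutually exclusive mirrors the "URL or text block" wording of the request.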

rmusser01 commented 5 months ago

Websites:
  - trafilatura - https://trafilatura.readthedocs.io/en/latest/quickstart.html
  - https://huggingface.co/spaces/Paul-Joshi/website-summarizers-RAG/blob/main/app.py
  - https://www.tldrthis.com/
  - https://github.com/seandearnaley/reddit-gpt-summarizer
  - https://medium.com/unstructured-io/summarize-webpages-in-ten-lines-of-code-with-unstructured-langchain-ce257cc1726d
  - https://github.com/VikParuchuri/surya

rmusser01 commented 5 months ago

Hesitantly done.