rmusser01 / tldw

Too Long, Didn't Watch (TL/DW): Your Personal Research Multi-Tool - Open Source NotebookLM (eventually)
Apache License 2.0

Enhancement: Add recursive scraping somehow? #158

Closed - rmusser01 closed this 1 week ago

rmusser01 commented 1 month ago

Basically: scrape the Phrack archive as an example and turn the articles into tagged entries. Maybe create a script to parse the links and then dump each set in?

Solution:

  1. Crawling + archiving all content under a given URL (with a configurable top-level URL, so you can scrape google.com vs. google.com/blog/ - this should handle the Phrack use case). See the crawl sketch below.
  2. Option to use some tool X to generate a sitemap (also with a configurable top-level URL), and then crawl + archive all content noted in the sitemap.
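
A minimal sketch of what option 1 could look like, assuming `requests` and `BeautifulSoup` are available; the function name `crawl_under`, the page limit, and the prefix check are illustrative, not the actual implementation:

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl_under(top_level_url: str, max_pages: int = 500) -> dict[str, str]:
    """Breadth-first crawl that archives every page whose URL starts
    with top_level_url (e.g. 'https://google.com/blog/')."""
    seen = {top_level_url}
    queue = deque([top_level_url])
    archive: dict[str, str] = {}  # url -> raw HTML

    while queue and len(archive) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable pages
        archive[url] = resp.text

        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            # Resolve relative links and drop #fragments.
            link, _ = urldefrag(urljoin(url, a["href"]))
            # Only follow links under the configured top-level URL.
            if link.startswith(top_level_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return archive
```

The `startswith` check is what makes the top-level URL configurable: passing `https://google.com/blog/` scopes the crawl to the blog, while `https://google.com/` crawls the whole site.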
rmusser01 commented 3 weeks ago

Kicking this to v5. This is going to require a little research + work.

rmusser01 commented 1 week ago

Revisiting; there are 3 different features here:

  1. Easy: just do depth-limited crawling up to some user-configurable limit X, treat all of that text as one entry, and use separators/page names to indicate demarcation. Use tags to mark it as a collection from site X (see the merge sketch after this list).
  2. Similar to the above, but with unique identification of individual articles.
  3. Generate and use a sitemap.
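
For feature 1, a rough sketch (hypothetical names, reusing the `archive` dict shape from the crawl sketch above) of merging everything into one entry with page-name demarcation and a collection tag:

```python
from bs4 import BeautifulSoup

def merge_into_entry(archive: dict[str, str], site: str) -> dict:
    """Concatenate every crawled page into a single entry, with a
    per-page header marking the demarcation, tagged as a collection."""
    sections = []
    for url, html in archive.items():
        # Strip markup so the entry is plain text.
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        sections.append(f"----- {url} -----\n{text}")
    return {
        "content": "\n\n".join(sections),
        "tags": [f"collection:{site}", "web-scrape"],
    }
```

Feature 2 would instead emit one such dict per URL, identified by something unique (the URL itself, or a title extracted from the page).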

So, will offer two new options:

  1. Sitemap creation + archiving all content under said sitemap (with a configurable top-level URL, so you can scrape google.com vs. google.com/blog/ - this should handle the Phrack use case).
  2. Option to use some tool X to generate a sitemap, and then crawl + archive all content noted in the sitemap (see the sitemap-parsing sketch below).
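
For option 2, the consumption half only needs to parse whatever the chosen generator emits; assuming the standard sitemap.org XML schema, a hypothetical `urls_from_sitemap` helper could look like:

```python
import xml.etree.ElementTree as ET

import requests

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(sitemap_url: str) -> list[str]:
    """Fetch a sitemap.xml and return every <loc> URL it lists."""
    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]
```

Each returned URL can then be fetched and archived with the same loop as the crawl sketch, skipping link discovery entirely.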
rmusser01 commented 1 week ago

https://github.com/Skyvern-AI/skyvern/blob/0d39e62df6c516e0aaf14e570139d12ca86cebfe/skyvern/webeye/scraper/domUtils.js#L66
https://github.com/philippe2803/contentmap
https://github.com/c4software/python-sitemap

rmusser01 commented 1 week ago

Need to test/tweak, but it's in.