rmusser01 / tldw

tl/dw (Too Long, Didn't Watch): Your Personal Research Multi-Tool - a naive attempt at 'A Young Lady's Illustrated Primer'
Apache License 2.0
383 stars, 12 forks

Tracking: Web Scraping Ingestion Pipeline #384

Status: Open. rmusser01 opened 1 month ago

rmusser01 commented 1 month ago

This issue tracks efforts to improve the web scraping ingestion pipeline.

Other Scraper Implementations:

rmusser01 commented 3 weeks ago

https://github.com/unclecode/crawl4ai

rmusser01 commented 1 week ago

https://github.com/JohannesKaufmann/html-to-markdown