unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0
16.38k stars 1.2k forks

Scraping linked URLs #241

Closed: QuangTQV closed this 2 weeks ago

QuangTQV commented 2 weeks ago

Is there an optimized way to scrape the URLs linked from a website so that only the necessary links are followed (same domain, no ad links) and duplicates are avoided (a link that has already been scraped is not scraped again)? Speed is a priority.

unclecode commented 2 weeks ago

Hi @QuangTQV, thank you for using the library. You're right, which is why we're releasing a scraper module soon, with efficiency as the priority. It supports different algorithms and strategies, with various filters and techniques for optimized crawling and scraping. It's still under testing and review, but we hope to have it ready within weeks, likely before year-end. For now, we have a perfect way to crawl any URL and produce good markdown, and we can crawl multiple URLs simultaneously via parallelism, though this isn't full scraping. The upcoming scraper will address duplicate avoidance and speed optimization, as well as memory and CPU usage. Stay tuned for its release.
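
Until that module lands, the requirements in the question (same domain only, no ad links, no repeats, parallel fetching) can be approximated with a small breadth-first loop. The following is a minimal sketch using only the Python standard library, not the upcoming Crawl4AI scraper. The names `normalize`, `is_wanted`, `crawl_site`, and `AD_PATH_HINTS` are hypothetical, and `fetch_links` is a placeholder for whatever coroutine returns the links found on a page (for example, a thin wrapper around a crawler call). Trimming trailing slashes during normalization is also an assumption; some sites treat `/b` and `/b/` as different pages.

```python
import asyncio
from typing import Awaitable, Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit

# Hypothetical path fragments treated as ad links; adjust for the target site.
AD_PATH_HINTS = ("/ads/", "/sponsored/")

# Placeholder type: a coroutine that returns the raw hrefs found on a page.
FetchLinks = Callable[[str], Awaitable[Iterable[str]]]


def normalize(url: str, base: str) -> str:
    """Resolve against base, drop the fragment, lowercase scheme/host, trim trailing slash."""
    absolute, _fragment = urldefrag(urljoin(base, url))
    parts = urlsplit(absolute)
    path = parts.path.rstrip("/") or "/"  # assumption: /b and /b/ are the same page
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def is_wanted(url: str, root_host: str) -> bool:
    """Keep only http(s) links on the starting host that do not look like ads."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    if parts.netloc != root_host:
        return False
    return not any(hint in parts.path for hint in AD_PATH_HINTS)


async def crawl_site(
    start_url: str,
    fetch_links: FetchLinks,
    max_pages: int = 100,
    concurrency: int = 5,
) -> list[str]:
    """Breadth-first crawl that fetches each normalized URL at most once."""
    start = normalize(start_url, start_url)
    root_host = urlsplit(start).netloc
    seen = {start}          # every URL ever queued, so nothing is fetched twice
    frontier = [start]      # URLs to fetch in the current level
    visited: list[str] = []
    limit = asyncio.Semaphore(concurrency)  # caps simultaneous fetches

    async def fetch(url: str) -> tuple[str, Iterable[str]]:
        async with limit:
            return url, await fetch_links(url)

    while frontier and len(visited) < max_pages:
        batch = frontier[: max_pages - len(visited)]
        results = await asyncio.gather(*(fetch(u) for u in batch))
        frontier = []
        for url, links in results:
            visited.append(url)
            for link in links:
                candidate = normalize(link, url)
                if candidate not in seen and is_wanted(candidate, root_host):
                    seen.add(candidate)
                    frontier.append(candidate)
    return visited
```

The `seen` set is filled when a URL is queued rather than when it is fetched, so two pages in the same batch that link to the same target cannot both schedule it. The semaphore bounds parallelism, which is the main lever for trading speed against load on the target site.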