Closed: QuangTQV closed this issue 2 weeks ago
Is there an optimized way to scrape the URLs linked from a website, following only the necessary links (same domain, no ad links) and avoiding duplicates (if a link has already been scraped, don't scrape it again)? Speed must be optimized.

Hi @QuangTQV, thank you for using the library. You're right, which is why we're releasing a scraper module soon that prioritizes efficiency. It supports different algorithms and strategies, bringing various filters and techniques for optimized crawling and scraping. It's currently under testing and review, but we hope to have it ready within weeks, likely before year-end. For now, the library can crawl any single URL and produce clean markdown, and it can crawl multiple URLs concurrently through parallelism, though that isn't full site scraping. The upcoming scraper module will address duplication avoidance and speed optimization, as well as memory and CPU usage. Stay tuned for its release.
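Until the scraper module ships, the pattern the reply hints at (parallel fetching plus a visited set for deduplication and a same-domain filter) can be sketched with the Python standard library alone. This is a minimal illustration, not this library's API: the regex-based link extraction, the worker/queue layout, and the `example.com` seed are all assumptions made for the sketch, not production-grade choices.

```python
# Minimal sketch (NOT this library's API): parallel same-domain crawl
# with a visited set so no URL is fetched twice. Stdlib only; the
# regex link extraction is illustrative, not real HTML parsing.
import asyncio
import re
import urllib.request
from urllib.parse import urljoin, urldefrag, urlparse

HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def fetch(url: str) -> str:
    """Blocking fetch; run in a thread so workers stay concurrent."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

async def crawl(seed: str, max_pages: int = 50, workers: int = 8) -> set[str]:
    domain = urlparse(seed).netloc
    visited: set[str] = set()           # dedup: never enqueue a URL twice
    queue: asyncio.Queue[str] = asyncio.Queue()
    visited.add(seed)
    await queue.put(seed)

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                html = await asyncio.to_thread(fetch, url)
                for raw in HREF_RE.findall(html):
                    link, _ = urldefrag(urljoin(url, raw))  # resolve + drop #fragment
                    p = urlparse(link)
                    # keep same-domain http(s) links only, which also
                    # filters out external ad/tracker links
                    if (p.scheme in ("http", "https") and p.netloc == domain
                            and link not in visited and len(visited) < max_pages):
                        visited.add(link)
                        await queue.put(link)
            except Exception:
                pass                    # skip unreachable pages
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()                  # wait until the frontier drains
    for t in tasks:
        t.cancel()
    return visited

if __name__ == "__main__":
    pages = asyncio.run(crawl("https://example.com"))
    print(f"crawled {len(pages)} pages")
```

The key speed lever here is that deduplication happens at enqueue time, before fetching, so the visited set bounds total network work; the worker pool then keeps several fetches in flight at once. The announced scraper module presumably applies the same ideas with more sophisticated filters.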