unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
Apache License 2.0
16.58k stars 1.23k forks

Question: Scrapy #263

Closed luisferreira93 closed 1 week ago

luisferreira93 commented 1 week ago

Hello! Great work with crawl4ai 👍🏻
Is it possible to integrate crawl4ai with scrapy? Do you have any code samples?

Thank you in advance

luisferreira93 commented 5 days ago

Sample code:

from crawl4ai import AsyncWebCrawler
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = "scrapy_integration"
    start_urls = ["https://crawler-test.com/links/page_with_external_links"]
    allowed_domains = ["crawler-test.com"]
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    async def parse_item(self, response):
        # Scrapy handles link discovery and scheduling; each discovered URL
        # is then handed to crawl4ai for rendering and markdown extraction.
        async with AsyncWebCrawler(verbose=False) as crawler:
            result = await crawler.arun(url=response.url)
            print(result.markdown)
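One caveat worth adding (an assumption about the project setup, not something stated in the thread): crawl4ai is asyncio-based, and Scrapy only awaits `async def` callbacks like `parse_item` when it runs on the asyncio-backed Twisted reactor. A minimal settings sketch:

```python
# settings.py -- lets Scrapy await async callbacks that use asyncio
# libraries such as crawl4ai (requires Scrapy >= 2.0)
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Without this setting, Scrapy's default reactor may raise an error or silently fail to drive the coroutine.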

unclecode commented 5 days ago

@luisferreira93 Thanks for using the library, and I appreciate you sharing the sample here for other people who may need this kind of help. FYI, we are releasing two important components by the end of this year. The first is an executor pipeline, which is very efficient and adaptively uses whatever resources are available to crawl multiple URLs at the same time. The second is a scraper that takes a website and uses graph search algorithms to extract everything across all of its layers. With those in place, I believe you could wrap the whole process inside the crawl4ai library. Please stay tuned, and I will update you in a follow-up comment soon.
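Until that executor pipeline ships, multiple URLs can already be crawled concurrently with plain asyncio. The sketch below shows only the bounded-concurrency pattern; `fake_fetch` is a stand-in, not crawl4ai's API (in a real spider you would call something like `AsyncWebCrawler.arun` inside it):

```python
import asyncio

async def crawl_many(urls, fetch, max_concurrency=5):
    """Run fetch(url) for every URL, capping how many run at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:  # at most max_concurrency fetches in flight
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(u) for u in urls))

async def fake_fetch(url):
    # Stand-in for a real crawler call; a real implementation would
    # return the rendered page's markdown or HTML.
    await asyncio.sleep(0)
    return f"content of {url}"

if __name__ == "__main__":
    urls = [f"https://example.com/page{i}" for i in range(3)]
    pages = asyncio.run(crawl_many(urls, fake_fetch))
    print(pages)
```

Recent crawl4ai versions also expose a batch method (`arun_many`) that covers the simple case; the pattern above is useful when you need custom throttling or per-URL handling.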