unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
Apache License 2.0

[DOUBT] Performance expectations #115

Open Sahil-Gulihar opened 1 month ago

Sahil-Gulihar commented 1 month ago

Does it take advantage of multi-threading or something?

unclecode commented 1 month ago

In the new version, everything has been aligned with asynchronous operations. We are using Playwright, and the result is amazing, perhaps the best way to achieve super speed. However, we are currently working on the scraping engine, which promises to be a significant undertaking. We are investing time to optimize multi-processing, multi-threading, and asynchronous techniques to create a robust system. For now, you can use the arun_many function to get simple async I/O parallelism. Anyway, stay tuned for the upcoming release of the scraper, which will deliver a remarkable multitasking experience for crawling numerous links or entire websites.

Right now you can try something like this:

# File: async_webcrawler_multiple_urls_example.py

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Initialize the AsyncWebCrawler
    async with AsyncWebCrawler(verbose=True) as crawler:
        # List of URLs to crawl
        urls = [
            "https://example.com",
            "https://python.org",
            "https://github.com",
            "https://stackoverflow.com",
            "https://news.ycombinator.com"
        ]

        # Set up crawling parameters
        word_count_threshold = 100

        # Run the crawling process for multiple URLs
        results = await crawler.arun_many(
            urls=urls,
            word_count_threshold=word_count_threshold,
            bypass_cache=True,
        )

        # Process the results
        for result in results:
            if result.success:
                print(f"Successfully crawled: {result.url}")
                print(f"Title: {result.metadata.get('title', 'N/A')}")
                print(f"Word count: {len(result.markdown.split())}")
                print(f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}")
                print(f"Number of images: {len(result.media.get('images', []))}")
                print("---")
            else:
                print(f"Failed to crawl: {result.url}")
                print(f"Error: {result.error_message}")
                print("---")

if __name__ == "__main__":
    asyncio.run(main())
Mahizha-N-S commented 1 month ago

Hey, nice to know about this function. I just have a small doubt about the implementation of arun_many: what if I want to write the extracted_content of all the URLs to my docs, and store the failed scrape URLs in a list to pass on to another function? For example, something like this:

    try:
        logger.info(f"Urls:{urls}")
        async with AsyncWebCrawler(verbose=True) as crawler:
            strategy = LLMExtractionStrategy(
                provider="groq/llama-guard-3-8b", 
                api_token=os.getenv('GROQ_API_KEY'), 
                schema=PageSummary.model_json_schema(),
                extraction_type="schema",
                apply_chunking=True,
                instruction=(......
                ),
                bypass_cache=True,
            )
            for url in urls:
                try:
                    result = await crawler.arun(
                        url=url,
                        word_count_threshold=100,
                        extraction_strategy=strategy,
                        bypass_cache=True,
                        process_iframe=True,
                    )

                    if result.success:
                        try:
                            extracted_content_json = json.loads(result.extracted_content)
                            save_to_word_incremental(extracted_content_json, url, docx_path)
                            remaining_urls.remove(url)
                            logger.info(f"remaining_urls: {remaining_urls}")
                        except json.JSONDecodeError as e:
                            logger.error(f"Failed to decode extracted content for {url}: {e}")
                    else:
                        logger.error(f"Error in crawling {url}, {result.error_message}")

                except Exception as crawl_error:
                    logger.error(f"Failed to crawl {url}, error: {crawl_error}--> redirecting to different webscraping service")
                    await scrap_webpages(remaining_urls, docx_path)
                    return docx_path
        return docx_path

    except Exception as e:
        logger.error(f"Unexpected error with webscraping service {e}")
        return docx_path

It's something like this. Can I just proceed with replacing it with arun_many? In results we get all the details, right? So from that, should I append all the extracted_content and add it to the doc? Am I right in my understanding of the flow?

unclecode commented 1 month ago

@Mahizha-N-S If I understand your question correctly, you can still call the arun_many method to perform multiple parallel crawls, which is similar to what you're doing here. Instead of using a for-loop, it utilizes parallelism with asynchronous operations. At the end of the day, you'll have a list of all crawl results, and you can check the success flag to see whether each one is true or false. Those that are successful can be added to your file, while those that fail can be addressed using different solutions. Each failure also comes with an error_message property to explain what specifically went wrong. Let me know if this is helpful.
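
For reference, here is a rough sketch of that pattern. It assumes arun_many forwards extraction_strategy, bypass_cache, and process_iframe the same way arun does in your installed version (please verify), and it reuses your PageSummary model, save_to_word_incremental, scrap_webpages, and logger from the snippet above:

# File: arun_many_with_extraction_sketch.py
# Rough sketch only -- assumes arun_many accepts the same keyword arguments as arun.

import json
import os

from crawl4ai import AsyncWebCrawler, LLMExtractionStrategy


async def crawl_and_save(urls, docx_path):
    failed_urls = []

    strategy = LLMExtractionStrategy(
        provider="groq/llama-guard-3-8b",
        api_token=os.getenv("GROQ_API_KEY"),
        schema=PageSummary.model_json_schema(),  # your Pydantic model from the snippet above
        extraction_type="schema",
        apply_chunking=True,
        instruction="...",  # your instruction text
    )

    async with AsyncWebCrawler(verbose=True) as crawler:
        # One parallel call replaces the for-loop over crawler.arun
        results = await crawler.arun_many(
            urls=urls,
            word_count_threshold=100,
            extraction_strategy=strategy,  # assumed to be forwarded like in arun
            bypass_cache=True,
            process_iframe=True,  # carried over from your snippet
        )

    for result in results:
        if result.success:
            try:
                extracted_content_json = json.loads(result.extracted_content)
                # your helper from the snippet above
                save_to_word_incremental(extracted_content_json, result.url, docx_path)
            except json.JSONDecodeError as e:
                logger.error(f"Failed to decode extracted content for {result.url}: {e}")
                failed_urls.append(result.url)
        else:
            logger.error(f"Error in crawling {result.url}: {result.error_message}")
            failed_urls.append(result.url)

    if failed_urls:
        # Hand anything that failed to your fallback webscraping service
        await scrap_webpages(failed_urls, docx_path)

    return docx_path

The only real change from your version is that the per-URL try/except around crawler.arun disappears: arun_many returns one result object per URL, so success and failure are read from result.success and result.error_message after the single call.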