unclecode / crawl4ai

πŸ”₯πŸ•·οΈ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper

Error scraping some pages #150

Closed b-sai closed 1 month ago

b-sai commented 1 month ago

The following code:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78")

asyncio.run(main())

Returns:

Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.
[LOG] 🌀️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] πŸ•ΈοΈ Crawling https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78 using AsyncP
laywrightCrawlerStrategy...
[LOG] βœ… Crawled https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78 successfully!
[LOG] πŸš€ Crawling done for https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78, success: True, time taken: 0.60 seconds
[ERROR] 🚫 Failed to crawl https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78, error: Failed to extract content from the website: https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78, error: can only concatenate str (not "NoneType") to str

Would love to know how I can fix it.

Making a simple Python GET request seems to return the HTML, so it doesn't seem to be an access issue caused by headers and such.
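
For reference, that check looks something like this (a minimal sketch using the requests library; nothing here is specific to Crawl4AI):

import requests

# Fetch the raw page HTML without a browser. For client-side-rendered
# pages this returns only the static shell; the visible content is
# filled in later by JavaScript running in the browser.
url = "https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78"
response = requests.get(url)
print(response.status_code)  # 200, so access is not blocked
print(len(response.text))    # the HTML shell comes back fine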

unclecode commented 1 month ago

Hi @b-sai, thank you for using our library. Let me show you how to handle this one. Certain websites render their content entirely on the client side: after the initial page loads, they call the backend, get JSON data, and only then render the page. This page, along with the links you are searching for, is one of them.

There are different ways to handle this with our library. One easy way is to use wait_for to wait for a certain criterion, such as a specific element being constructed on the page, before crawling. When I checked the website you shared, I saw that there is a single HTML element wrapping the content, with the id "overview". So in the following code, I make sure that this element already exists on the page before we extract the content.

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78",
            bypass_cache=True,
            wait_for="css:#overview",  # wait until the #overview element exists
        )
        # print(result.markdown)  # inspect the extracted content

asyncio.run(main())

Output:

[LOG] 🌀️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] πŸ•ΈοΈ Crawling https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78 using AsyncPlaywrightCrawlerStrategy...
[LOG] βœ… Crawled https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78 successfully!
[LOG] πŸš€ Crawling done for https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78, success: True, time taken: 1.88 seconds
[LOG] πŸš€ Content extracted for https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78, success: True, time taken: 0.04 seconds
[LOG] πŸ”₯ Extracting semantic blocks for https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78, Strategy: AsyncWebCrawler
[LOG] πŸš€ Extraction done for https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78, time taken: 0.04 seconds.

If you run this, you won't encounter any errors. You can also define other criteria: by passing a JS function, Crawl4AI will execute it and wait until it returns true (see the sketch after the hook example below). You may also use the hooks we have implemented in the crawler, which run at certain points in the crawling timeline, and inject your own code there, for example to apply a custom delay. There are multiple ways to achieve this; you'll find them in the examples folder in our documentation, or if you have any questions, please feel free to ask here.

import asyncio
from crawl4ai import AsyncWebCrawler

async def on_execution_started(page):
    await asyncio.sleep(2)  # Give the page extra time to load
    # Other tasks you may do here

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
        # Rest of code...
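
For the JS-function criterion mentioned above, the call could look like the following sketch (the predicate here just mirrors the CSS condition and is an illustrative assumption about the page):

result = await crawler.arun(
    url="https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78",
    bypass_cache=True,
    # Crawl4AI evaluates this function in the page context and waits
    # until it returns true before extracting content.
    wait_for="js:() => document.querySelector('#overview') !== null"
)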

I hope this is helpful. By the way, thank you so much for reporting this issue. We will try to make our error messages a bit better, with suggestions, and in upcoming versions we're going to add smart solutions that detect such situations and apply relevant waiting delays. We don't want to do this by default because our goal is very fast crawling, and not all websites are like this, but we're working on a way to detect this issue. For now, wait_for is your solution.

b-sai commented 1 month ago

Thanks so much for this guidance, super helpful and very thorough! The smart detection would be super useful!

Mahizha-N-S commented 1 month ago

I tried your solution, and this is my console output:

[ERROR] 🚫 Failed to crawl https://mercedes-benz.com/en/, error: Wait condition failed: Timeout after 30000ms waiting for selector '#overview'
{"level":"ERROR","time":"Tue Oct 15 2024 16:36:16 IST+0530","name":"FastAPI Python Server For ","msg":"Error in crawling https://mercedes-benz.com/en/, Wait condition failed: Timeout after 30000ms waiting for selector '#overview'"}

What about this type of website?

[LOG] 🌀️ Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] πŸ•ΈοΈ Crawling https://mercedes-benz.com/en/ using AsyncPlaywrightCrawlerStrategy...
[LOG] βœ… Crawled https://mercedes-benz.com/en/ successfully!
[LOG] πŸš€ Crawling done for https://mercedes-benz.com/en/, success: True, time taken: 7.95 seconds
[ERROR] 🚫 Failed to crawl https://mercedes-benz.com/en/, error: Failed to extract content from the website: https://mercedes-benz.com/en/, error: can only concatenate str (not "NoneType") to str
{"level":"ERROR","time":"Tue Oct 15 2024 16:31:02 IST+0530","name":"FastAPI Python Server For","msg":"Error in crawling https://mercedes-benz.com/en/, Failed to extract content from the website: https://mercedes-benz.com/en/, error: can only concatenate str (not "NoneType") to str"}

I am using this with the LLMStrategy method, with an instruction.

Also, how do I stop the crawl when it takes too long, so that I can switch to another service?
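
For what it's worth, one generic way to cap the total time, independent of Crawl4AI itself, is to wrap the call in asyncio.wait_for. A minimal sketch (the 60-second budget and the return-None fallback are illustrative assumptions):

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_with_timeout(url: str, timeout_s: float = 60.0):
    async with AsyncWebCrawler(verbose=True) as crawler:
        try:
            # Cancel the crawl if it exceeds the time budget, so the
            # caller can fall back to another service.
            return await asyncio.wait_for(
                crawler.arun(url=url, bypass_cache=True),
                timeout=timeout_s,
            )
        except asyncio.TimeoutError:
            return None  # caller can switch services here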

unclecode commented 1 month ago

@Mahizha-N-S Would you please share the code, so I can try to replicate the exact case? Thanks.