After working through my project, I found that this is the response for unresolved webpages:
[ERROR] 🚫 Failed to crawl https://www.whattodonowiamnotapersonseemee.com/, error: Failed to crawl https://www.whattodonowiamnotapersonseemee.com/: Page.goto: net::ERR_NAME_NOT_RESOLVED at https://www.whattodonowiamnotapersonseemee.com/
Call log:
navigating to "https://www.whattodonowiamnotapersonseemee.com/", waiting until "domcontentloaded"
{"level":"ERROR","time":"Wed Oct 16 2024 12:18:20 IST+0530","name":"FastAPI Python Server","msg":"Error in crawling https://www.whattodonowiamnotapersonseemee.com/, Failed to crawl https://www.whattodonowiamnotapersonseemee.com/: Page.goto: net::ERR_NAME_NOT_RESOLVED at https://www.whattodonowiamnotapersonseemee.com/
Call log:
navigating to "https://www.whattodonowiamnotapersonseemee.com/", waiting until "domcontentloaded"
"}
And this is the response when the HTML content can't be extracted:
[ERROR] 🚫 Failed to crawl https://mercedes-benz.com, error: Process HTML, Failed to extract content from the website: https://mercedes-benz.com, error: can only concatenate str (not "NoneType") to str
{"level":"ERROR","time":"Wed Oct 16 2024 12:17:32 IST+0530","name":"FastAPI Python Server ","msg":"Error in crawling https://mercedes-benz.com, Process HTML, Failed to extract content from the website: https://mercedes-benz.com, error: can only concatenate str (not "NoneType") to str"}
{"level":"INFO","time":"Wed Oct 16 2024 12:18:22 IST+0530","name":"FastAPI Python Server ","msg":"Skipping URL: https://mercedes-benz.com due to empty content."}
If this happens inside arun_many, will it put the entire crawl into an error state? Could we have something like a failed_urls list that reports each failed URL along with the reason?
Also, during crawling my model's token limit was exceeded, which put my program into an infinite crawling loop, so I added an exception handler for it. Hope this gives ideas for enhancement!
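A minimal sketch of the kind of failed_urls report suggested above, built around arun_many; the result fields used here (url, success, error_message) are assumptions about what each crawl result exposes, not confirmed API:

```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    urls = [
        "https://www.whattodonowiamnotapersonseemee.com/",
        "https://mercedes-benz.com",
    ]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls)

    # Collect every unsuccessful result together with its error message,
    # instead of letting one bad URL fail the whole batch.
    failed_urls = [
        {"url": r.url, "reason": r.error_message}
        for r in results
        if not r.success
    ]
    for entry in failed_urls:
        print(f"FAILED {entry['url']}: {entry['reason']}")


asyncio.run(main())
```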
@Mahizha-N-S Thx for the suggestion, appreciate it. For pages that do not exist, like 404s, there are two situations. The success flag in the returned result is true, but the content is whatever that website returns, because not all websites actually send back a 404 status code. The status code is also part of the result, so you can filter based on it. Another thing is that the latest version has a page timeout parameter, so you can set the page timeout to any amount that you want. Regarding the token limit, I don't quite understand; if you share a code snippet, I can try it on my end.
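A short sketch of the filtering described above; it assumes arun accepts a page_timeout keyword (in milliseconds) and that the result carries a status_code field, as mentioned in the comment:

```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://mercedes-benz.com",
            page_timeout=30000,  # assumed keyword: give up navigation after 30 s
        )

    # success can be True even when the site serves a "not found" page,
    # so filter on the HTTP status code as well.
    if result.success and result.status_code and result.status_code < 400:
        print(result.markdown[:500])
    else:
        print(f"Skipping {result.url}: status={result.status_code}, "
              f"error={result.error_message}")


asyncio.run(main())
```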
@unclecode Thanks for the reply, I got what you are saying regarding the timeout; is this updated in the docs example for reference? Regarding the model limit being exceeded, I meant that if I use Groq or any other token-limited provider and there are many URLs to scrape, I observed in the terminal that the error log was stuck in a loop. So maybe if the model errors out during arun_many, we could catch the exception? This was just what I observed, hope it makes sense.
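To illustrate the point about token-limited providers, here is a hedged sketch of per-URL exception handling that ends the run instead of retrying forever; catching Exception is deliberately broad because the exact exception type raised for a Groq token-limit error is not confirmed here:

```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def crawl_with_guard(urls):
    failed = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            try:
                result = await crawler.arun(url=url)
                if not result.success:
                    failed.append((url, result.error_message))
            except Exception as exc:  # e.g. a provider token-limit error
                failed.append((url, str(exc)))
                break  # stop the run cleanly instead of looping
    return failed


# Example usage:
# asyncio.run(crawl_with_guard(["https://example.com"]))
```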
@Mahizha-N-S We will update the docs and example by next week, after releasing 0.3.72, and thx for the explanation. Now it's clear to me that this is something we have to check; you have a point here. The desired outcome is that no matter what, the crawling process should come to an end. Of course, the success flag can be true or false, and if it's false, you'll have the error message explaining what happened. It shouldn't go into an infinite loop or anything else, whether it's crawling one URL or multiple. Thanks for sharing this. We'll check and update it.
@unclecode , Thanks for the update, looking forward to it!!!
You're welcome @Mahizha-N-S
While crawling multiple URLs, how does the crawler handle a URL that is not found on the net (404 Page Not Found)?