Hi @b-sai, thank you for using our library. Let me show you how you can handle this one. Certain websites render their content entirely on the client side: after the initial page loads, they call the backend, fetch JSON data, and then render the page. This page, along with the links you are searching for, is one of them. There are different ways to handle this with our library. One easy way is to use `wait_for` to wait for a certain criterion to be met, such as a specific element being constructed on the page, and then crawl the page. Here's the code I created for you. When I checked the website you shared, I found a single HTML element that wraps the content, with the id `overview`. So, in the following code, I make sure this element already exists on the page before we extract the content.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78",
            bypass_cache=True,
            wait_for="css:#overview"  # wait until the #overview element exists
        )
        print(result.markdown)

asyncio.run(main())
```
Output:
```
[LOG] Warming up the AsyncWebCrawler
[LOG] AsyncWebCrawler is ready to crawl
[LOG] Crawling https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78 using AsyncPlaywrightCrawlerStrategy...
[LOG] Crawled https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78 successfully!
[LOG] Crawling done for https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78, success: True, time taken: 1.88 seconds
[LOG] Content extracted for https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78, success: True, time taken: 0.04 seconds
[LOG] Extracting semantic blocks for https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78, Strategy: AsyncWebCrawler
[LOG] Extraction done for https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78, time taken: 0.04 seconds.
```
If you run this, you won't encounter any errors. You can also define other criteria you like: pass a JavaScript function that returns a boolean, and Crawl4AI will wait until that function returns true, as in the sketch below.
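For instance, the same crawl can wait on a JavaScript condition instead of a CSS selector, using the `js:` prefix for `wait_for`. A minimal sketch; the condition itself is illustrative, not specific to any site:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78",
            bypass_cache=True,
            # Crawl4AI evaluates this function in the page and waits
            # until it returns true; the selector here is an assumption.
            wait_for="js:() => document.querySelector('#overview') !== null",
        )
        print(result.markdown)

asyncio.run(main())
```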
You may also use the hooks we have implemented in the crawler, which run at certain points in the crawling timeline, and inject your own code there; for example, you can apply a custom delay. That is another solution. There are multiple ways to achieve this; if you go to the examples folder in our documentation, you'll find them, and if you have any questions, please feel free to ask here.
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def on_execution_started(page):
    await asyncio.sleep(2)  # Wait for the page to load
    # Other tasks you may do here

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
        # Rest of code...
```
I hope this is helpful for you. By the way, thank you so much for reporting this issue. We'll try to make our error messages a bit better, with suggestions, and in the next versions we're going to add smart detection for such situations that applies a relevant waiting delay. We don't want to do this by default because our goal is very fast crawling, and not all websites are like this, but we're working on a way to detect this case. For now, `wait_for` is your solution.
Thanks so much for this guidance, super helpful and very thorough! The smart detection would be super useful!
I tried your solution, and this is my console:

```
[ERROR] Failed to crawl https://mercedes-benz.com/en/, error: Wait condition failed: Timeout after 30000ms waiting for selector '#overview'
{"level":"ERROR","time":"Tue Oct 15 2024 16:36:16 IST+0530","name":"FastAPI Python Server For ","msg":"Error in crawling https://mercedes-benz.com/en/, Wait condition failed: Timeout after 30000ms waiting for selector '#overview'"}
```
What about these types of websites?
```
[LOG] Warming up the AsyncWebCrawler
[LOG] AsyncWebCrawler is ready to crawl
[LOG] Crawling https://mercedes-benz.com/en/ using AsyncPlaywrightCrawlerStrategy...
[LOG] Crawled https://mercedes-benz.com/en/ successfully!
[LOG] Crawling done for https://mercedes-benz.com/en/, success: True, time taken: 7.95 seconds
[ERROR] Failed to crawl https://mercedes-benz.com/en/, error: Failed to extract content from the website: https://mercedes-benz.com/en/, error: can only concatenate str (not "NoneType") to str
{"level":"ERROR","time":"Tue Oct 15 2024 16:31:02 IST+0530","name":"FastAPI Python Server For","msg":"Error in crawling https://mercedes-benz.com/en/, Failed to extract content from the website: https://mercedes-benz.com/en/, error: can only concatenate str (not "NoneType") to str"}
```
I am using this with the LLM extraction strategy, with an `instruction`.

And how can I make it stop when it takes too long, so that maybe I can switch the service?
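For reference, one library-agnostic way to enforce such a cutoff is to bound the crawl coroutine with `asyncio.wait_for`. This is a sketch, not Crawl4AI API: the `timeout_s` value and the `None` fallback are assumptions.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_with_timeout(url: str, timeout_s: float = 60.0):
    """Run a crawl but give up after timeout_s seconds."""
    async with AsyncWebCrawler(verbose=True) as crawler:
        try:
            return await asyncio.wait_for(
                crawler.arun(url=url, bypass_cache=True),
                timeout=timeout_s,
            )
        except asyncio.TimeoutError:
            # Crawl took too long; the caller can switch to another service here
            return None

result = asyncio.run(crawl_with_timeout("https://mercedes-benz.com/en/", timeout_s=30.0))
```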
@Mahizha-N-S Would you please share the code, so I can try to replicate the exact case? Thx
The following code:
Returns:
Would love to know how I can fix it.
Making a simple Python GET request seems to return the HTML, so it doesn't seem to be an issue with access being blocked by headers and such.
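That sanity check looks roughly like this (a sketch; the URL is assumed to be the one from earlier in the thread):

```python
import requests

resp = requests.get(
    "https://jobs.ashbyhq.com/writer/1c5617a6-3295-4333-83a0-346987e7be78",
    timeout=30,
)
print(resp.status_code)  # 200 here suggests access itself is not blocked
print(resp.text[:500])   # Raw HTML comes back even without a browser
```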