unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper

What about parallel updates #239

Open 1933211129 opened 2 weeks ago

1933211129 commented 2 weeks ago

Hi there, @unclecode! I noticed that the library has been updated to 0.3.73 with 'Parallel Power: Supercharged multi-URL crawling performance'. What are the specific updates to multi-URL crawling? In version 0.3.72 I used 'crawler.arun_many' to fetch 40 links and it took 8-10 seconds, but after updating to 0.3.73 the runtime has not gone down. Does the improvement only show with more links? I want to reduce the time spent on multi-link crawls as much as possible.

I have asked you many questions about this library already. I am a new AI student, so my questions may be very simple or even stupid. Thank you very much for your patient responses!

unclecode commented 2 weeks ago

Hey @1933211129, no worries, all questions are welcome as long as they are relevant to Crawl4AI 😃. Let's do this: send me the set of links you're currently crawling and, on your end, run it a few times to get the average time it takes to retrieve all the results. I'll test the same links on my end, compare the results, and then suggest the best approach.

As for the updates, we've made some minor performance improvements, especially algorithm adjustments that make things a bit faster. We're also prepping the scraper module: it's done, and I'm currently reviewing the code after receiving the pull request from collaborators. That should be available soon.

So, send me the links and the average time it takes for you, and I’ll give it a try on my side.

1933211129 commented 2 weeks ago

Thank you so much for getting back to me quickly. I'm sorry, but all the URLs I tested are from websites in mainland China. I'm in Beijing, so when I test with international URLs it's hard to tell whether the longer runtime is due to network lag or the code itself. Using a VPN hasn't really helped either.

For some reason, the average time to crawl these 40 links has now jumped to about 40 seconds. I ran the test 20 times, and it consistently averaged around 43 seconds.

Our team is eager to get crawl4ai working in a real application, and my advisor is pushing me, so I don't have much time to dive into the source code. As a student I'd love to learn the principles behind it, but right now I really need more of your help. Thanks again! test_parallel.txt

The code is in 'test_parallel.txt'

unclecode commented 2 weeks ago

@1933211129 Understood, I'll check your links this weekend and hopefully provide good results to help you. Feel free to share your mainland China links; I can access and test them too.

1933211129 commented 2 weeks ago

Thank you for your prompt response and willingness to assist. I understand that you have a busy schedule, so please do not feel rushed.

I truly appreciate your help and look forward to any guidance or feedback you may provide based on your assessment.

unclecode commented 1 week ago

Hi @1933211129, I am testing the links you shared with me. What is the desired total duration you have in mind for these 40 URLs you shared?

1933211129 commented 1 week ago

I'm sorry, I just saw your reply. I ran the test 20 times, and it consistently averaged around 43 seconds. This is a bit too long; I expect the total run time to be around 10-20 seconds.

unclecode commented 1 week ago

OK, got it, so you are looking for 10-20 seconds in total to crawl these 40 links. Checking...

unclecode commented 5 days ago

@1933211129 I hope you are doing well. It has been a busy week. A lot of improvements are being applied to the library, and in particular the scraping/processing step has become a lot faster, now below 100 milliseconds, which I'm very happy about. The one time component I can't control is the wait to fetch the data from the internet: that depends on the target servers and the bandwidth of the connection you are running on.

I tested some of your links; some of them I couldn't open. They go to a loading screen, even in browsers. Some of them are really slow. Usually in production this is what we do:

First of all, try to filter out websites whose servers are very slow, because keeping them in the mix drags down the whole asynchronous process that handles the parallelism. Usually in production we do it like this:

1. Separate out the slow sites and build a clean list of the "good" pages, then crawl that batch all together.
2. Set a page timeout to an acceptable value, such as 20 seconds, so any page that still turns out to be slow is cut off instead of holding everything up.
3. Handle the slow pages and slow servers afterwards, as a last batch in a separate crawl process.

This way you are not putting the entire crawling process on hold while waiting for a few slow websites.


import os, sys
# Make the local crawl4ai package importable when running this script from the repo
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

async def test_concurrent_crawling_performance():
    async with AsyncWebCrawler(verbose=True) as crawler:
        urls = [
            "https://news.gmw.cn/2024-10/31/content_37647561.htm",
            "https://www.bistu.edu.cn/xysx/449105.html",
            "http://www.las.cas.cn/news/fwcx/202410/t20241031_7411933.html",
            "http://www.las.cas.cn/news/fwcx/202410/t20241031_7411861.html",
            "http://politics.people.com.cn/n1/2024/1101/c1001-40351585.html",
            "http://politics.people.com.cn/n1/2024/1101/c1001-40351586.html",
            "http://politics.people.com.cn/n1/2024/1101/c1001-40351587.html",
        ]

        # Crawl all URLs concurrently, bypassing the cache and capping each page at 20 s
        results = await crawler.arun_many(
            urls=urls,
            cache_mode=CacheMode.BYPASS,
            page_timeout=20000,
        )
        print(len(results))

asyncio.run(test_concurrent_crawling_performance())

Here on my machine I got this result:

[INIT].... → Crawl4AI 0.3.74
[INIT].... → Warming up AsyncWebCrawler
[READY]... ✓ AsyncWebCrawler initialized
[INIT].... → Starting concurrent crawling for 7 URLs...
[PARALLEL] Started task for https://news.gmw.cn/2024-10/31/content_37647561.ht...
[PARALLEL] Started task for https://www.bistu.edu.cn/xysx/449105.html...
[PARALLEL] Started task for http://www.las.cas.cn/news/fwcx/202410/t20241031_7...
[PARALLEL] Started task for http://www.las.cas.cn/news/fwcx/202410/t20241031_7...
[PARALLEL] Started task for http://politics.people.com.cn/n1/2024/1101/c1001-4...
[PARALLEL] Started task for http://politics.people.com.cn/n1/2024/1101/c1001-4...
[PARALLEL] Started task for http://politics.people.com.cn/n1/2024/1101/c1001-4...
[FETCH]... ↓ Live fetch for http://politics.people.com.cn/n1/2024/1101/c1001-40351585.html... | Status: True | Time: 2.02s
[SCRAPE].. ◆ Processed http://politics.people.com.cn/... | Time: 125ms
[COMPLETE] ● http://politics.people.com.cn/... | Status: True | Total: 2.19s
[FETCH]... ↓ Live fetch for http://www.las.cas.cn/news/fwcx/202410/t20241031_7411861.html... | Status: True | Time: 2.05s
[SCRAPE].. ◆ Processed http://www.las.cas.cn/news/fwc... | Time: 105ms
[COMPLETE] ● http://www.las.cas.cn/news/fwc... | Status: True | Total: 2.20s
[FETCH]... ↓ Live fetch for http://politics.people.com.cn/n1/2024/1101/c1001-40351586.html... | Status: True | Time: 2.29s
[SCRAPE].. ◆ Processed http://politics.people.com.cn/... | Time: 119ms
[COMPLETE] ● http://politics.people.com.cn/... | Status: True | Total: 2.46s
[FETCH]... ↓ Live fetch for http://politics.people.com.cn/n1/2024/1101/c1001-40351587.html... | Status: True | Time: 2.62s
[SCRAPE].. ◆ Processed http://politics.people.com.cn/... | Time: 106ms
[COMPLETE] ● http://politics.people.com.cn/... | Status: True | Total: 2.77s
[FETCH]... ↓ Live fetch for http://www.las.cas.cn/news/fwcx/202410/t20241031_7411933.html... | Status: True | Time: 6.73s
[SCRAPE].. ◆ Processed http://www.las.cas.cn/news/fwc... | Time: 99ms
[COMPLETE] ● http://www.las.cas.cn/news/fwc... | Status: True | Total: 6.87s
[FETCH]... ↓ Live fetch for https://news.gmw.cn/2024-10/31/content_37647561.htm... | Status: True | Time: 10.34s
[SCRAPE].. ◆ Processed https://news.gmw.cn/2024-10/31... | Time: 75ms
[COMPLETE] ● https://news.gmw.cn/2024-10/31... | Status: True | Total: 10.44s
[FETCH]... ↓ Live fetch for https://www.bistu.edu.cn/xysx/449105.html... | Status: True | Time: 16.49s
[SCRAPE].. ◆ Processed https://www.bistu.edu.cn/xysx/... | Time: 124ms
[COMPLETE] ● https://www.bistu.edu.cn/xysx/... | Status: True | Total: 16.67s
[COMPLETE] ● Concurrent crawling completed for 7 URLs | Total time: 16.68s

Here I picked seven of your links. The total took around 16-17 seconds, which averages out to roughly 2.4-2.5 seconds per URL, which is pretty good. But I need you to pay attention to the exact numbers. Look at the lines that start with [FETCH]: they show how long the browser spent loading the page, which basically means how long the web server behind the URL took to respond to us. You can see that some of these servers are very slow; one took around 10 seconds, while the majority took around two to three.

Then look at the lines that start with [SCRAPE]: they show how long Crawl4AI itself took to process each page once it was fetched. As you can see, that is around 100 milliseconds, which is very fast. So compare these two numbers to see where the time is actually going and how to get what you need.

This is version 0.3.74, which I'm going to release tonight or tomorrow; then you can try it yourself and see exactly where you are losing time. And again, remember that arun_many() is not the most optimized way to run things in parallel. I'm working on a dedicated parallel executor that I will release very soon; with that one it will be even faster. I hope this helps.
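
To make the two-batch strategy described above concrete, here is a minimal sketch, not an official Crawl4AI example. It reuses arun_many(), CacheMode.BYPASS, and page_timeout exactly as in the test script above; the fast/slow split, the 60-second timeout for the slow batch, and reading url and success from the result objects are illustrative assumptions to adapt to your own measurements.

import asyncio, time
from crawl4ai import AsyncWebCrawler, CacheMode

# Illustrative split: in practice, build the "slow" list from hosts you have
# already measured to be slow (for example, from the [FETCH] times in the log above).
FAST_URLS = [
    "http://politics.people.com.cn/n1/2024/1101/c1001-40351585.html",
    "http://www.las.cas.cn/news/fwcx/202410/t20241031_7411861.html",
]
SLOW_URLS = [
    "https://www.bistu.edu.cn/xysx/449105.html",  # ~16 s fetch in the log above
]

async def crawl_in_batches():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Batch 1: the "good" pages, with a tight timeout so one slow server
        # cannot hold the whole run hostage.
        start = time.perf_counter()
        fast_results = await crawler.arun_many(
            urls=FAST_URLS,
            cache_mode=CacheMode.BYPASS,
            page_timeout=20000,  # 20 s cap, as suggested above
        )
        print(f"Fast batch: {len(fast_results)} results in {time.perf_counter() - start:.1f}s")

        # Batch 2: the known-slow pages, crawled separately with a looser timeout
        # (60 s here is an assumption; tune it to your servers).
        start = time.perf_counter()
        slow_results = await crawler.arun_many(
            urls=SLOW_URLS,
            cache_mode=CacheMode.BYPASS,
            page_timeout=60000,
        )
        print(f"Slow batch: {len(slow_results)} results in {time.perf_counter() - start:.1f}s")

        # Anything that still failed (timeout, unreachable) can be retried later or
        # dropped; assumes each result is a CrawlResult exposing .url and .success.
        failed = [r.url for r in fast_results + slow_results if not r.success]
        print("Failed URLs:", failed)

asyncio.run(crawl_in_batches())

The point of the split is that the tight timeout on the first batch keeps the fast pages from waiting on the slow servers, while the second batch and the failed list tell you which sites to retry, fix, or drop.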

1933211129 commented 5 days ago

@unclecode I am truly delighted to receive your detailed response. I fully understand how busy your work must be, and I hope that amidst your busy schedule, you remember to take good care of yourself.

Thank you so much for your thorough explanation—I understand what you’ve mentioned. I am very much looking forward to version 0.3.74! Once again, thank you so much for your efforts! This is truly remarkable work!!!!

unclecode commented 5 days ago

Thank you for your kind words and wishes! I’ll definitely keep that in mind, and I wish you the best as well. If you have any questions about your projects, feel free to reach out! ☺️