unclecode / crawl4ai

🔥🕷️ Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
https://crawl4ai.com
Apache License 2.0
17.01k stars 1.26k forks

Why is server initialization so slow? #90

Closed lyh0825 closed 2 months ago

lyh0825 commented 2 months ago

When I initialize the crawl4ai server with:

class WebCrawlerServer(WebCrawler):

    def __init__(self, *params, **kwargs):
        super().__init__(*params, **kwargs)
        self.ready = True

    def warmup(self):
        logger.info("[LOG] 🌤️  Warming up the WebCrawler")
        result = self.run(
            url='https://www.blurrai.com',
            word_count_threshold=5,
            extraction_strategy=None,
            bypass_cache=False,
            verbose=False,
        )
        logger.info(result)
        self.ready = True
        logger.info("[LOG] 🌞 WebCrawler is ready to crawl")

it takes a very long time to initialize the server. I want to know why.
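One way to narrow down where the time goes is to time the expensive call directly. Below, `slow_warmup` is a hypothetical stand-in for `crawler.warmup()`, not crawl4ai code:

```python
import time

def slow_warmup():
    # Stand-in for crawler.warmup(): simulate a network-bound first crawl.
    time.sleep(0.1)

start = time.perf_counter()
slow_warmup()
elapsed = time.perf_counter() - start
print(f"warmup took {elapsed:.2f}s")
```

Wrapping `self.run(...)` inside `warmup()` the same way would show whether the fetch itself, rather than object construction, dominates startup time.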


adminChina commented 2 months ago

Without more information, we cannot confirm the slow startup issue. Can you provide a self-contained example?

adminChina commented 2 months ago

If it is called like this:

    crawler = WebCrawler(verbose=True, crawler_strategy=crawler_comm_strategy)
    crawler.warmup()

it should print:

[LOG] 🌤️  Warming up the WebCrawler
result xxx
[LOG] 🌞 WebCrawler is ready to crawl

If warmup() is never called explicitly, nothing is printed.

But since your subclass already defaults self.ready = True in __init__, skipping warmup() still leaves the crawler marked as started.
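The ready/warmup interaction described above can be shown with a toy sketch. The names mirror the snippet in this thread; these stubs are illustrative and not the real crawl4ai classes:

```python
# Minimal stand-ins illustrating the ready/warmup interaction.
class WebCrawler:
    def __init__(self):
        self.ready = False  # base class: not ready until warmup() runs

    def warmup(self):
        # ... the expensive first crawl would happen here ...
        self.ready = True

class WebCrawlerServer(WebCrawler):
    def __init__(self):
        super().__init__()
        self.ready = True  # defaulting to True masks whether warmup() ran

server = WebCrawlerServer()
print(server.ready)  # True, even though warmup() was never called

plain = WebCrawler()
print(plain.ready)   # False until warmup() is called
plain.warmup()
print(plain.ready)   # True
```

This is why the subclass appears "ready" instantly: the flag is set in the constructor, so the slow part only runs if warmup() is invoked.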

lyh0825 commented 2 months ago

If I pass bypass_cache=True to self.run(), the server initializes very fast. OK.

adminChina commented 2 months ago

If you pass bypass_cache=True and the data comes straight from the cache, then of course startup is fast. You should also check your machine's performance and network speed.

lyh0825 commented 2 months ago

Is the URL-fetching service hosted overseas? I'm deploying and running on a machine in mainland China.

adminChina commented 2 months ago

If you are using CloudCrawlerStrategy, its service is hosted overseas. I suggest using LocalSeleniumCrawlerStrategy instead; it relies entirely on a local browser to fetch data, which is much faster.
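The strategy swap suggested above can be sketched with stubs. Only the `WebCrawler(crawler_strategy=...)` constructor pattern comes from this thread; the classes below are illustrative stand-ins, not the real crawl4ai implementations:

```python
# Illustrative stubs only -- not the real crawl4ai strategy classes.
class CloudCrawlerStrategy:
    """Stand-in for the remote-service strategy (overseas round-trip)."""
    def crawl(self, url):
        return f"cloud:{url}"

class LocalSeleniumCrawlerStrategy:
    """Stand-in for a local-browser strategy (no remote hop)."""
    def crawl(self, url):
        return f"local:{url}"

class WebCrawler:
    def __init__(self, crawler_strategy=None):
        # Injecting a strategy lets the caller avoid the remote service.
        self.strategy = crawler_strategy or CloudCrawlerStrategy()

    def run(self, url):
        return self.strategy.crawl(url)

crawler = WebCrawler(crawler_strategy=LocalSeleniumCrawlerStrategy())
print(crawler.run("https://example.com"))  # -> local:https://example.com
```

The point of the pattern: fetching moves entirely onto the machine running the crawler, so a deployment in mainland China no longer pays the cross-border latency on every request.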

adminChina commented 2 months ago

On my side, calls return results very quickly, since they don't go through their server.

adminChina commented 2 months ago

Another cause of slowness is using LLMExtractionStrategy: this parser calls an external service, so parsing is also slow. If you don't use it and stick with NoExtractionStrategy, there should be no impact. If you do need LLMExtractionStrategy, you can point it at the Zhipu API (openai/glm-4-flash, https://open.bigmodel.cn/api/paas/v4), which is free and fast.

lyh0825 commented 2 months ago

OK, I'll give it a try.

unclecode commented 1 month ago

Hi everyone, @lyh0825, I apologize for missing this issue. Since the 9th of September we have made a lot of changes, one of which is moving the entire library to an asynchronous version; it is much faster, and the performance is significantly better. I hope you can test it again and see the difference. Here is a code sample to help you get a quick start.

import asyncio

from crawl4ai import AsyncWebCrawler

async def simple_crawl():
    print("\n--- Basic Usage ---")
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown[:500])  # Print the first 500 characters

asyncio.run(simple_crawl())