unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM-Friendly Web Crawler & Scraper
Apache License 2.0

CosineStrategy is not working #178

Closed: 1933211129 closed this issue 1 month ago

1933211129 commented 1 month ago

@unclecode CosineStrategy seems not to be working: the result.md I get when using CosineStrategy is exactly the same as when I simply extract content from a single URL. Why is that? I also run into network connection issues with Hugging Face when running locally, so I'm trying to load the model directly from local files.

Looking forward to your reply!

unclecode commented 1 month ago

Hi, thank you for using our library. I'd like to share the code you need to use. The markdown result will always be the same; it is just the page rendered as Markdown. However, when you use the cosine strategy, you'll notice another attribute, extracted_content. This is a dumped JSON string that contains the structures you're looking for. Please pay attention to the following code and then try it out; you should get what you want. Thank you.

import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

async def main():
    extraction_strategy = CosineStrategy()
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
        )
        # extracted_content is a JSON string produced by the extraction strategy
        segments = json.loads(result.extracted_content)
        print(segments)

    print("Done")

if __name__ == "__main__":
    asyncio.run(main())
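To see what that dumped JSON contains, the parsed value can simply be iterated. A minimal sketch, assuming extracted_content parses into a list of segment objects (the exact keys inside each segment are not assumed here):

for i, segment in enumerate(segments[:3]):
    # Print the first few segments as-is; their exact structure depends on the strategy.
    print(f"--- segment {i} ---")
    print(segment)
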
1933211129 commented 1 month ago

@unclecode Why is it that when I set the parameter bypass_cache to True, both md and extracted_content return None?

1933211129 commented 1 month ago

import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Define extraction strategy
        strategy = CosineStrategy(
            semantic_filter="中国式现代化",  # "Chinese-style modernization"
            word_count_threshold=10,
            max_dist=0.2,
            linkage_method='ward',
            top_k=3,
            model_name='BAAI/bge-small-zh-v1.5'
        )
        url = "http://www.people.com.cn/"

        result = await crawler.arun(url=url, extraction_strategy=strategy)
        print(result.model_dump().keys())
        print(result.extracted_content)
        # with open('001.txt', 'w') as f:
        #     f.write(result.extracted_content)

        # segments = json.loads(result.extracted_content)
        # print(segments)

asyncio.run(main())

The 'BAAI/bge-small-zh-v1.5' model is stored locally and loads without problems; I modified the source code to load it from local files.
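As an alternative to patching the source, one option is to point the strategy at a local copy of the model and force the Hugging Face stack to stay offline. A minimal sketch, assuming model_name is forwarded to the underlying loader (which is exactly what the fix below addresses); the local path here is a placeholder:

import os

# Standard huggingface_hub / transformers switches for using only local files.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="中国式现代化",
    model_name="/path/to/local/bge-small-zh-v1.5",  # hypothetical local directory
)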

unclecode commented 1 month ago

Okay, I see the problem. It's a bug in the way I'm loading the model: I'm not passing the model name to the loader properly. I will fix this in the next version, which I'll release in a few days. In the meantime, you can check it out from the 0.3.72 branch. I hope this solves it for you.

import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

async def main():
    extraction_strategy = CosineStrategy(
        model_name='sentence-transformers/all-MiniLM-L6-v2'
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            # magic=True
        )
        segments = json.loads(result.extracted_content)
        print(segments)

    print("Done")

if __name__ == "__main__":
    asyncio.run(main())
1933211129 commented 1 month ago

@unclecode Okay, I got it. Thank you very much for your timely reply. Your project has benefited me a lot!

unclecode commented 1 month ago

@1933211129 I really feel empowered hearing that this library helps you. Happy crawling!