unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
Apache License 2.0

CosineStrategy is not working #178

Closed 1933211129 closed 2 hours ago

1933211129 commented 4 hours ago

@unclecode CosineStrategy doesn't seem to be working: the result.md I get with CosineStrategy is exactly the same as when I extract content from a single URL without any strategy. Why is that? I also run into network connection issues with Hugging Face when running locally, so I'm trying to load the model directly from local files.

Looking forward to your reply!
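
For the Hugging Face connectivity problem, a common workaround (a sketch of standard huggingface_hub / sentence-transformers usage, not something specific to crawl4ai) is to download the model once and then force offline loading from the local cache:

import os

# Tell the Hugging Face libraries to use only locally cached files and skip
# all network calls. These environment variables belong to
# huggingface_hub / transformers, not to crawl4ai itself.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from sentence_transformers import SentenceTransformer

# Alternatively, load directly from a local directory that already contains
# the model files (config, tokenizer, weights). The path is a placeholder.
model = SentenceTransformer("/path/to/local/model")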

unclecode commented 2 hours ago

Hi, thank you for using our library. Here is the code you need. The result.md will always be the same; it is simply the page converted to Markdown. When you use the cosine strategy, look at another field instead, extracted_content, which is a dumped JSON containing the structures you're looking for. Please try the code below and you should get what you want. Thank you.

import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

async def main():
    # Cluster the page content by cosine similarity.
    extraction_strategy = CosineStrategy()
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
        )
        # extracted_content is a JSON string; result.md stays plain Markdown.
        segments = json.loads(result.extracted_content)
        print(segments)

    print("Done")

if __name__ == "__main__":
    asyncio.run(main())

1933211129 commented 1 hour ago

@unclecode Why is it that when I set the parameter bypass_cache to True, both md and extracted_content return None?
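
A quick way to narrow this down is to check whether the crawl itself succeeded before reading the output fields; the sketch below assumes the result object exposes success and error_message attributes:

import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

async def diagnose(url):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=CosineStrategy(),
        )
        # If the fetch or the extraction failed, the output fields stay None.
        # success and error_message are assumed attributes of the result.
        if not result.success:
            print("Crawl failed:", result.error_message)
        else:
            print("extracted_content is None:", result.extracted_content is None)

asyncio.run(diagnose("https://paulgraham.com/love.html"))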

1933211129 commented 1 hour ago

import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Define extraction strategy
        strategy = CosineStrategy(
            semantic_filter="中国式现代化",  # "Chinese-style modernization"
            word_count_threshold=10,
            max_dist=0.2,
            linkage_method='ward',
            top_k=3,
            model_name='BAAI/bge-small-zh-v1.5'
        )
        url = "http://www.people.com.cn/"

        result = await crawler.arun(url=url, extraction_strategy=strategy)
        print(result.model_dump().keys())
        print(result.extracted_content)
        # with open('001.txt', 'w') as f:
        #     f.write(result.extracted_content)

        # segments = json.loads(result.extracted_content)
        # print(segments)

asyncio.run(main())

The 'BAAI/bge-small-zh-v1.5' model is stored locally and loads without problems; I modified the source code so that it is loaded from the local files.
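
For reference, one way to get the model onto disk without patching the library is to download it once with huggingface_hub and then pass the local directory as model_name; snapshot_download below is standard huggingface_hub usage, and whether CosineStrategy accepts a local path depends on how it forwards model_name to the loader:

from huggingface_hub import snapshot_download
from crawl4ai.extraction_strategy import CosineStrategy

# Download the model files once (network is needed only this one time) and
# get the local directory where they were stored.
local_dir = snapshot_download(repo_id="BAAI/bge-small-zh-v1.5")

# Pass the local directory as model_name. This only helps if CosineStrategy
# actually hands model_name to the underlying sentence-transformers loader.
strategy = CosineStrategy(
    semantic_filter="中国式现代化",  # "Chinese-style modernization"
    model_name=local_dir,
)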

unclecode commented 1 hour ago

Okay, I see the problem. It's a bug in the way I'm loading the model: I'm not passing the model name to the loader properly. I'll fix this in the next version, which I'll release in a few days. In the meantime, you can use the 0.3.72 branch. I hope this solves it for you.

import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

async def main():
    # On the 0.3.72 branch, model_name is passed through to the loader.
    extraction_strategy = CosineStrategy(
        model_name='sentence-transformers/all-MiniLM-L6-v2'
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            # magic=True
        )
        segments = json.loads(result.extracted_content)
        print(segments)

    print("Done")

if __name__ == "__main__":
    asyncio.run(main())

1933211129 commented 1 hour ago

@unclecode Okay, I got it. Thank you very much for the prompt reply; your project has helped me a lot!