Closed 1933211129 closed 1 month ago
Hi, thank you for using our library. Here is the code you need. The Markdown result is always the same, whichever strategy you use. However, when you use the cosine strategy, the result carries another field, extracted_content. This is a dumped JSON string that contains the structures you're looking for. Please look at the following code and try it out; you should get what you want. Thank you.
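import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy


async def main():
    extraction_strategy = CosineStrategy()
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
        )
        # extracted_content is a JSON string; parse it to get the segments.
        segments = json.loads(result.extracted_content)
        print(segments)
    print("Done")


if __name__ == "__main__":
    asyncio.run(main())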
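Since extracted_content is a JSON string, and a later comment in this thread reports it coming back as None, calling json.loads on it directly can raise. A minimal sketch of a defensive parser follows; the helper name parse_segments is hypothetical and not part of the library, and returning an empty list on failure is just one possible choice.

```python
import json
from typing import Any, List, Optional


def parse_segments(extracted_content: Optional[str]) -> List[Any]:
    """Parse a dumped-JSON string into a list of segments.

    Hypothetical helper (not part of crawl4ai). Returns an empty list
    when the input is None, empty, or not valid JSON, instead of raising.
    """
    if not extracted_content:
        return []
    try:
        data = json.loads(extracted_content)
    except json.JSONDecodeError:
        return []
    # Wrap a single JSON value in a list so callers can always iterate.
    return data if isinstance(data, list) else [data]
```

With this in place, `segments = parse_segments(result.extracted_content)` can replace the bare json.loads call, and an empty list signals that nothing was extracted.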
@unclecode Why is it that when I set the parameter 'bypass_cache' to 'True', both 'md' and 'extracted_content' return None?
import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy


async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        strategy = CosineStrategy(
            semantic_filter="中国式现代化",  # "Chinese-style modernization"
            word_count_threshold=10,
            max_dist=0.2,
            linkage_method='ward',
            top_k=3,
            model_name='BAAI/bge-small-zh-v1.5'
        )
        url = "http://www.people.com.cn/"
        result = await crawler.arun(url=url, extraction_strategy=strategy)
        print(result.model_dump().keys())
        print(result.extracted_content)
        # with open('001.txt', 'w') as f:
        #     f.write(result.extracted_content)
        # segments = json.loads(result.extracted_content)
        # print(segments)


asyncio.run(main())
The 'BAAI/bge-small-zh-v1.5' model is stored locally and loads without problems. I modified the source code.
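When a report like this says a field "returns None", it can help to print a compact summary of which fields are actually empty. This is a rough sketch under assumptions: the helper name summarize_fields is hypothetical, the default field names are guesses at what the result object exposes (only extracted_content appears in the code above), and the SimpleNamespace object merely stands in for a real crawl result.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def summarize_fields(
    result: Any,
    fields: Iterable[str] = ("markdown", "extracted_content"),
) -> Dict[str, str]:
    """Describe each named attribute as missing, None, or by type and size.

    Hypothetical helper; the field names are assumptions, so pass the
    names your result object really has.
    """
    summary = {}
    for name in fields:
        if not hasattr(result, name):
            summary[name] = "missing"
            continue
        value = getattr(result, name)
        if value is None:
            summary[name] = "None"
        elif isinstance(value, str):
            summary[name] = f"str, {len(value)} chars"
        else:
            summary[name] = type(value).__name__
    return summary


# Stand-in object for illustration only; a real run would pass `result`.
fake_result = SimpleNamespace(markdown="# Title", extracted_content=None)
print(summarize_fields(fake_result))
```

Pasting this summary into a bug report, next to the output of result.model_dump().keys(), makes it clear whether a field is absent or present but empty.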
Okay, I see the problem. It's a bug in the way I load the model: I'm not passing the model name to the loader properly. I'll fix it in the next version, which I'll release in a few days. In the meantime, you can try the fix from the 0.3.72 branch. I hope this solves it for you.
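import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy


async def main():
    # Pass the embedding model explicitly via model_name.
    extraction_strategy = CosineStrategy(
        model_name='sentence-transformers/all-MiniLM-L6-v2'
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            # magic=True
        )
        segments = json.loads(result.extracted_content)
        print(segments)
    print("Done")


if __name__ == "__main__":
    asyncio.run(main())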
@unclecode Okay, I got it. Thank you very much for your timely reply. Your project has benefited me a lot!
@1933211129 I really feel empowered hearing that this library helps you. Happy crawling!
@unclecode CosineStrategy does not seem to be working; the result.md when using CosineStrategy is exactly the same as when extracting content from a single URL. Why is that? I also encounter network connection issues with Hugging Face when running locally, so I'm trying to load the model directly from local files. Looking forward to your reply!
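For the local-files part of this question, one possible approach is to point model_name at a directory on disk when a local copy exists, and otherwise fall back to the Hub name. This is only a sketch under assumptions: the helper resolve_model_source and the directory layout (local_root/org/model with a config.json inside) are hypothetical, and whether CosineStrategy forwards a directory path to its loader unchanged is not confirmed in this thread. HF_HUB_OFFLINE is an environment variable read by huggingface_hub; it is usually read at import time, so it is safest to call this before importing any Hugging Face library.

```python
import os
from pathlib import Path


def resolve_model_source(model_name: str, local_root: str) -> str:
    """Return a local directory for model_name if one exists, else the name.

    Hypothetical helper. Assumes local copies live under
    local_root/<org>/<model> and contain a config.json file.
    """
    candidate = Path(local_root) / model_name
    if (candidate / "config.json").is_file():
        # Ask Hugging Face libraries not to contact the Hub.
        os.environ["HF_HUB_OFFLINE"] = "1"
        return str(candidate)
    # No local copy found: keep the Hub identifier unchanged.
    return model_name
```

The returned string could then be passed as `CosineStrategy(model_name=resolve_model_source('BAAI/bge-small-zh-v1.5', '/path/to/models'))`, where the root path is a placeholder.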