deepak-hl opened this issue 1 month ago (Open)
@unclecode I am new to crawl4ai, please help me. I want the text in chunks from crawler.run so I can use those chunks for storing embeddings. How can I do that?
@deepak-hl Thanks for using Crawl4AI. I'll take a look at your code by tomorrow and definitely update you soon 🤓
@unclecode thank you !!
@unclecode Can I crawl all the content from a site's sub-URLs by providing only its base URL in crawl4ai? If yes, then how?
@deepak-hl Thank you for using Crawl4AI. Let me go through your questions one by one. First, you're using the old synchronous version, which I'm not going to support anymore because I've moved everything to the asynchronous version. Here is a code example showing how you can properly combine all of these pieces. In this example I'm building a knowledge graph from one of Paul Graham's essays.
import asyncio
import os
from typing import List

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.chunking_strategy import OverlappingWindowChunking


class Entity(BaseModel):
    name: str
    description: str


class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str


class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]


async def main():
    extraction_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        api_token=os.getenv("OPENAI_API_KEY"),
        schema=KnowledgeGraph.model_json_schema(),
        extraction_type="schema",
        instruction="Extract entities and relationships from the given text.",
    )

    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            chunking_strategy=OverlappingWindowChunking(window_size=2000, overlap=100),
            # magic=True
        )
        # print(result.markdown[:500])
        print(result.extracted_content)
        # __data__ is the output directory in my setup; point this wherever you like
        with open(os.path.join(__data__, "kb.json"), "w") as f:
            f.write(result.extracted_content)
        print("Done")


if __name__ == "__main__":
    asyncio.run(main())
Regarding your next question, passing one URL and getting all of its sub-URLs is scraping, and the good news is we are already working on it; it is already under testing. Within a few weeks we will release the scraper alongside the crawler function. The scraper will handle a graph search: you give it a URL and you define how many levels deep you want to go, or crawl all of them. Right now there is the function arun_many([urls]). Also, after calling the crawl function, the response has a `links` property that contains all the internal and external links of the page. You can use a queue data structure: add all the internal links, crawl them, and keep adding the new internal links you discover, as in the sketch below. This is just a temporary way to do it until our scraper is ready.
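A minimal sketch of that queue-based approach, assuming `links` is a dict with "internal"/"external" lists whose items carry an "href" key; the function name, page limit, and URL normalization here are illustrative, not part of the library:

import asyncio
from urllib.parse import urldefrag, urljoin

from crawl4ai import AsyncWebCrawler


async def crawl_site(base_url: str, max_pages: int = 20):
    """Breadth-first crawl: start at base_url and keep following internal links."""
    queue = [urldefrag(base_url)[0]]
    seen = set()
    results = []

    async with AsyncWebCrawler() as crawler:
        while queue and len(results) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)

            result = await crawler.arun(url=url, bypass_cache=True)
            if not result.success:
                continue
            results.append(result)

            # result.links holds the page's internal/external links;
            # enqueue the internal ones we have not visited yet.
            for link in result.links.get("internal", []):
                href = link.get("href") if isinstance(link, dict) else link
                if not href:
                    continue
                href = urldefrag(urljoin(url, href))[0]  # resolve relative links, drop #fragments
                if href not in seen:
                    queue.append(href)

    return results


# asyncio.run(crawl_site("https://paulgraham.com/"))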
I hope I answered your questions. Let me know if you have any more.
I want the text in chunks from crawler.run so I can use those chunks for storing embeddings. How can I? It's showing me the error: 'CrawlResult' object has no attribute 'split'
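That error usually means the CrawlResult object itself is being passed to something that expects a string (the window chunker calls .split on its input). A minimal sketch of one way to get chunks for embeddings, assuming the chunking strategy's chunk() method takes plain text and with embed_and_store as a hypothetical placeholder for your embedding pipeline:

import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking


async def get_chunks(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, bypass_cache=True)

    # Pass the crawled text (a string), not the CrawlResult object itself.
    chunker = OverlappingWindowChunking(window_size=2000, overlap=100)
    chunks = chunker.chunk(result.markdown)  # list of text chunks

    # embed_and_store(chunks)  # hypothetical: plug in your embedding model here
    return chunks


# asyncio.run(get_chunks("https://paulgraham.com/love.html"))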