unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
Apache License 2.0

How can I extract text from the CrawlResult? #171

Open deepak-hl opened 1 month ago

deepak-hl commented 1 month ago
import os

from crawl4ai import WebCrawler
from crawl4ai.chunking_strategy import SlidingWindowChunking
from crawl4ai.extraction_strategy import LLMExtractionStrategy

crawler = WebCrawler()
crawler.warmup()

strategy = LLMExtractionStrategy(
    provider='openai',
    api_token=os.getenv('OPENAI_API_KEY')
)
loader = crawler.run(url=all_urls[0], extraction_strategy=strategy)
chunker = SlidingWindowChunking(window_size=2000, step=50)
texts = chunker.chunk(loader)  # raises: 'CrawlResult' object has no attribute 'split'
print(texts)

I want the text from crawler.run in chunks so I can use it to store embeddings. How can I do that? It's showing me the error: 'CrawlResult' object has no attribute 'split'

deepak-hl commented 1 month ago

@unclecode I am new to crawl4ai. Please help me: I want the text from crawler.run in chunks so I can use it to store embeddings. How can I do that?

unclecode commented 1 month ago

@deepak-hl Thanks for using Crawl4AI. I'll take a look at your code by tomorrow and definitely update you soon 🤓

deepak-hl commented 1 month ago

@unclecode Thank you!

deepak-hl commented 1 month ago

@unclecode Can I crawl all the content from a site's sub-URLs by providing only its base URL in crawl4ai? If yes, then how?

unclecode commented 1 month ago

@deepak-hl Thank you for using Crawl4ai. Let me go through your questions one by one. First, you're using the old synchronous version, which I'm not going to support anymore because I moved everything to the asynchronous version. Here is a code example showing how you can properly combine all of these pieces together. In this example I'm building a knowledge graph from one of Paul Graham's essays.

import asyncio
import os
from typing import List

from pydantic import BaseModel

from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Entity(BaseModel):
    name: str
    description: str

class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]

async def main():
    extraction_strategy = LLMExtractionStrategy(
            provider='openai/gpt-4o-mini',
            api_token=os.getenv('OPENAI_API_KEY'),
            schema=KnowledgeGraph.model_json_schema(),
            extraction_type="schema",
            instruction="""Extract entities and relationships from the given text."""
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            chunking_strategy=OverlappingWindowChunking(window_size=2000, overlap=100),
            # magic=True
        )
        # print(result.markdown[:500])
        print(result.extracted_content)
        # Write the extracted knowledge graph to disk.
        # (__data__ is assumed to point to an existing output directory.)
        with open(os.path.join(__data__, "kb.json"), "w") as f:
            f.write(result.extracted_content)

    print("Done")

if __name__ == "__main__":
    asyncio.run(main())
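
To come back to your original question about chunks for embeddings: the chunking strategies operate on a plain string, which is why passing the CrawlResult object itself raised 'CrawlResult' object has no attribute 'split'. A minimal sketch (assuming you just want the page text chunked for embedding) is to chunk result.markdown inside the same async with block:

        # Inside the `async with AsyncWebCrawler() as crawler:` block above,
        # chunk the page text (a string), not the CrawlResult object itself.
        chunker = OverlappingWindowChunking(window_size=2000, overlap=100)
        chunks = chunker.chunk(result.markdown)
        print(f"{len(chunks)} chunks ready to be embedded")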

Regarding your next question, passing one URL and getting all of its sub-URLs is scraping, and the good news is we are already working on it; it is under testing, and within a few weeks we will release the scraper alongside the crawler function. The scraper will handle a graph search: you give it a URL and define how many levels deep you want to go, or crawl all of it. Right now there is the function `arun_many([urls])`. After calling the crawl function, the result has a `links` property that contains all the internal and external links of the page. You can use a queue data structure: add all the internal links, crawl them, and keep adding any new internal links you find. This is just a temporary way to do it, so wait for our scraper to be ready.
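
In the meantime, here is a rough sketch of that queue-based approach. It assumes result.links is a dict whose "internal" entries are dicts with an "href" field (the exact shape may vary by version), and it caps the crawl at a fixed number of pages:

import asyncio
from urllib.parse import urljoin

from crawl4ai import AsyncWebCrawler

async def crawl_site(start_url: str, max_pages: int = 20):
    seen = {start_url}
    queue = [start_url]
    results = []
    async with AsyncWebCrawler() as crawler:
        while queue and len(results) < max_pages:
            url = queue.pop(0)
            result = await crawler.arun(url=url, bypass_cache=True)
            results.append(result)
            # Assumption: result.links["internal"] is a list of dicts with an "href" key.
            for link in result.links.get("internal", []):
                href = urljoin(url, link.get("href", ""))
                if href and href not in seen:
                    seen.add(href)
                    queue.append(href)
    return results

if __name__ == "__main__":
    pages = asyncio.run(crawl_site("https://paulgraham.com/love.html"))
    print(f"Crawled {len(pages)} pages")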

I hope I answered your questions. Let me know if you have any other questions.