unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0

Is it possible to extract with LLMExtractionStrategy from markdown or cleaned_html (Not from html)? #82

Closed. takan1 closed this 2 weeks ago

takan1 commented 2 weeks ago

Is it possible to extract with LLMExtractionStrategy from markdown or cleaned_html (Not from html)?

unclecode commented 2 weeks ago

@takan1 Right now you can do it this way, by crawling first and then running the strategy on the markdown yourself:

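import os
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# `crawler` is assumed to be an already-initialized WebCrawler instance.
result = crawler.run("https://www.nbcnews.com/business", word_count_threshold=0)

# Run the LLM strategy directly on the markdown instead of the raw HTML.
llm_extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token=os.getenv("OPENAI_API_KEY"),
    instruction="Extract headers from this markdown content",
)
# First argument is the URL (left empty here); second is the list of text sections.
extraction_result = llm_extraction_strategy.run("", [result.markdown])
print(extraction_result)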

However, this seems to me like a good option to add to the library.
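
Until something like that lands, the workaround can be wrapped in a small helper. This is only a rough sketch, not part of crawl4ai: `extract_from`, `ALLOWED_SOURCES`, and `EchoStrategy` are hypothetical names made up for illustration. It assumes the crawl result exposes `markdown` and `cleaned_html` attributes and that the strategy has the same `run(url, sections)` call shape used in the snippet above. `EchoStrategy` is a stand-in so the example runs offline; in practice you would pass an `LLMExtractionStrategy` instead.

```python
from types import SimpleNamespace

# Fields of the crawl result that this sketch allows as extraction input.
ALLOWED_SOURCES = ("markdown", "cleaned_html")


def extract_from(result, strategy, source="markdown"):
    """Run `strategy` on one text field of a crawl result (hypothetical helper)."""
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"source must be one of {ALLOWED_SOURCES}, got {source!r}")
    content = getattr(result, source, None)
    if not content:
        raise ValueError(f"crawl result has no {source} content")
    # Same call shape as the workaround: an empty URL and a one-item list of sections.
    return strategy.run("", [content])


class EchoStrategy:
    """Offline stand-in for LLMExtractionStrategy; it just echoes its sections."""

    def run(self, url, sections):
        return [{"index": i, "content": s} for i, s in enumerate(sections)]


# A fake crawl result so the sketch runs without a crawler or an API key.
fake_result = SimpleNamespace(
    markdown="# Business news",
    cleaned_html="<h1>Business news</h1>",
)
print(extract_from(fake_result, EchoStrategy(), source="cleaned_html"))
```

With a real crawl, the call would look like `extract_from(result, llm_extraction_strategy, source="markdown")`, where `result` and `llm_extraction_strategy` are the objects from the snippet above.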