unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0

Issue with JsonCssExtractionStrategy #163

Closed hishambutt closed 1 week ago

hishambutt commented 1 week ago

How can I change the browser?

import asyncio
from crawl4ai import AsyncWebCrawler
import base64
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def main():
    schema = {
        "name": "News Articles",
        "baseSelector": "article.tease-card",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": "div.tease-card__info",
                "type": "text",
            }
        ],
    }

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True)
        )
        extracted_data = json.loads(result.extracted_content)
        print(f"Extracted {len(extracted_data)} articles")
        print(json.dumps(extracted_data[0], indent=2))

if __name__ == "__main__":
    asyncio.run(main())

Output:

Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.
[LOG] 🌤️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.05 seconds
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.05 seconds.
Extracted 258 articles
{
  "index": 0,
  "tags": [],
  "content": "IE 11 is not supported. For an optimal experience visit our site on another\nbrowser."
}

unclecode commented 1 week ago

Hi @hishambutt, the NBC News site, like any other website, can change its CSS class structure at any time, and the selectors used in our example no longer seem to match the current page. We are updating our examples to use a more stable website so that they keep working whenever users try them. In the meantime, please change your code to the following, and you should see the extracted results.

import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "News Articles",
        "baseSelector": ".wide-tease-item__info-wrapper",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": "div.wide-tease-item__description",
                "type": "text",
            }
        ],
    }

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True,  # fetch the page fresh instead of reusing a cached result
            extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True)
        )
        extracted_data = json.loads(result.extracted_content)
        print(f"Extracted {len(extracted_data)} articles")
        print(json.dumps(extracted_data[0], indent=2))

if __name__ == "__main__":
    asyncio.run(main())
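
Since selectors like these can stop matching whenever the site changes, it may help to sanity-check the result before indexing into it. The following is only a minimal sketch using the standard library: `check_extraction` is a hypothetical helper, not part of crawl4ai, and it assumes `result.extracted_content` is a JSON string holding a list of dicts, as in the output shown earlier in this thread. The sample strings are made up for illustration.

```python
import json


def check_extraction(extracted_content, schema):
    """Hypothetical helper (not part of crawl4ai).

    Returns (ok, message) saying whether the extracted JSON looks like
    output produced from the given schema.
    """
    expected = {field["name"] for field in schema["fields"]}
    items = json.loads(extracted_content) if extracted_content else []
    if not items:
        return False, "no items extracted; baseSelector may no longer match"
    # Generic keys such as "index"/"tags"/"content" mean the schema was not applied.
    if not expected & set(items[0]):
        return False, f"first item has none of the schema fields {sorted(expected)}"
    return True, f"{len(items)} items extracted"


schema = {
    "name": "News Articles",
    "baseSelector": ".wide-tease-item__info-wrapper",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": "div.wide-tease-item__description", "type": "text"},
    ],
}

# Illustrative inputs: one shaped like the unexpected output above, one like schema output.
unexpected = json.dumps([{"index": 0, "tags": [], "content": "IE 11 is not supported."}])
matching = json.dumps([{"title": "Example headline", "summary": "Example summary"}])

print(check_extraction(unexpected, schema))
print(check_extraction(matching, schema))
```

In the script above, this check could be called as `check_extraction(result.extracted_content, schema)` before `extracted_data[0]` is printed, so a stale selector produces a readable message rather than misleading data or an IndexError.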

I have also changed the title of this issue, because the original wording did not match the actual problem and could cause confusion. Thank you again for using our library; I hope this helps.