Closed hishambutt closed 1 month ago
Hi @hishambutt, the thing is that NBC News, like any other website, may change its CSS class structure at any time. The selectors we used in the example no longer seem to work. First, we are updating our examples to use a more stable website, so that users can try them at any time and they will keep working. In the meantime, please change your code to the following, and you should see the extracted results.
```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy


async def main():
    # Each element matching baseSelector becomes one extracted item;
    # the field selectors are resolved relative to that element.
    schema = {
        "name": "News Articles",
        "baseSelector": ".wide-tease-item__info-wrapper",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": "div.wide-tease-item__description",
                "type": "text",
            },
        ],
    }

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True,
            extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
        )

    # Fall back to an empty list if nothing was extracted.
    extracted_data = json.loads(result.extracted_content or "[]")
    print(f"Extracted {len(extracted_data)} articles")

    # Guard against an empty result, which happens when the selectors
    # no longer match the page.
    if extracted_data:
        print(json.dumps(extracted_data[0], indent=2))


if __name__ == "__main__":
    asyncio.run(main())
```
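Because a site's class names can change without notice, it may help to check whether the class behind your `baseSelector` still appears in the page before relying on the extraction. Below is a minimal sketch using only the Python standard library. The `ClassCounter` and `count_class` names and the sample HTML are hypothetical illustrations, not part of crawl4ai.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count elements whose class attribute contains a given class name."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html: str, class_name: str) -> int:
    """Return how many elements in `html` carry `class_name`."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical sample markup standing in for a downloaded page.
sample_html = """
<div class="wide-tease-item__info-wrapper flex-grow-1"><h2>Title A</h2></div>
<div class="wide-tease-item__info-wrapper"><h2>Title B</h2></div>
<div class="other"><h2>Unrelated</h2></div>
"""

matches = count_class(sample_html, "wide-tease-item__info-wrapper")
print(matches)
```

If the count is zero for the real page, the site has probably changed its markup and the schema needs new selectors. Assuming the crawl result exposes the raw page as `result.html`, you could pass that string to `count_class` instead of the sample.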
Additionally, I have changed the title of this issue, because the original wording did not describe the actual problem and could cause confusion. Again, thank you so much for using our library. I hope this helps.
How can I change the browser?