unclecode / crawl4ai

🔥🕷️ Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
https://crawl4ai.com
Apache License 2.0
17k stars 1.26k forks

Can't crawl WSJ sites #154

Closed immartian closed 1 month ago

immartian commented 1 month ago

Many pages, like https://www.wsj.com/world/china/chinas-patriotic-rhetoric-takes-a-violent-turn-6266ca09, are not crawlable. I've tried both sync and async modes; both return a failure:

[ERROR] 🚫 Failed to crawl https://www.nbcnews.com/business, error: Failed to crawl https://www.nbcnews.com/business: Timeout 30000ms exceeded.
=========================== logs ===========================
"load" event fired
============================================================
url='https://www.nbcnews.com/business' html='' success=False cleaned_html=None media={} links={} screenshot=None markdown=None extracted_content=None metadata=None error_message='Failed to crawl https://www.nbcnews.com/business: Timeout 30000ms exceeded.\n=========================== logs ===========================\n"load" event fired\n============================================================' session_id=None responser_headers=None status_code=None
unclecode commented 1 month ago

Hi @immartian, one general piece of advice is to set the headless parameter to False when creating the crawler instance, so you can see what is happening in the browser. With the Wall Street Journal website, for example, a human-verification step is displayed, which prevents you from getting the page content. Here's a picture of that.

(screenshot: the WSJ human-verification prompt)

Here is how you can set headless to False:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # headless=False opens a visible browser window for debugging
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        result = await crawler.arun(
            url="https://www.wsj.com/world/china/chinas-patriotic-rhetoric-takes-a-violent-turn-6266ca09",
            bypass_cache=True,
        )

asyncio.run(main())
immartian commented 1 month ago

@unclecode are you sure the headless parameter is supported?

Traceback (most recent call last):
  File "/media/im2/plus/lab2/wsj.py", line 14, in <module>
    asyncio.run(crawl())
  File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/media/im2/plus/lab2/wsj.py", line 5, in crawl
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: AsyncWebCrawler.__init__() got an unexpected keyword argument 'headless'
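A TypeError like this usually means the installed version's constructor simply doesn't accept that keyword. One way to check before calling is to inspect the constructor's signature. A minimal sketch, using a hypothetical StandInCrawler class in place of AsyncWebCrawler (the same inspect.signature call works on the real class once crawl4ai is installed):

```python
import inspect

# Hypothetical stand-in for AsyncWebCrawler, just to make the sketch runnable;
# substitute the real class from crawl4ai in practice.
class StandInCrawler:
    def __init__(self, verbose=False, headless=True):
        self.verbose = verbose
        self.headless = headless

# Collect the keyword arguments the constructor actually accepts
params = inspect.signature(StandInCrawler.__init__).parameters
print("headless" in params)  # prints True for this stand-in
```

If the check comes back False for your installed version, upgrading the package (pip install -U crawl4ai) is the usual next step, since the parameter may have been added in a later release.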