unclecode / crawl4ai

🔥🕷️ Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
https://crawl4ai.com
Apache License 2.0
17k stars 1.26k forks

Can't crawl WSJ sites #154

Closed immartian closed 1 month ago

immartian commented 1 month ago

Many pages, like https://www.wsj.com/world/china/chinas-patriotic-rhetoric-takes-a-violent-turn-6266ca09, are not crawlable. I've tried both sync and async modes; both return a failure:

[ERROR] 🚫 Failed to crawl https://www.nbcnews.com/business, error: Failed to crawl https://www.nbcnews.com/business: Timeout 30000ms exceeded.
=========================== logs ===========================
"load" event fired
============================================================
url='https://www.nbcnews.com/business' html='' success=False cleaned_html=None media={} links={} screenshot=None markdown=None extracted_content=None metadata=None error_message='Failed to crawl https://www.nbcnews.com/business: Timeout 30000ms exceeded.\n=========================== logs ===========================\n"load" event fired\n============================================================' session_id=None responser_headers=None status_code=None
unclecode commented 1 month ago

Hi @immartian, one general piece of advice is to set the headless parameter to False when creating the crawler instance, so you can see what is happening in the browser. With the Wall Street Journal website, for example, a human-verification step is displayed, which prevents you from getting the page content. Here's a picture of that.

(screenshot: the WSJ human-verification prompt)

Here is how you can set headless to False:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # headless=False opens a visible browser window for debugging
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        result = await crawler.arun(
            url="https://www.wsj.com/world/china/chinas-patriotic-rhetoric-takes-a-violent-turn-6266ca09",
            bypass_cache=True,
        )

asyncio.run(main())
immartian commented 1 month ago

@unclecode are you sure the headless parameter is supported?

Traceback (most recent call last):
  File "/media/im2/plus/lab2/wsj.py", line 14, in <module>
    asyncio.run(crawl())
  File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/media/im2/plus/lab2/wsj.py", line 5, in crawl
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: AsyncWebCrawler.__init__() got an unexpected keyword argument 'headless'
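A TypeError like this usually means the installed version's constructor simply doesn't accept that keyword. One way to check before calling is to inspect the constructor's signature. A minimal sketch, using a hypothetical StandInCrawler class in place of AsyncWebCrawler (the same inspect.signature call works on the real class once crawl4ai is installed):

```python
import inspect

# Hypothetical stand-in for AsyncWebCrawler, just to make the sketch runnable;
# substitute the real class from crawl4ai in practice.
class StandInCrawler:
    def __init__(self, verbose=False, headless=True):
        self.verbose = verbose
        self.headless = headless

# Collect the keyword arguments the constructor actually accepts
params = inspect.signature(StandInCrawler.__init__).parameters
print("headless" in params)  # prints True for this stand-in
```

If the check comes back False for your installed version, upgrading the package (pip install -U crawl4ai) is the usual next step, since the parameter may have been added in a later release.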