immartian closed 1 month ago
Hi @immartian, one general piece of advice is to set the headless parameter to False when creating the crawler instance, so you can see what is happening in the browser. With the Wall Street Journal website, for example, a human verification step is displayed, which prevents you from getting the result you want. Here's a picture of that.
Here is how you can turn headless mode off:
async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
    result = await crawler.arun(
        url="https://www.wsj.com/world/china/chinas-patriotic-rhetoric-takes-a-violent-turn-6266ca09",
        bypass_cache=True,
    )
@unclecode are you sure the headless parameter is supported?
Traceback (most recent call last):
  File "/media/im2/plus/lab2/wsj.py", line 14, in <module>
    asyncio.run(crawl())
  File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/media/im2/plus/lab2/wsj.py", line 5, in crawl
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: AsyncWebCrawler.__init__() got an unexpected keyword argument 'headless'
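The TypeError above says that the installed version's constructor does not accept headless. As a rough sketch, one way to check whether a constructor takes a given keyword before passing it is to inspect its signature. The classes OldCrawler and NewCrawler and the helpers accepts_keyword and build are hypothetical stand-ins for illustration, not part of the library in question:

```python
import inspect


def accepts_keyword(cls, name):
    """Return True if cls.__init__ takes `name` explicitly or via **kwargs."""
    params = inspect.signature(cls.__init__).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for two library versions (assumption, not real classes).
class OldCrawler:
    def __init__(self, verbose=False):
        self.verbose = verbose


class NewCrawler:
    def __init__(self, verbose=False, **kwargs):
        self.verbose = verbose
        self.options = kwargs


def build(cls, **options):
    """Instantiate cls, dropping any option its constructor would reject."""
    supported = {k: v for k, v in options.items() if accepts_keyword(cls, k)}
    return cls(**supported)
```

Calling build(OldCrawler, verbose=True, headless=False) drops the unsupported keyword instead of raising, while accepts_keyword(AsyncWebCrawler, "headless") on a real install would show whether the parameter is available in that version.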
Many pages, such as https://www.wsj.com/world/china/chinas-patriotic-rhetoric-takes-a-violent-turn-6266ca09, are not crawlable. I've tried both sync and async modes, and both fail: