probberechts / soccerdata

⛏⚽ Scrape soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, FotMob, Sofascore, SoFIFA, Understat and WhoScored.
https://soccerdata.readthedocs.io/en/latest/

[Whoscored] headless error with WSL #398

Closed mhd0528 closed 9 months ago

mhd0528 commented 9 months ago

Hi,

I have run into the following errors when setting headless to either True or False. With headless=False: image

With headless=True: image

I found similar reports of the error I get with headless=True, which might be caused by undetected-chromedriver. But I don't know what causes the error with headless=False. I have installed undetected-chromedriver==3.5.3 and chromium-browser 1:85.0.4183.83-0ubuntu0.20.04.3 amd64. Most of the posts I found mention Chrome 116 or 117, so I'm wondering whether my older browser version could be the reason, since I'm using the package on WSL in Windows.

Thanks.

probberechts commented 9 months ago

This SO post explains the problem and gives a few solutions: https://stackoverflow.com/questions/77191221/undetected-chromedriver-attributeerror-chromeoptions-object-has-no-attribute

I haven't tested it yet, but downgrading selenium to 4.12 seems the most straightforward solution.

Ultimately, this has to be fixed in undetected-chromedriver (see https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/1584); not in soccerdata.
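Since the suggested workaround depends on which package versions are installed, it can help to print them before reporting or downgrading. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and is not part of soccerdata.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as published on PyPI.
    for name in ("selenium", "undetected-chromedriver", "soccerdata"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running it in the same environment as the scraper shows whether the selenium downgrade actually took effect.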

Gibranium commented 9 months ago

Sorry to say it, Pieter, but downgrading selenium doesn't work for me. I tried 4.12, 4.11 and 4.10, and the SeleniumBase "import SB" method from the link didn't work either. Here I've used the SB method with tor active; I don't know what's wrong at this point.

Screenshot 2023-10-06 alle 22 58 43

probberechts commented 9 months ago

@Gibranium Your issue seems related to #395. WhoScored can detect that you are a bot when you are in headless mode. You'll have to run in non-headless mode for now.

Gibranium commented 9 months ago

OK, I've tried again with selenium 4.12, non-headless mode, no cache and tor, but I still get the identical error. Do I have to give up for now and wait for improvements in undetected-chromedriver, or is there something else I can try?

mhd0528 commented 9 months ago

I got it working with the same settings as yours. Are you setting everything up on a Linux system? Or maybe try the latest version of undetected-chromedriver (3.5.2, I think), as mentioned in another issue?

Gibranium commented 9 months ago

@mhd0528 Nope, I'm on a Mac M2. I've tried undetected-chromedriver 3.5.2, but it requires Chrome 114, and I don't think that version is supported on Apple silicon. Maybe Apple silicon is the problem? Until May I used an Intel Mac and everything worked perfectly.

mhd0528 commented 9 months ago

Yeah, I think that's possible. I saw other issues reporting that the Chrome version can also cause problems.

marcjbaron commented 9 months ago

Sorry if this is not the place for it, but the most recent release (#412) still gives the same issues with Selenium (>=4.11) and undetected-chromedriver (3.5.3).

Running with headless=True gives the expected error message:

ValueError: Headless mode is not supported for Selenium 4.13.0 and above. Please downgrade to a lower version of Selenium or set 'headless=False'.

Running with headless=False gives an error message similar to the one in a previous comment from Gibranium when running in headless mode; to be specific, running the command

import soccerdata as sd
ws = sd.WhoScored(leagues=["USA-MLS"], seasons=["2023"], no_cache=True, headless=False, proxy='tor')

gives the following error: Screenshot from 2023-10-14 12-42-55

After 5 such attempts, the usual error is raised:

File "FOLDER/.analytics-env/lib/python3.10/site-packages/soccerdata/_common.py", line 463, in _download_and_save
    raise ConnectionError("Could not download %s." % url)

ConnectionError: Could not download https://www.whoscored.com.
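To illustrate why the final ConnectionError only appears after several failed attempts, here is a minimal sketch of a retry loop. It is not soccerdata's actual implementation: the function name download_with_retries, the fetch callable and the delay parameter are assumptions for illustration; only the attempt count of 5 and the error message come from the report above.

```python
import time


def download_with_retries(fetch, url, max_retries=5, delay=0.0):
    """Call fetch(url) up to max_retries times; raise ConnectionError if all fail."""
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            # Log the intermediate failure and try again after an optional pause.
            print(f"Attempt {attempt}/{max_retries} failed: {exc}")
            time.sleep(delay)
    # Every attempt failed, so surface a single summary error.
    raise ConnectionError("Could not download %s." % url)
```

With a pattern like this, the per-attempt errors (such as a bot-detection page) are the real cause, and the ConnectionError at the end is only the summary.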

probberechts commented 9 months ago

@marcjbaron There were two different issues discussed in this thread:

  1. Selenium 4.13 no longer supports headless mode. As a consequence, running the whoscored scraper with "headless=True" resulted in a crash. This is fixed in the latest release of soccerdata.
  2. The undetected-chromedriver library is used to patch selenium so that it does not trigger anti-bot services. Currently, this does not seem to work, or is not sufficient, for everyone (it works fine for me, though). The only thing I can do here is wait until undetected-chromedriver gets updated. One thing you could try is not using tor: some IPs of tor nodes might have been blacklisted.
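The two points above can be combined into a small guard that picks scraper options from the installed Selenium version. This is a sketch under stated assumptions: the 4.13 threshold is taken from the error message quoted earlier in the thread, the helper names headless_supported and whoscored_options are hypothetical, and the keyword names headless and proxy mirror the sd.WhoScored call shown above (with proxy=None assumed to mean "no proxy").

```python
import re


def parse_major_minor(version_string):
    """Extract (major, minor) as integers from a version string such as '4.13.0'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def headless_supported(selenium_version):
    """Headless mode is reported as unsupported for Selenium 4.13.0 and above."""
    return parse_major_minor(selenium_version) < (4, 13)


def whoscored_options(selenium_version, want_headless=True, use_tor=False):
    """Build keyword arguments, falling back to non-headless mode when required."""
    return {
        "headless": want_headless and headless_supported(selenium_version),
        "proxy": "tor" if use_tor else None,  # assumption: None disables the proxy
    }
```

The resulting dictionary could then be passed along as sd.WhoScored(leagues=..., seasons=..., **options); leaving use_tor at False follows the suggestion to avoid possibly blacklisted tor exit nodes.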