probberechts / soccerdata

⛏⚽ Scrape soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, FotMob, Sofascore, SoFIFA, Understat and WhoScored.
https://soccerdata.readthedocs.io/en/latest/

'read_events' Function Ignoring 'live=False' Parameter and Issues with Group Stage vs Knockout Stage HTML Structure #619

Open ds-oliver opened 4 days ago

ds-oliver commented 4 days ago

While using the soccerdata library to scrape event data from WhoScored, I found that the read_events function appears to ignore the live=False parameter. Even with live=False set explicitly, the function requests the live URL, which results in repeated errors. (Please ignore the "priority game" messages in the logs; they are carried over from another project and I did not remove them from this script.)

Here are some relevant details:

  1. Script Parameters and Logs:

    • The script sets live=False for the read_events function.
    • However, the logs indicate that the function tries to access the live URL: https://www.whoscored.com/Matches/1787316/Live.
    INFO     Setting read_events params: match_id=1729479, output_fmt=spadl, force_cache=False, live=False             scrape_euros.py:133
    INFO     Could not find priority game 1729479.                                                                      scrape_euros.py:151
    INFO     Processing home team: Scotland [424]                                                                       scrape_euros.py:160
    INFO     Processing away team: Hungary [327]                                                                        scrape_euros.py:161
    INFO     Processing game 1787316...                                                                                 scrape_euros.py:168
    ERROR    Error while scraping https://www.whoscored.com/Matches/1787316/Live. Retrying in 0 seconds... (attempt 1 of 5). _common.py:469
  2. HTML Structure for Group Stage vs Knockouts:

    • Another observation that might be relevant is the difference in HTML structure when accessing group stage games versus knockout stage games. This difference could potentially affect the scraping process.

Steps to Reproduce:

  1. Set up a script to scrape event data using the soccerdata library.
  2. Ensure the read_events function has live=False.
  3. Run the script and observe the logs.

Expected Behavior: The read_events function should not attempt to access the live URL when live=False is set.

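Actual Behavior: The function requests the live URL (https://www.whoscored.com/Matches/1787316/Live), and each attempt fails with a selenium JavascriptException ("requirejs is not defined") before being retried.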

Logs:

[06/27/24 10:19:08] INFO     Custom team name replacements loaded from                                                   _config.py:85
                             /Users/hogan/soccerdata/config/teamname_replacements.json.
[06/27/24 10:19:11] INFO     Saving cached data to /Users/hogan/soccerdata/v2                                            _common.py:92
[06/27/24 10:19:18] INFO     Team ID Map: {'Germany': 336, 'Scotland': 424, 'Hungary': 327, 'Switzerland': 423,    scrape_euros.py:298
                             'Albania': 814, 'Italy': 343, 'Spain': 338, 'Croatia': 337, 'Poland': 342,
                             'Netherlands': 335, 'England': 345, 'Serbia': 771, 'Denmark': 425, 'Slovenia': 464,
                             'Austria': 324, 'France': 341, 'Belgium': 339, 'Slovakia': 484, 'Ukraine': 462,
                             'Romania': 412, 'Portugal': 340, 'Czechia': 332, 'Georgia': 413, 'Turkiye': 333}
[06/27/24 10:38:01] ERROR    Error while scraping https://www.whoscored.com/Matches/1787316/Live. Retrying in 0 seconds... (attempt 1 of 5). _common.py:469
                             Traceback (most recent call last):
                               File "/Users/hogan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/soccerdata/_common.py", line 460, in
                             _download_and_save
                                 response = json.dumps(self._driver.execute_script("return " + var)).encode(
                               File
                             "/Users/hogan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line
                             408, in execute_script
                                 return self.execute(command, {"script": script, "args": converted_args})["value"]
                               File
                             "/Users/hogan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line
                             348, in execute
                                 self.error_handler.check_response(response)
                               File
                             "/Users/hogan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py",
                             line 229, in check_response
                                 raise exception_class(message, screen, stacktrace)
                             selenium.common.exceptions.JavascriptException: Message: javascript error: requirejs is not defined
                               (Session info: chrome=126.0.6478.127)
                             Stacktrace:
                             0   undetected_chromedriver             0x000000010d5230e8 undetected_chromedriver + 5169384
                             1   undetected_chromedriver             0x000000010d51afba undetected_chromedriver + 5136314
                             2   undetected_chromedriver             0x000000010d09736c undetected_chromedriver + 402284
                             3   undetected_chromedriver             0x000000010d09cb99 undetected_chromedriver + 424857
                             4   undetected_chromedriver             0x000000010d09ec2c undetected_chromedriver + 433196
                             5   undetected_chromedriver             0x000000010d127ee8 undetected_chromedriver + 995048
                             6   undetected_chromedriver             0x000000010d107ab2 undetected_chromedriver + 862898
                             7   undetected_chromedriver             0x000000010d126f57 undetected_chromedriver + 991063
                             8   undetected_chromedriver             0x000000010d107853 undetected_chromedriver + 862291
                             9   undetected_chromedriver             0x000000010d0d75c6 undetected_chromedriver + 665030
                             10  undetected_chromedriver             0x000000010d0d7e4e undetected_chromedriver + 667214
                             11  undetected_chromedriver             0x000000010d4e5d00 undetected_chromedriver + 4918528
                             12  undetected_chromedriver             0x000000010d4eacfd undetected_chromedriver + 4939005
                             13  undetected_chromedriver             0x000000010d4eb3d5 undetected_chromedriver + 4940757
                             14  undetected_chromedriver             0x000000010d4c6de4 undetected_chromedriver + 4791780
                             15  undetected_chromedriver             0x000000010d4eb6c9 undetected_chromedriver + 4941513
                             16  undetected_chromedriver             0x000000010d4b85b4 undetected_chromedriver + 4732340
                             17  undetected_chromedriver             0x000000010d50b898 undetected_chromedriver + 5073048
                             18  undetected_chromedriver             0x000000010d50ba57 undetected_chromedriver + 5073495
                             19  undetected_chromedriver             0x000000010d51ab6e undetected_chromedriver + 5135214
                             20  libsystem_pthread.dylib             0x00007ff819c0418b _pthread_start + 99
                             21  libsystem_pthread.dylib             0x00007ff819bffae3 thread_start + 15
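
To check which match pages were requested without reading the full traceback, the log text can be scanned for WhoScored URLs. The following is only a diagnostic sketch: `find_scraped_urls` and `sample_log` are hypothetical names introduced here, the sample lines are copied from the logs in this issue, and nothing in it calls soccerdata itself.

```python
import re

# Matches WhoScored match URLs as they appear in the log lines above and
# captures the numeric game ID plus an optional trailing "/Live" segment.
URL_PATTERN = re.compile(r"https://www\.whoscored\.com/Matches/(\d+)(/Live)?")


def find_scraped_urls(log_text):
    """Return a list of (game_id, is_live_url) tuples found in log_text."""
    return [
        (int(game_id), bool(live_suffix))
        for game_id, live_suffix in URL_PATTERN.findall(log_text)
    ]


# Two lines copied from the logs in this issue (file/line suffixes trimmed).
sample_log = (
    "INFO     Setting read_events params: match_id=1729479, output_fmt=spadl, "
    "force_cache=False, live=False\n"
    "ERROR    Error while scraping https://www.whoscored.com/Matches/1787316/Live. "
    "Retrying in 0 seconds... (attempt 1 of 5).\n"
)

for game_id, is_live_url in find_scraped_urls(sample_log):
    print(f"game {game_id}: live URL requested = {is_live_url}")
```

Running this over the complete log file would list every game ID for which a /Live URL was requested, which can then be compared against the live value logged just before each call.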

Code:

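# Relevant snippet showing the function call.
# ws is the WhoScored scraper object; live is False here, as shown by the
# "Setting read_events params" log line above.
ws.read_events(
    match_id=game_id,
    output_fmt=output_fmt,
    force_cache=force_cache,
    live=live,
)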
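
Because the call passes the variable live rather than a literal, it may be worth confirming what value actually reaches read_events. One way to do that is a small recording wrapper, sketched below. `record_calls` and `fake_read_events` are hypothetical helpers: the fake function only stands in for ws.read_events so the example runs offline, and its signature is an assumption based on the keyword arguments used in the snippet above.

```python
import functools


def record_calls(func, calls):
    """Wrap func so the keyword arguments of every call are appended to calls."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        calls.append(dict(kwargs))
        return func(*args, **kwargs)

    return wrapper


# Hypothetical stand-in for ws.read_events so this sketch runs without a
# browser or network access; it only echoes back what it received.
def fake_read_events(match_id=None, output_fmt="events", force_cache=False, live=False):
    return {"match_id": match_id, "live": live}


calls = []
read_events = record_calls(fake_read_events, calls)
read_events(match_id=1787316, output_fmt="spadl", force_cache=False, live=False)

# Inspect the keyword arguments of the most recent call.
print(calls[-1]["live"])
```

In the real script, the same wrapper could be applied with `ws.read_events = record_calls(ws.read_events, calls)` before the scraping loop; if the recorded value is False while the /Live URL is still requested, the behavior originates inside the library rather than in the calling script.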

Additional Context: As noted above, group stage and knockout stage match pages appear to use different HTML structures, which may be contributing to the issue.

Environment (taken from the traceback above): Python 3.10.4 via pyenv, Chrome 126.0.6478.127, macOS.

Potential Fix: Please investigate why read_events requests the live URL when live=False is set. It may also be worth checking whether differences in HTML structure between group stage and knockout stage match pages affect scraping.

Thank you for your attention to this issue. Let me know if you need any additional information.