jldbc / pybaseball

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
MIT License

Issue with batting_stats_range() function - IndexError: list index out of range #364

Open MWTyrone opened 1 year ago

MWTyrone commented 1 year ago

Hello,

I'm new to Python and pybaseball, and I've encountered an issue when trying to use the batting_stats_range() function to retrieve batting stats.

Here is the code I'm using:

from pybaseball import batting_stats_range
import pandas as pd

# Define the date range for the batting stats
batting_data = batting_stats_range('2018-01-01', '2023-12-31')

# Save the data to a CSV file
batting_data.to_csv('batting_data_2018_2023.csv', index=False)

When I run this code, I receive the following error:

Traceback (most recent call last):
  File "[directory]\batting_stats_pull.py", line 5, in <module>
    batting_data = batting_stats_range('2018-01-01', '2023-12-31')
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[directory]\env\Lib\site-packages\pybaseball\league_batting_stats.py", line 62, in batting_stats_range
    table = get_table(soup)
            ^^^^^^^^^^^^^^^
  File "[directory]\env\Lib\site-packages\pybaseball\league_batting_stats.py", line 28, in get_table
    table = soup.find_all('table')[0]
IndexError: list index out of range

I've attempted to update pybaseball using pip install --upgrade pybaseball, but this didn't resolve the issue.

Upon looking into the source code, I found that the get_soup() function in league_batting_stats.py is using web scraping to retrieve data from a specific URL on the Baseball-Reference website. I'm wondering if the structure of the website has changed since the pybaseball library was last updated, or if there are measures in place on the website that prevent or limit web scraping?
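To illustrate the failure mode in the traceback: the line soup.find_all('table')[0] can only raise this IndexError when the returned page contains no table element at all. The sketch below uses only the standard library rather than BeautifulSoup. The helper count_tables and both sample pages are hypothetical, made up for illustration, and are not part of pybaseball or taken from Baseball-Reference.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts <table> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_tables(html: str) -> int:
    """Hypothetical helper: how many <table> elements does this page have?"""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up response bodies, for illustration only.
stats_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"
blocked_page = "<html><body><h1>Too Many Requests</h1></body></html>"

# An empty list stands in for what find_all('table') returns on a page
# without tables; indexing it reproduces the error from the traceback.
tables = []
try:
    tables[0]
except IndexError as exc:
    error_message = str(exc)
```

If the page that comes back is an error or rate-limit page rather than the stats page, it would have zero tables, which is consistent with the error above. That is an assumption about the cause, not something confirmed here.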

Apologies in advance if this is a known or simple issue - again, I'm new to this. Any help or guidance you can provide would be greatly appreciated.

Thank you,
MWT

JTMachen commented 1 year ago

So, I have some slightly good news, followed by some not so good news.

This is a known issue (see https://github.com/jldbc/pybaseball/issues/332). Unfortunately, the "fix" isn't all that great. For starters, I'm not sure how to pull that many dates at once. I wanted the stat line for each individual game day, so I went date by date. When I did that, I had to call sleep() between requests, which drastically increased the total run time. It's not a great solution, but I eventually got it to work.
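The date-by-date approach described above could look roughly like the sketch below. It is not the exact code used here: fetch_by_day and daily_dates are hypothetical helpers, the 5-second pause is an arbitrary assumption (this thread does not say how long a pause is needed), and treating an IndexError as "no data for that day" is also an assumption. The fetch function is passed in as an argument, so the loop can be tried with a stub before pointing it at the real site.

```python
import time
from datetime import date, timedelta


def daily_dates(start: date, end: date):
    """Yield every calendar date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def fetch_by_day(fetch, start, end, pause_seconds=5.0, sleep=time.sleep):
    """Call fetch(day, day) once per date, pausing between requests.

    fetch is any callable taking two 'YYYY-MM-DD' strings, for example
    pybaseball's batting_stats_range. pause_seconds is a guess, not a
    documented limit.
    """
    results = {}
    for i, day in enumerate(daily_dates(start, end)):
        if i > 0:
            sleep(pause_seconds)  # wait between requests, not before the first
        stamp = day.isoformat()
        try:
            results[stamp] = fetch(stamp, stamp)
        except IndexError:
            # Assumption: no table came back for this day (for example an
            # off day or a blocked request), so record it as missing.
            results[stamp] = None
    return results
```

With pybaseball installed, a call might look like fetch_by_day(batting_stats_range, date(2023, 4, 1), date(2023, 4, 3)). The days that returned data could then be combined with pandas.concat. At one request every few seconds, a multi-year range takes a long time, which matches the slowdown mentioned above.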

As for pulling all the stats over that long a period in a single call, I'm not sure there's a solution along the sleep() route, sorry.