JGHB opened this issue 1 year ago
The query works for me:
Python 3.10.6 (main, Aug 30 2022, 05:12:36) [Clang 13.1.6 (clang-1316.0.21.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybaseball import pitching_stats_range
>>> pitching_stats_range("2021-04-03")
Name Age #days Lev Date Tm ... PU WHIP BAbip SO9 SO/W mlbID
1 Tyler Alexander 26 700 Maj-AL Apr 3, 2021 Detroit ... 0.00 3.000 0.667 13.5 NaN 641302
2 Yency Almonte 27 700 Maj-NL Apr 3, 2021 Colorado ... 0.00 3.000 0.750 18.0 NaN 622075
3 Jose Alvarez 32 700 Maj-NL Apr 3, 2021 San Francisco ... 0.00 0.000 0.000 9.0 NaN 501625
4 Tyler Anderson 31 700 Maj-NL Apr 3, 2021 Pittsburgh ... 0.07 1.400 0.308 12.6 3.5 542881
5 Chris Archer 32 700 Maj-AL Apr 3, 2021 Tampa Bay ... 0.00 2.500 0.500 9.0 2.0 502042
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
119 Matt Wisler 28 700 Maj-NL Apr 3, 2021 San Francisco ... NaN 0.000 NaN 27.0 NaN 605538
120 Nick Wittgren 30 700 Maj-AL Apr 3, 2021 Cleveland ... 0.00 7.500 0.600 0.0 0.0 621295
121 Jake Woodford 24 700 Maj-NL Apr 3, 2021 St. Louis ... 0.00 1.714 0.400 11.6 1.5 663765
122 Brandon Workman 32 700 Maj-NL Apr 3, 2021 Chicago ... 0.00 0.000 0.000 18.0 NaN 519443
123 Huascar Ynoa 23 700 Maj-NL Apr 3, 2021 Atlanta ... 0.00 1.000 0.250 0.0 NaN 660623
[119 rows x 45 columns]
The error seems to indicate that it didn't fetch the table right. First, I'd confirm your internet is connected, then purge and/or disable your cache and try again:
from pybaseball import cache
cache.disable()
or
from pybaseball import cache
cache.purge()
If that fails try printing out the URL that's being called and/or the soup that's returned to see why there isn't a table in there
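To check the second half of that suggestion without pybaseball's internals, here is a minimal stdlib sketch (this is not pybaseball's `get_soup`; `TableCounter` and `has_table` are illustrative names) for confirming whether a fetched page actually contains a `<table>`:

```python
from html.parser import HTMLParser

class TableCounter(HTMLParser):
    """Count <table> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1

def has_table(html: str) -> bool:
    counter = TableCounter()
    counter.feed(html)
    return counter.tables > 0

# In practice you would feed this the body fetched from the printed URL;
# inline samples are used here so the sketch runs standalone.
print(has_table("<html><body><table><tr><td>1</td></tr></table></body></html>"))
print(has_table("<html><body><p>429 Too Many Requests</p></body></html>"))
```

If `has_table` comes back `False` for the printed URL's response, the "list index out of range" is just the parser finding zero tables in an error page.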
I get the same error. It happens if I try to call either the batting_stats_range or pitching_stats_range functions more than four times. I tried both purging and disabling the cache, but neither appears to solve the problem.
Can you list what version you're running? If not 2.2.5, upgrade and confirm it still happens there.
I'm running 2.2.5 and I got to seven function calls, but I still get the "list index out of range" error
You're probably hitting request limits. Try putting a sleep of a few seconds before calling a bunch in a loop.
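That advice can be sketched as a simple throttled loop. `fetch_day` below is a stand-in for `pitching_stats_range(day, day)` so the example runs without hitting Baseball Reference; swap in the real call in practice.

```python
import time

def fetch_day(day):
    # stand-in for pitching_stats_range(day, day); returns a dummy record
    # so this sketch runs without network access
    return {"date": day, "rows": 100}

dates = ["2021-04-03", "2021-04-04", "2021-04-05"]  # illustrative dates
results = []
for day in dates:
    time.sleep(1)  # use several seconds per request in a real run
    results.append(fetch_day(day))

print(len(results))  # 3
```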
Any update here?
Sleeping doesn't do it. I tried sleeping for up to 20 seconds between each pull and I'd still get the error. Even walking away for a couple of hours didn't help; I kept getting the index error. It might be a pulls-per-day limit.
That is very strange. Usually the cooldown is like an hour or two.
I would try the following: in the get_soup function, add a URL printout after the URL is built (after line 21 here), just print(url), and then enter that URL into your browser to see whether the URL you're passing is valid and whether there's a table on the page. If that's ok, then I'd add a print(soup) right before the error, after the get_soup call (after line 63 here). That's going to be a mess, but it might have some information; if there's an error in the response, it's usually findable in there.
I'm also experiencing this issue with batting_stats_range(). A few calls and then I get a list index out of range error.
If the requests are so tightly constrained here, does anyone know where I can get game-level batting and pitching statistics? It seems like all the functions in this library automatically aggregate the data over a given range...
I usually loop through the dates I'm looking for, calling the range functions for each single day and concat the dataframes into one large one. But there isn't a way to take the start date and end date and get the single games that way.
Update on the pulls. The URL keeps changing, but it's something similar to:
with the 2022-04-07 occurring on other dates, like 2022-06-15, 2022-9-10, etc., so the URLs are all valid. When I take the URL it breaks on and run BeautifulSoup against it myself, I end up with an "HTTPError: HTTP Error 429: Too Many Requests." The error doesn't go away unless I wait several hours before trying again.
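Since the underlying failure is a 429, one common mitigation is retrying with exponential backoff. A minimal sketch, assuming you wrap the pybaseball call in a zero-argument callable (e.g. `lambda: pitching_stats_range("2022-04-07", "2022-04-07")`); `fetch_with_backoff` and `flaky` are illustrative names, not part of pybaseball:

```python
import time

def fetch_with_backoff(fetch, max_tries=5, base_delay=2.0):
    """Call fetch(); on failure, sleep with exponential backoff and retry,
    re-raising after max_tries attempts."""
    for attempt in range(max_tries):
        try:
            return fetch()
        except Exception:
            if attempt == max_tries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Demonstration with a stub that fails twice before succeeding, standing in
# for a call that gets rate-limited and then recovers.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IndexError("list index out of range")
    return "ok"

print(fetch_with_backoff(flaky, base_delay=0.01))  # ok
```

Note that backoff only helps with short cooldowns; if the site locks you out for hours after a handful of requests, no retry schedule will get around that.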
Right, I'm doing the concat method you mentioned. I'm even sleeping for 5-10 seconds and outputting to a local Postgres database.
No matter what, if I make more than 4-5 requests I am getting the IndexError. Is Baseball Reference really that stingy with requests?
I was also having the same issue with batting_stats_range, but sleeping seems to have fixed it for me. Here's what's working for me; the same code works for batting_stats_range. Before adding sleeps to my code, I was having the same experience wherein I could fetch four or five times but then I would get locked out for a number of hours.
import time
import pandas as pd
from pybaseball import pitching_stats_range

frames = []
for month in range(4, 7):
    for i in range(1, 32):
        time.sleep(10)
        # April and June only have 30 days, so skip the 31st
        if month in (4, 6) and i == 31:
            continue
        day = f"2021-{month:02d}-{i:02d}"
        temp = pd.DataFrame([])
        try:
            temp = pitching_stats_range(day, day)
        except Exception:
            print(day + " Failed")
        if len(temp) > 0:
            print(day + " Success")
            frames.append(temp.assign(Date=day))
# DataFrame.append is deprecated; collect the frames and concat once
all_pitching_stats = pd.concat(frames, ignore_index=True)
I've gotten about halfway through a full season (~100 loops) before this fails, and I'm sleeping 15 seconds between calls. Will keep trying things.
This issue has gotten worse. I can now only pull a single day's worth of data before the same error is thrown, even with sleeps of upwards of 15 seconds between calls.
I have discovered that pitching_stats_range will not fetch data. Calling the function for any date range regularly results in an index out of range error. Occasionally a call succeeds, but only rarely. Has anyone else encountered this bug?
Here is the error output for your reference: