jldbc / pybaseball

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
MIT License

pitching_stats_range consistently failing to fetch data #332

Open JGHB opened 1 year ago

JGHB commented 1 year ago

I have discovered that pitching_stats_range regularly fails to fetch data. Calling the function for any date range usually results in an index out of range error. Occasionally the call succeeds, but this rarely happens. Has anyone else encountered this bug?

Here is the error output for your reference: [screenshot: Screen Shot 2023-03-05 at 5 02 30 PM]

tjburch commented 1 year ago

The query works for me:

Python 3.10.6 (main, Aug 30 2022, 05:12:36) [Clang 13.1.6 (clang-1316.0.21.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pybaseball import pitching_stats_range
>>> pitching_stats_range("2021-04-03")
                Name  Age  #days     Lev         Date             Tm  ...    PU   WHIP  BAbip   SO9  SO/W   mlbID
1    Tyler Alexander   26    700  Maj-AL  Apr 3, 2021        Detroit  ...  0.00  3.000  0.667  13.5   NaN  641302
2      Yency Almonte   27    700  Maj-NL  Apr 3, 2021       Colorado  ...  0.00  3.000  0.750  18.0   NaN  622075
3       Jose Alvarez   32    700  Maj-NL  Apr 3, 2021  San Francisco  ...  0.00  0.000  0.000   9.0   NaN  501625
4     Tyler Anderson   31    700  Maj-NL  Apr 3, 2021     Pittsburgh  ...  0.07  1.400  0.308  12.6   3.5  542881
5       Chris Archer   32    700  Maj-AL  Apr 3, 2021      Tampa Bay  ...  0.00  2.500  0.500   9.0   2.0  502042
..               ...  ...    ...     ...          ...            ...  ...   ...    ...    ...   ...   ...     ...
119      Matt Wisler   28    700  Maj-NL  Apr 3, 2021  San Francisco  ...   NaN  0.000    NaN  27.0   NaN  605538
120    Nick Wittgren   30    700  Maj-AL  Apr 3, 2021      Cleveland  ...  0.00  7.500  0.600   0.0   0.0  621295
121    Jake Woodford   24    700  Maj-NL  Apr 3, 2021      St. Louis  ...  0.00  1.714  0.400  11.6   1.5  663765
122  Brandon Workman   32    700  Maj-NL  Apr 3, 2021        Chicago  ...  0.00  0.000  0.000  18.0   NaN  519443
123     Huascar Ynoa   23    700  Maj-NL  Apr 3, 2021        Atlanta  ...  0.00  1.000  0.250   0.0   NaN  660623

[119 rows x 45 columns]

The error seems to indicate that the table wasn't fetched correctly. First, I'd confirm your internet is connected, then purge and/or disable your cache and try again:

from pybaseball import cache
cache.disable()

or

from pybaseball import cache
cache.purge()

If that fails, try printing out the URL that's being called and/or the soup that's returned to see why there isn't a table in there.

JTMachen commented 1 year ago

I get the same error. It happens if I try to call either the batting_stats_range or pitching_stats_range function more than four times. I tried both purging and disabling the cache, but neither appears to solve the problem.

tjburch commented 1 year ago

Can you list what version you're running? If not 2.2.5, upgrade and confirm it still happens there.

JTMachen commented 1 year ago

I'm running 2.2.5 and I got to seven function calls, but I still get the "list index out of range" error.

tjburch commented 1 year ago

You're probably hitting request limits. Try putting a sleep of a few seconds before calling a bunch in a loop.
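For anyone who wants to try the sleep suggestion in a structured way, here is a minimal sketch of a retry wrapper with growing pauses. The name call_with_backoff and its parameters (attempts, base_delay, sleep) are made up for illustration and are not part of pybaseball. It assumes the failure surfaces as an IndexError, as reported in this thread, and it is not guaranteed to get around the site's limits.

```python
import time


def call_with_backoff(fetch, *args, attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call fetch(*args), pausing longer after each IndexError before retrying.

    Hypothetical helper, not part of pybaseball. The pause doubles on every
    failed attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(attempts):
        try:
            return fetch(*args)
        except IndexError:
            # Out of attempts: let the caller see the original error.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Possible usage (assumes pybaseball is installed; left commented out so the
# sketch runs without network access):
# from pybaseball import pitching_stats_range
# df = call_with_backoff(pitching_stats_range, "2021-04-03", "2021-04-03")
```

The sleep function is passed in as a parameter only so the wrapper can be exercised without actually waiting.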

tjburch commented 1 year ago

Any update here?

JTMachen commented 1 year ago

Sleeping doesn't do it. I tried sleeping for up to 20 seconds between each pull and I'd still get the error. Even walking away for a couple of hours didn't help; I kept getting the index error. Might be a pulls-per-day limit.

tjburch commented 1 year ago

That is very strange. Usually the cooldown is like an hour or two.

I would try the following:

In the get_soup function, add a URL printout right after the URL is built (after line 21 here): just add print(url), then paste that URL into your browser to check that it's valid and that the page has a table on it.

If that's OK, then I'd add a print(soup) right before the error, after the get_soup call (after line 63 here). The output is going to be a mess, but it might have some information: if there's an error in the response, it's usually findable in there.
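The same check can be done outside the library. Below is a rough sketch that fetches a URL and counts the table tags in the response, using only the standard library rather than the BeautifulSoup soup mentioned above. TableCounter, count_tables, and inspect_url are hypothetical names, and the User-Agent header is an assumption about what the site accepts.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_tables(html):
    """Return how many <table> tags appear in the given HTML string."""
    parser = TableCounter()
    parser.feed(html)
    return parser.tables


def inspect_url(url):
    """Fetch url and return (HTTP status, number of tables found).

    urlopen raises urllib.error.HTTPError for responses such as 429,
    so a rate-limited request shows up as an exception here.
    """
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # assumed header
    with urlopen(request, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
        return response.status, count_tables(html)
```

A table count of zero would be consistent with the "list index out of range" error if the library indexes into the list of tables it finds, but that link is an assumption worth confirming against the print(soup) output.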

tbryan2 commented 1 year ago

I'm also experiencing this issue with batting_stats_range(). A few calls and then I get a list index out of range error.

If the requests are so tightly constricted here, does anyone know where I can get game level batting and pitching statistics? It seems like all the functions in this library automatically sum the data in given ranges...

JTMachen commented 1 year ago

I usually loop through the dates I'm looking for, calling the range functions for each single day and concat the dataframes into one large one. But there isn't a way to take the start date and end date and get the single games that way.

Update on the pulls. The URL keeps changing, but it's something similar to:

https://www.baseball-reference.com/leagues/daily.fcgi?user_team=&bust_cache=&type=b&lastndays=7&dates=fromandto&fromandto=2022-04-07.2022-04-07&level=mlb&franch=&stat=&stat_value=0

with other dates, like 2022-06-15, 2022-09-10, etc., appearing in place of 2022-04-07, so the URLs are all valid. When I find the URL it breaks on and try to fetch and parse it with BeautifulSoup on my end, I end up with an "HTTPError: HTTP Error 429: Too Many Requests." This error doesn't go away unless I wait several hours before trying again.
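If the underlying problem is a 429, it may help to look at the error object before retrying. This is a small sketch, assuming the error is a urllib.error.HTTPError like the one quoted above. The helper name retry_after_seconds is hypothetical, whether Baseball Reference sends a Retry-After header at all is unverified, and the one-hour default is only a guess based on the cooldown mentioned earlier in this thread.

```python
from email.message import Message
from urllib.error import HTTPError


def retry_after_seconds(error, default=3600):
    """Return a wait hint in seconds for a 429 error, or None for other codes.

    Uses the Retry-After header when it holds a whole number of seconds;
    otherwise falls back to `default` (an assumed one-hour cooldown).
    """
    if error.code != 429:
        return None
    value = error.headers.get("Retry-After") if error.headers is not None else None
    if value and value.strip().isdigit():
        return int(value)
    return default


# Illustration with a hand-built error object, so no network access is needed.
headers = Message()
headers["Retry-After"] = "120"
example = HTTPError("https://example.invalid", 429, "Too Many Requests", headers, None)
print(retry_after_seconds(example))
```

In a real loop, the returned value could be passed to time.sleep before the next request.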

tbryan2 commented 1 year ago

Right, I'm doing the concat method you mentioned. I'm even sleeping for 5-10 seconds and outputting to a local Postgres database.

No matter what, if I make more than 4-5 requests I am getting the IndexError. Is Baseball Reference really that stingy with requests?

JGHB commented 1 year ago

I was also having the same issue with batting_stats_range, but sleeping seems to have fixed it for me. Here's what's working for me; the same code works for batting_stats_range. Before I added sleeping to my code, I had the same experience: I could fetch four or five times, but then I would get locked out for a number of hours.


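import time
from datetime import date, timedelta

import pandas as pd
from pybaseball import pitching_stats_range

daily_frames = []  # one DataFrame per successfully fetched day

day = date(2021, 4, 1)
end = date(2021, 6, 30)

# Walk real calendar days so invalid dates (e.g. April 31) are never requested.
while day <= end:
    time.sleep(10)  # pause between requests to reduce the chance of a lockout
    day_str = day.isoformat()  # 'YYYY-MM-DD'
    try:
        temp = pitching_stats_range(day_str, day_str)
    except Exception:
        print(day_str + " Failed")
        temp = pd.DataFrame()
    if len(temp) > 0:
        print(day_str + " Success")
        daily_frames.append(temp.assign(Date=day_str))
    day += timedelta(days=1)

# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead.
all_pitching_stats = pd.concat(daily_frames, ignore_index=True) if daily_frames else pd.DataFrame()

klatta87 commented 1 year ago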

I've gotten about halfway through a full season (~100 loops) before this fails, and I'm sleeping 15 seconds between calls. Will keep trying things.

JTMachen commented 11 months ago

This issue has gotten worse. I can only pull a single day's worth of data. I tried sleeping for upwards of 15 seconds, but it only pulls a single day's worth of data before throwing the same error.