jldbc / pybaseball

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
MIT License

statcast throwing KeyError on certain dates in 2023 #375

Open Haman-Karn opened 1 year ago

Haman-Karn commented 1 year ago

While downloading all of the Statcast data, I kept getting an error at around 98%. I eventually narrowed it down to 2023-06-25 as the first problematic day. Later days also cause the error, but I stopped at 06-25 because that amount of data is good enough for my current purposes.

The code I'm executing is this: stats = statcast(start_dt="2023-06-25")

Upon execution, my terminal looks like this:

This is a large query, it may take a moment to complete
  0%|                                                                                                                                   | 0/1 [00:00<?, ?it/s] 
Traceback (most recent call last):
  File "c:\Users\nosoa\Documents\glb\getstats.py", line 6, in <module>
    stats = statcast(start_dt="2023-06-25")
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\statcast.py", line 113, in statcast
    return _handle_request(start_dt_date, end_dt_date, 1, verbose=verbose,
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\statcast.py", line 76, in _handle_request
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\_base.py", line 449, in result
    return self.__get_result()
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache.py", line 58, in _cached
    result = func(*args, **kwargs)
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\statcast.py", line 31, in _small_request
    data = data.sort_values(
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pandas\core\frame.py", line 6740, in sort_values
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pandas\core\frame.py", line 6740, in <listcomp>
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pandas\core\generic.py", line 1778, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'game_date'

ss77995ss commented 1 year ago

I tested stats = statcast(start_dt="2023-06-25") on both Colab (Python 3.10.12) and my local environment (Python 3.11.2), and it worked fine on both.

Judging by the dataframe_list.append(future.result()) call in the error message, I'd guess something went wrong in concurrent mode.

Maybe turning off parallel mode will help?

stats = statcast(start_dt="2023-06-25", parallel=False)

Haman-Karn commented 1 year ago

I discovered the issue -- there must have been something corrupted in the cache. Disabling the cache fixed the problem. But attempting to purge the cache also results in an error.

Traceback (most recent call last):
  File "c:\Users\nosoa\Documents\glb\getstats.py", line 5, in <module>
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache.py", line 31, in purge
    records = [cache_record.CacheRecord(filename) for filename in record_files]
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache.py", line 31, in <listcomp>
    records = [cache_record.CacheRecord(filename) for filename in record_files]
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\cache_record.py", line 23, in __init__
    self.data = cast(Dict[str, Any], file_utils.load_json(filename))
  File "c:\Users\nosoa\Documents\glb\venv\Lib\site-packages\pybaseball\cache\file_utils.py", line 28, in load_json
    return cast(JSONData, json.load(json_file))
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\nosoa\AppData\Local\Programs\Python\Python311\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 37 (char 36)

ss77995ss commented 1 year ago

I found that some cache files were not saved completely, so cache.purge() cannot parse them. In my case, every file whose name starts with _small_request contains only

{"func": "_small_request", "args": [

Because that is not valid JSON, parsing it raises a decode error.
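
For illustration, the decode error can be reproduced with only the standard library. This is a minimal sketch: the helper name json_error is hypothetical and not part of pybaseball, and the truncated string is the content quoted above.

```python
import json
from typing import Optional


def json_error(text: str) -> Optional[str]:
    """Return the decoder's error message for text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return str(err)
    return None


# Content of a truncated cache record, as quoted in this thread.
truncated = '{"func": "_small_request", "args": ['

# The parser reaches the end of the input while still expecting a value
# inside the unclosed list.
print(json_error(truncated))  # Expecting value: line 1 column 37 (char 36)
```

The message matches the one in the purge traceback, which is consistent with the cache record having been cut off right after the opening bracket of "args".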

You can find the cache files in /Users/{user_name}/.pybaseball/cache, or in /root/.pybaseball/cache on Colab.

IMO, for now the only option is to delete those invalid cache files manually, since they also do not contain an expiry time.
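
The manual cleanup could be scripted roughly as follows. This is a sketch under assumptions, not pybaseball's own tooling: it assumes the cache records are the .json files directly inside the cache directory mentioned above, and the function names find_invalid_json and remove_invalid_json are made up for this example. It defaults to a dry run so nothing is deleted until you have checked the list.

```python
import json
from pathlib import Path


def find_invalid_json(cache_dir: Path) -> list[Path]:
    """Return the .json files directly under cache_dir that fail to parse."""
    invalid = []
    for path in sorted(cache_dir.glob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                json.load(fh)
        except (json.JSONDecodeError, UnicodeDecodeError):
            invalid.append(path)
    return invalid


def remove_invalid_json(cache_dir: Path, dry_run: bool = True) -> list[Path]:
    """List unparseable .json files; delete them only when dry_run is False."""
    invalid = find_invalid_json(cache_dir)
    if not dry_run:
        for path in invalid:
            path.unlink()
    return invalid


if __name__ == "__main__":
    # Assumed default location, taken from the path given in this thread.
    cache_dir = Path.home() / ".pybaseball" / "cache"
    for path in remove_invalid_json(cache_dir, dry_run=True):
        print(path)
```

Once the printed list looks right, calling remove_invalid_json(cache_dir, dry_run=False) deletes those files. Any data files that belonged to the deleted records are left in place by this sketch.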

ss77995ss commented 2 months ago

Should be fixed in #438