jldbc / pybaseball

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
MIT License

download_lahman() failing #391

Open double-dose-larry opened 7 months ago

double-dose-larry commented 7 months ago

Hi All,

I'm running pybaseball 2.2.7

I'm trying to run pybaseball.people() and getting the following stack trace:

---------------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
Cell In[12], line 1
----> 1 download_lahman()

File ~/.local/lib/python3.10/site-packages/pybaseball/lahman.py:30, in download_lahman()
     28 def download_lahman():
     29     # download entire lahman db to present working directory
---> 30     z = get_lahman_zip()
     31     if z is not None:
     32         z.extractall(cache.config.cache_directory)

File ~/.local/lib/python3.10/site-packages/pybaseball/lahman.py:25, in get_lahman_zip()
     23 elif not _handle:
     24     s = requests.get(url, stream=True)
---> 25     _handle = ZipFile(BytesIO(s.content))
     26 return _handle

File /usr/lib/python3.10/zipfile.py:1269, in ZipFile.__init__(self, file, mode, compression, allowZip64, compresslevel, strict_timestamps)
   1267 try:
   1268     if mode == 'r':
-> 1269         self._RealGetContents()
   1270     elif mode in ('w', 'x'):
   1271         # set the modified flag so central directory gets written
   1272         # even if no files are added to the archive
   1273         self._didModify = True

File /usr/lib/python3.10/zipfile.py:1336, in ZipFile._RealGetContents(self)
   1334     raise BadZipFile("File is not a zip file")
   1335 if not endrec:
-> 1336     raise BadZipFile("File is not a zip file")
   1337 if self.debug > 1:
   1338     print(endrec)

BadZipFile: File is not a zip file

I dug around and saw that the code attempts to retrieve the data from here: https://github.com/chadwickbureau/baseballdatabank/archive/master.zip

That is leading to a dead link. Perhaps there was a change upstream.
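
For anyone hitting the same trace: the `BadZipFile` is consistent with the response body not being a zip archive at all (for example, an HTML error page returned for a dead link). Below is a minimal sketch of a guard that checks the bytes before handing them to `ZipFile`. `open_zip_bytes` is a hypothetical helper name used for illustration; it is not part of pybaseball.

```python
from io import BytesIO
from zipfile import ZipFile, is_zipfile


def open_zip_bytes(content: bytes, source: str = "<unknown>") -> ZipFile:
    """Return a ZipFile for `content`, or raise a descriptive error.

    Hypothetical helper (not pybaseball API): it only wraps the
    ZipFile(BytesIO(...)) call seen in the traceback with a sanity check.
    """
    buffer = BytesIO(content)
    # is_zipfile accepts a file-like object and looks for the zip
    # end-of-central-directory record.
    if not is_zipfile(buffer):
        preview = content[:60]
        raise ValueError(
            f"Response from {source} is not a zip archive; "
            f"first bytes: {preview!r}"
        )
    buffer.seek(0)
    return ZipFile(buffer)
```

Called as `open_zip_bytes(s.content, url)` in place of the bare `ZipFile(BytesIO(s.content))`, this would report what the server actually sent back instead of only "File is not a zip file".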

JSCjr commented 7 months ago

Similar issues here. The code will need an update to handle the new Chadwick register location and file structure (the people table has been split into multiple files).

blue-shoes commented 5 months ago

This is a separate issue from the Chadwick register (which I believe has been handled in PR #309). It looks like the chadwickbureau/baseballdatabank repository no longer exists, at least not publicly.

agpolivka commented 2 months ago

Has this issue been fixed? I dug into the code and came to the same conclusion, which is how I ended up on this page, but I don't see any follow-up or fix. I pulled the code pretty recently, so I was wondering whether anyone has fixed this or come up with a workaround.

JSCjr commented 2 months ago

Sean Lahman just posted an updated version of the database files at his own site, so this could presumably be fixed by pointing the code at those files instead.

blue-shoes commented 2 months ago

Linking to the files on his site looks fragile to me, since it relies on the naming convention in his personal Dropbox. The file is currently called lahman_1871-2023.csv, so one assumes this is not a static file name/path.
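
One way to sidestep the naming problem, sketched under assumptions: let the user download the archive by hand and point a helper at whatever the file happens to be called. This assumes the download is a zip archive of CSV files. `extract_local_lahman` and its arguments are hypothetical, not pybaseball API; the target directory would presumably be the `cache.config.cache_directory` seen in the traceback above.

```python
from pathlib import Path
from zipfile import ZipFile, is_zipfile


def extract_local_lahman(archive_path, target_dir):
    """Extract a manually downloaded archive and list the CSVs it contains.

    Hypothetical helper: the file name is supplied by the caller, so no
    naming convention on the hosting side is baked into the code.
    """
    archive = Path(archive_path)
    if not archive.is_file():
        raise FileNotFoundError(f"No archive found at {archive}")
    if not is_zipfile(archive):
        raise ValueError(f"{archive} is not a zip archive")

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    with ZipFile(archive) as zf:
        zf.extractall(target)
        # Return only the CSV members, sorted for stable output.
        return sorted(n for n in zf.namelist() if n.lower().endswith(".csv"))
```

The trade-off is an extra manual step for the user, in exchange for not breaking each time the hosted file is renamed.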

StuffbyYuki commented 2 months ago

I see the same error, and it looks like the file location changed, as @JSCjr mentioned.

SushiInYourFace commented 1 week ago

I'm also seeing this error. If this isn't important functionality or a priority to maintain, it might be a good idea to just remove it instead of keeping a broken function around.

bdilday commented 6 days ago

Note that there is a proposed fix here: https://github.com/jldbc/pybaseball/pull/435