jldbc / pybaseball

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
MIT License

Lahman functions need to re-download the DB every time #29

Closed bentrevett closed 5 years ago

bentrevett commented 5 years ago

Really like this library, but one thing I don't get is why the Lahman DB needs to be re-downloaded every time you try to use a function that interfaces with it.

I'm proposing something like this:

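import os
import zipfile
from io import BytesIO
import requests

def get_lahman_zip():
    # Assumes url and base_string are already defined at module level.
    if os.path.exists(base_string):
        z = None  # already extracted on disk, nothing to download
    else:
        s = requests.get(url, stream=True)
        z = zipfile.ZipFile(BytesIO(s.content))
    return z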

And then all Lahman interfacing functions can be edited like so:

def parks():
    z = get_lahman_zip()
    f = os.path.join(base_string, "Parks.csv")
    # Read from disk when the DB is already there, otherwise from the in-memory zip.
    data = pd.read_csv(f if z is None else z.open(f), header=0, sep=',', quotechar="'")
    return data

This way you only have to call download_lahman once, and every subsequent call to parks() will just use the downloaded DB.
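One detail worth checking in a setup like this: member names inside a zip archive use forward slashes on every platform, so a path built with os.path.join would likely not match an archive entry on Windows. The sketch below keeps the two path styles separate. It is only an illustration: read_rows, the "exampledb" folder, and the CSV contents are made up, and it uses the standard csv module instead of pandas so it runs without extra dependencies.

```python
import csv
import io
import os
import posixpath
import zipfile


def read_rows(base_dir, filename, z=None):
    """Read a CSV from disk when z is None, otherwise from the open zip archive."""
    if z is None:
        # On-disk path: use the platform's own separator.
        path = os.path.join(base_dir, filename)
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    # Archive path: zip member names always use "/".
    with z.open(posixpath.join(base_dir, filename)) as member:
        text = io.TextIOWrapper(member, encoding="utf-8", newline="")
        return list(csv.DictReader(text))


# Build a tiny archive in memory to stand in for the downloaded DB (made-up data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("exampledb/Parks.csv", "park.key,park.name\nXXX01,Example Park\n")

rows = read_rows("exampledb", "Parks.csv", zipfile.ZipFile(buf))
```

The same split would apply with pd.read_csv: pass the os.path.join path when reading from disk and the forward-slash name when opening a member of the archive.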

This probably isn't the most elegant way to do it, but I think something like this would be a good idea.

Happy to discuss, do the changes myself and file a pull request!

schorrm commented 5 years ago

Great idea. I'd suggest also making it specific to the current version, so that when a new Lahman release comes out it will download the new one even if the old one is already there.
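A minimal sketch of a version-specific check, assuming the extracted DB lives in a folder named after its version. LAHMAN_VERSION, cached_db_dir, and needs_download are hypothetical names used for illustration, and the version string is a placeholder rather than a real release.

```python
import os

# Placeholder version tag (an assumption); the real release name would go here.
LAHMAN_VERSION = "example-2099.1"


def cached_db_dir(cache_root, version=LAHMAN_VERSION):
    """Return the folder where this version of the DB would be unpacked."""
    return os.path.join(cache_root, "lahman-" + version)


def needs_download(cache_root, version=LAHMAN_VERSION):
    """Return True when this exact version has not been unpacked yet."""
    return not os.path.isdir(cached_db_dir(cache_root, version))
```

Because the folder name includes the version, an older copy on disk does not satisfy the check, so bumping the version string is enough to trigger a fresh download.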

jldbc commented 5 years ago

I like it @bentrevett. Want to submit a PR?

Also +1 to @schorrm's comment. Is the naming convention consistent year over year? If so, it would be an improvement to have a safe rule for when the next season's version is expected, and to increment the year used in the URL accordingly.

bentrevett commented 5 years ago

Yep, I'll do it tonight.

I'm not sure the Lahman DB has consistent naming.

From http://www.seanlahman.com/baseball-archive/statistics/ I can see that the 2015 and 2016 versions are named http://seanlahman.com/files/database/baseballdatabank-master_2016-02-29.zip and http://seanlahman.com/files/database/baseballdatabank-master_2016-03-02.zip.

Seems to just be the date they were uploaded (?).

One possible solution is to instead get the data from https://github.com/chadwickbureau/baseballdatabank. This seems to be in the same format as the Lahman DB and is frequently updated.

schorrm commented 5 years ago

What if we keep the current version as a string, and when Lahman updates, just ship the new version string in a package update? A bit clunky, but barring better version management there...

jldbc commented 5 years ago

That's what it's currently doing. I think hardcoding it is fine. I'd rather this stay up and be out of date for a while before we manually update the URL than try to guess next year's naming convention and break it.
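One way to keep the manual update to a single line is to hardcode only the URL and derive the folder name from it. This is a sketch under assumptions: it presumes the archive's top-level folder matches the zip file's name without the extension, LAHMAN_URL is a placeholder rather than a real download link, and base_dir_from_url is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlparse

# Placeholder URL (an assumption); this is the only line to edit on a new release.
LAHMAN_URL = "https://example.com/files/database/exampledb-2099.1.zip"


def base_dir_from_url(url):
    """Return the archive file name without its extension."""
    filename = posixpath.basename(urlparse(url).path)
    stem, _extension = posixpath.splitext(filename)
    return stem


base_string = base_dir_from_url(LAHMAN_URL)
```

If the folder inside the archive turns out to be named differently from the file, hardcoding both values side by side is the safer choice.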

jldbc commented 5 years ago

#41 addresses this. Thanks to both of you for the fix!