rdpharr / project_notes

Notes about my data projects
https://rdpharr.github.io/project_notes/
Apache License 2.0

MLB Training Data | rdpharr’s projects #5

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

MLB Training Data | rdpharr’s projects

Part 2 - Downloading the data and training the model with XGBoost

https://rdpharr.github.io/project_notes/baseball/webscraping/xgboost/brier/accuracy/calibration/machine%20learning/2020/09/21/MLB-Part2-First-Model.html

jackdahms commented 3 years ago

Did you consider accounting for double headers?

jackdahms commented 3 years ago

There are so few double headers I don't think it would make a difference, but I'm curious

rdpharr commented 3 years ago

I've experimented with a few features that target double headers. They are easy to identify in the training data - the last digit of the game_id on baseball-reference.com changes.

The problem comes on the prediction side. It's hard to figure out which game is which when looking at today's games, and the pitching lineup is also sometimes wonky. The same is true when placing bets - it's easy to bet on the wrong game. I've done that about 5 times, so I swore them off.

I'm sure it's all solvable, but I haven't dived deep because it looks like a bunch of custom code for not a lot of games.
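
For anyone who wants to experiment anyway, here is a minimal sketch of pulling a doubleheader flag out of a baseball-reference box score id. The id layout assumed below (home team abbreviation + date + game number, e.g. BOS202109250, where the trailing digit is 0 for a single game and 1 or 2 for the games of a doubleheader) is inferred from the URLs the scraper visits, not taken from the post itself; the example ids are made up.

```python
def doubleheader_features(game_id: str) -> dict:
    """Assumed id format: TEAM + YYYYMMDD + game number, e.g. 'BOS202109250'."""
    game_number = int(game_id[-1])                 # trailing digit of the box score id
    return {
        "is_doubleheader": int(game_number > 0),   # 0 = only game that day
        "doubleheader_game": game_number,          # 1 or 2 within a doubleheader
    }

print(doubleheader_features("BOS202109250"))  # {'is_doubleheader': 0, 'doubleheader_game': 0}
print(doubleheader_features("NYA202107042"))  # {'is_doubleheader': 1, 'doubleheader_game': 2}
```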

rkhal commented 3 years ago

After the encode_me step, all the string columns in the df (e.g., home team abbreviation) become NaN. How can I fix this?
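
This never got an answer in the thread, and the internals of the post's encode_me step aren't shown here, so the following is only a guess at the usual cause: mapping or coercing string columns to numbers in a way that produces NaN for unmatched values. A generic sketch that encodes every remaining string column as pandas category codes instead:

```python
import pandas as pd

df = pd.DataFrame({"home_team_abbr": ["BOS", "NYA", "BOS"], "runs_diff": [3, -2, 1]})

# Encode string columns as integer category codes so XGBoost can consume them.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes   # e.g. BOS -> 0, NYA -> 1

print(df.dtypes)
print(df.head())
```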

rkhal commented 3 years ago

Also interested in how I could use the data in the df to predict over-unders

JayDoubleOh7 commented 2 years ago

I'm getting a call back error when running this section:

df = game_df
df = pd.merge(left=df, right=get_diff_df(batting_df, 'batting'), on='game_id', how='left')
print(df.shape)
df = pd.merge(left=df, right=get_diff_df(pitching_df, 'pitching'), on='game_id', how='left')
print(df.shape)
df = pd.merge(left=df, right=get_diff_df(pitcher_df, 'pitcher', is_pitcher=True), on='game_id', how='left')
df.shape

Any thoughts on what's wrong? I just sent you a Twitter message as well.

domurray commented 2 years ago

Did you ever figure out the call back error? Thank you.

JayDoubleOh7 commented 2 years ago

> Did you ever figure out the call back error? Thank you.

Nope. Never got a response. If you figure it out, please let me know.

Stannis-Analysis commented 2 years ago

Incredible series. Also learning a lot. @JayDoubleOh7, in case you catch this: the issue has to do with .astype(np.timedelta64)) - you need to declare a unit. Adding .astype(np.timedelta64(0,'s'))) should get the code working again. I'm just learning myself, but this should allow for continued study. Good luck everyone!

sbarry0507 commented 1 year ago

I get the same error as JayDoubleOh7: "ValueError: datetime64/timedelta64 must have a unit specified". Please reply and help with how to fix this. I tried @Stannis-Analysis's suggestion but could not get it to work.
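
For later readers: the exact line from the post isn't quoted in this thread, so this is a guess rather than the author's fix. The error means a generic np.timedelta64 dtype was passed to .astype without a unit, which newer NumPy/pandas versions refuse. Passing a unit-qualified dtype string, or converting the timedelta to a plain number of seconds, usually resolves it. A minimal sketch with a made-up date column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2020-09-01", "2020-09-05"])})
delta = df["date"] - df["date"].shift(1)      # timedelta64[ns] Series

# delta.astype(np.timedelta64)                # fails: the dtype has no unit
seconds = delta.astype("timedelta64[s]")      # unit-qualified dtype works
days = delta.dt.total_seconds() / 86400       # or convert to plain floats

print(seconds)
print(days)
```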

reentercaptcha commented 1 year ago

Having a lot of errors here:

import datetime as dt
game_data = []
for link in game_links:
    url = 'https://www.baseball-reference.com' + link
    game_data.append(process_link(url))
    if len(game_data)%1000==0: print(dt.datetime.now().time(), len(game_data))

anyone?

dunaevv commented 1 year ago

> Incredible series. Also learning a lot. @JayDoubleOh7, in case you catch this: the issue has to do with .astype(np.timedelta64)) - you need to declare a unit. Adding .astype(np.timedelta64(0,'s'))) should get the code working again. I'm just learning myself, but this should allow for continued study. Good luck everyone!

I have a TypeError: incompatible index of inserted column with frame index

Kreuter97 commented 12 months ago

Anyone figure out how to get rid of the errors here?

import datetime as dt
game_data = []
for link in game_links:
    url = 'https://www.baseball-reference.com' + link
    game_data.append(process_link(url))
    if len(game_data)%1000==0: print(dt.datetime.now().time(), len(game_data))

itsRabb commented 4 months ago

Still looking to figure out this issue

import datetime as dt
game_data = []
for link in game_links:
    url = 'https://www.baseball-reference.com' + link
    game_data.append(process_link(url))
    if len(game_data)%1000==0: print(dt.datetime.now().time(), len(game_data))

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[62], line 5
      3 for link in game_links:
      4     url = 'https://www.baseball-reference.com/' + link
----> 5     game_data.append(process_link(url))
      6     if len(game_data)%1000==0: print(dt.datetime.now().time(), len(game_data))

Cell In[60], line 45
     41     uncommented_html += h + '\n'
     43 soup = bs(uncommented_html)
     44 data = {
---> 45     'game': get_game_summary(soup, game_id),
     46     'away_batting': get_table_summary(soup, 1),
     47     'home_batting': get_table_summary(soup, 2),
     48     'away_pitching': get_table_summary(soup, 3),
     49     'home_pitching': get_table_summary(soup, 4),
     50     'away_pitchers': get_pitcher_data(soup, 3),
     51     'home_pitchers': get_pitcher_data(soup, 4)
     52 }
     53 return data

Cell In[60], line 7
      5 scorebox = soup.find('div', {'class':'scorebox'})
      6 teams = scorebox.findAll('a', {'itemprop':'name'})
----> 7 game['away_team_abbr'] = teams[0]['href'].split('/')[2]
      8 game['home_team_abbr'] = teams[1]['href'].split('/')[2]
      9 meta = scorebox.find('div', {'class':'scorebox_meta'}).findAll('div')

IndexError: list index out of range
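
No fix is posted in the thread, so here is a guess rather than the author's answer: teams comes back empty when the fetched page has no scorebox at all, which on sports-reference sites is usually a rate-limit or error page instead of a box score (they throttle scrapers to roughly 20 requests per minute). One defensive pattern is to check the response and the scorebox before handing the HTML to the parsing code, and to pause between requests. The function below is illustrative, not from the post; requests and BeautifulSoup are assumed to be installed as in the original scraper.

```python
import time
import requests
from bs4 import BeautifulSoup as bs

def fetch_box_score_html(url, pause_seconds=3.5, max_retries=3):
    """Fetch a box score page, backing off when baseball-reference throttles us."""
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code == 429:           # rate limited: wait a minute, retry
            time.sleep(60)
            continue
        resp.raise_for_status()
        soup = bs(resp.text, "html.parser")
        if soup.find("div", {"class": "scorebox"}) is None:
            raise ValueError(f"No scorebox at {url}; probably not a box score page")
        time.sleep(pause_seconds)              # stay under ~20 requests per minute
        return resp.text
    raise RuntimeError(f"Gave up on {url} after {max_retries} attempts")
```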

highrollerz8 commented 2 months ago

Has anyone figured out this error? I got stuck at a few steps using my local server, so I switched to Colab, and the AI can't even figure this out. Hopefully someone reaches out after seeing this.

import datetime as dt
game_data = []
for link in game_links:
    url = 'https://www.baseball-reference.com' + link
    game_data.append(process_link(url))
    if len(game_data)%1000==0: print(dt.datetime.now().time(), len(game_data))