Open donbowen opened 7 months ago
It's fine to manually download data. It's not the preferred approach, but if it's all that's viable, fine.
I'm still unclear about your plan. Reading between the lines, my guess is that you'll model something like HowMuchWillTheHomeTeamWinBy as the y, and X will be stats about the team, the same stats for the opponent, and maybe the difference between those stats.
You'll need stats for all 30 teams, even though you have gambling info only on 4 teams.
Run Machine learning - X_train is the information given before the game (team stats, team strength, player stats, injuries), and y_train is the results of the games
Ok, good, but there aren't enough details about this process. Your splitting method should probably be: drop the last 3 weeks of games (lots of resting and injuries and tanking and shenanigans), and the holdout is the prior month. The months before are training.
The CV method must NOT be k-fold; you need to use a time-series style of splitting, so that each fold trains only on games played before the games it is validated on.
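A minimal sketch of that date-based split, using only the standard library. The function name `split_by_date`, the 21-day and 30-day windows, and the toy schedule are assumptions for illustration, not part of the proposal.

```python
from datetime import date, timedelta

def split_by_date(games, drop_days=21, holdout_days=30):
    """Split games (dicts with a 'date' key) into train and holdout sets.

    Games in the final `drop_days` of the season are discarded, the
    `holdout_days` before that form the holdout, and everything earlier
    is training data.
    """
    season_end = max(g["date"] for g in games)
    drop_start = season_end - timedelta(days=drop_days)      # games after this are dropped
    holdout_start = drop_start - timedelta(days=holdout_days)

    train = [g for g in games if g["date"] <= holdout_start]
    holdout = [g for g in games if holdout_start < g["date"] <= drop_start]
    return train, holdout

# Toy schedule: one game per day for 100 days (made-up dates).
start = date(2023, 1, 1)
games = [{"date": start + timedelta(days=i)} for i in range(100)]
train, holdout = split_by_date(games)
print(len(train), len(holdout))
```

The cutoffs are computed from the last game date in the data, so the same function can be reused per season.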
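One way to picture a time-series split is an expanding window: each fold trains on all earlier games and validates on the next block. The sketch below is plain Python to show the idea; `expanding_window_splits` is a hypothetical helper, and scikit-learn's `TimeSeriesSplit` provides a similar splitter if you'd rather not roll your own.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for chronologically sorted data.

    Each fold trains on everything before the validation block, so the
    model is never fit on games played after the ones it is scored on.
    """
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover rows.
        val_end = train_end + fold_size if k < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, val_end))

splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, val_idx in splits:
    print(train_idx, val_idx)
```

This assumes the rows are already sorted by game date; with regular k-fold, later games would leak into the training folds.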
Scoring... you don't explain this rigorously. Maybe you have a gambling rule like: if HowMuchWillTheHomeTeamWinBy > line, bet on the home team, else bet on the road team. Then it's as simple as computing the returns on the chosen bet at the available odds.
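A rough sketch of that scoring rule. The conventions here are assumptions: `line` is the margin the home team must win by to cover, odds are decimal odds (1.91 is roughly -110 American), a push returns the stake, and the three games are made up.

```python
def bet_profit(pred_margin, line, actual_margin, home_odds, road_odds, stake=1.0):
    """Profit on one game under the rule: bet home if pred_margin > line, else road."""
    bet_home = pred_margin > line
    if actual_margin == line:          # push: stake returned, no profit or loss
        return 0.0
    home_covers = actual_margin > line
    won = (bet_home == home_covers)
    odds = home_odds if bet_home else road_odds
    return stake * (odds - 1.0) if won else -stake

# Made-up games: (predicted margin, line, actual margin, home odds, road odds)
games = [
    (7.0, 4.5, 10, 1.91, 1.91),    # bet home, home covers -> win
    (2.0, 4.5, 6, 1.91, 1.91),     # bet road, home covers -> loss
    (-3.0, -1.5, -8, 1.91, 1.91),  # bet road, road covers -> win
]
profits = [bet_profit(*g) for g in games]
avg_return = sum(profits) / len(games)   # average profit per unit staked
print(profits, avg_return)
```

Computing `avg_return` on the holdout games gives a score in the units you actually care about, rather than a generic fit statistic.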
@elvinlee12 @michaelparker7 @Brandon4106
Cool idea! Pretty solid write-up, but let's focus the revision on clarifying the plan. There is a lot for you to think about. For example, you say "Observations: Each of the 30 NBA teams and each of their respective props" ... this implies a dataset with 30 rows. That's not what you mean!
Do you mean season long bets or game bets? (Pick one, for simplicity. Season long bets... let's avoid those for now! I can explain why if desired.) If the latter - game bets, which I think you mean, then you'll want "team-game" observations, so your dataset will be 30 × 82 × (# of seasons) rows long.
Gambling data: You can scrape game outcomes and lines here: https://www.oddsportal.com/basketball/usa/nba-2022-2023/results/
Predictor variables: For each game, you need stats on the team that are created before the game. You can't just grab season long stats and merge them with the gambling data!!
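To make the "created before the game" point concrete, here is a small sketch that builds a running average from prior games only. The field names (`date`, `team`, `points`) and the sample scores are hypothetical; the same pattern applies to any stat.

```python
from collections import defaultdict

def add_pregame_avg(games):
    """Attach each team's average points from games played BEFORE this one.

    A team's first game gets None because no prior information exists.
    """
    totals = defaultdict(lambda: [0, 0])   # team -> [points so far, games so far]
    rows = []
    for g in sorted(games, key=lambda g: g["date"]):
        pts, n = totals[g["team"]]
        rows.append({**g, "pregame_avg_points": pts / n if n else None})
        # Update only after recording the feature, so tonight's score never leaks in.
        totals[g["team"]][0] += g["points"]
        totals[g["team"]][1] += 1
    return rows

games = [
    {"date": "2023-01-01", "team": "BOS", "points": 110},
    {"date": "2023-01-03", "team": "BOS", "points": 120},
    {"date": "2023-01-05", "team": "BOS", "points": 100},
]
rows = add_pregame_avg(games)
print([r["pregame_avg_points"] for r in rows])
```

A season-long average merged onto every game would include the result of the game you are trying to predict, which is exactly the look-ahead problem.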
Try some fancy ML models like XGBoost - these allow for complicated interactions of variables automatically. Like: The road team traveled last night and their best starter is out. The home team is 15% stronger by Elo rating, but their top 2 players are missing.
You probably aren't maximizing R2. If your modeling focuses on game bets relative to the spread, accuracy and R2 are going to be roughly equivalent. 55%+ on the spread means you are elite enough to turn a profit. But if you focus on moneyline, then you'll want to focus on profits! Not just getting the calls right.
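A quick worked example of why accuracy is enough on the spread but not on the moneyline. It assumes the common -110 pricing on spread bets (an assumption, check the odds you actually scrape); the function names are illustrative.

```python
def breakeven_win_rate(american_odds):
    """Win probability at which a bet at the given American odds breaks even."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def expected_profit(win_prob, american_odds, stake=1.0):
    """Expected profit per bet given a true win probability."""
    if american_odds < 0:
        payout = stake * 100 / -american_odds
    else:
        payout = stake * american_odds / 100
    return win_prob * payout - (1 - win_prob) * stake

spread_breakeven = breakeven_win_rate(-110)    # 110 / 210, about 0.524
edge_at_55 = expected_profit(0.55, -110)       # 0.55 * (100 / 110) - 0.45
underdog_breakeven = breakeven_win_rate(200)   # 100 / 300, about 0.333
print(spread_breakeven, edge_at_55, underdog_breakeven)
```

On the spread, every bet has about the same price, so hit rate maps directly to profit. On the moneyline, the breakeven rate changes with each game's odds, so a model can be right often and still lose money.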
The above is enough for a project. Examining player props is another thing entirely, so avoid it for now. However, the market for game over/under and totals is very competitive and close to efficient. The player prop market likely has more profit opportunities. You'd want to model things like expected minutes played and other things to feed into a player prop model. The main issue for getting rich on player props is that you can't bet much into these markets.