Feedback on proposal - Githubissues

@elvinlee12 @michaelparker7 @Brandon4106

Cool idea! Pretty solid write up but let's focus the revision on clarifying the plan. There is lots for you to think about. For example, you say "Observations: Each of the 30 NBA teams and each of their respective props" ... this implies a dataset with 30 rows. That's not what you mean!

Do you mean season-long bets or game bets? (Pick one, for simplicity. Season-long bets... let's avoid those for now! I can explain why if desired.) If the latter (game bets, which I think you mean), then you'll want "team-game" observations, so your dataset will be 30 × 82 × (# of seasons) rows long.
Gambling data: You can scrape game outcomes and lines here: https://www.oddsportal.com/basketball/usa/nba-2022-2023/results/
- That's just one option. I'm not sure if those are opening or closing lines. (Closing?)
- Not sure what this is: https://www.sportsbookreviewsonline.com/scoresoddsarchives/nba-odds-2022-23/
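Predictor variables: For each game, you need stats on each team that are computed using only information available before the game. You can't just grab season-long stats and merge them with the gambling data!! (Season-long stats include games that hadn't been played yet on gameday.)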
- Getting info on the refs for each game, and injuries to players should be helpful for your goals
- You'll want variables like "team strength as of gameday" (there are many versions of this and you can specify your own, like a running tally of a team's net scoring margin or an ELO rating; these can be made from publicly available data)
- Some ideas for injury variables, from basic to better: "how many players hurt" "how many starters hurt" "PPG of players missing"
- How many days since each team's last game? How many miles did the team travel last night?
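As a rough sketch of what "created before the game" means in practice, the snippet below builds team-game rows where every predictor uses only earlier games. The game log, team codes, and field names (`avg_margin_before`, `days_rest`) are made up for illustration; a real version would read the scraped data and add ELO, injuries, travel, and so on.

```python
from collections import defaultdict
from datetime import date

# Hypothetical game log, sorted by date: (date, home, away, home_pts, away_pts)
games = [
    (date(2023, 1, 1), "BOS", "NYK", 110, 100),
    (date(2023, 1, 2), "NYK", "MIA", 95, 105),
    (date(2023, 1, 4), "BOS", "MIA", 101, 99),
]

def pregame_features(games):
    """Build one row per team-game using only information from earlier games."""
    margin_sum = defaultdict(int)    # running total of net scoring margin
    games_played = defaultdict(int)  # games completed so far
    last_game = {}                   # team -> date of most recent game
    rows = []
    for day, home, away, home_pts, away_pts in games:
        for team, opp, pts, opp_pts, is_home in (
            (home, away, home_pts, away_pts, True),
            (away, home, away_pts, home_pts, False),
        ):
            n = games_played[team]
            rows.append({
                "date": day, "team": team, "opp": opp, "is_home": is_home,
                # average margin BEFORE today; None for a team's first game
                "avg_margin_before": margin_sum[team] / n if n else None,
                "days_rest": (day - last_game[team]).days if team in last_game else None,
                "margin": pts - opp_pts,  # outcome (y), never a predictor
            })
        # update the running state only AFTER both rows are recorded
        for team, diff in ((home, home_pts - away_pts), (away, away_pts - home_pts)):
            margin_sum[team] += diff
            games_played[team] += 1
            last_game[team] = day
    return rows

rows = pregame_features(games)
```

The key design choice is the order of operations: record the row first, update the running tallies second. Reversing that order would leak the game's own result into its predictors.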
Try some fancy ML models like XGBoost - these allow for complicated interactions of variables automatically. Like: The road team traveled last night and their best starter is out. The home team is 15% stronger by ELO rating but their top 2 players are missing.
You probably aren't maximizing R2. If your modeling focuses on game bets relative to the spread, accuracy and R2 are going to be roughly equivalent. 55%+ on the spread means you are elite enough to turn a profit. But if you focus on moneyline, then you'll want to focus on profits! Not just getting the calls right.
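To make the accuracy-versus-profit point concrete, here is a small sketch of the arithmetic. It assumes American odds and a typical spread price of -110 (an assumption; check the prices in whatever data you scrape). The function names are made up for illustration.

```python
def profit_per_unit(american_odds):
    """Profit on a winning 1-unit stake at American odds (e.g. -110 or +150)."""
    if american_odds > 0:
        return american_odds / 100
    return 100 / -american_odds

def break_even_prob(american_odds):
    """Win probability at which a bet at these odds has zero expected profit."""
    return 1 / (1 + profit_per_unit(american_odds))

def expected_profit(win_prob, american_odds):
    """Expected profit per unit staked, ignoring pushes."""
    return win_prob * profit_per_unit(american_odds) - (1 - win_prob)

print(round(break_even_prob(-110), 4))        # 0.5238
print(round(expected_profit(0.55, -110), 4))  # 0.05
print(round(expected_profit(0.70, -300), 4))  # -0.0667
```

The last line is the moneyline trap: picking -300 favorites and being right 70% of the time still loses money, because the odds already price in the favorite.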

The above is enough for a project. Examining player props is another thing entirely, so avoid it for now. However, the market for game over/under and totals is very competitive and close to efficient. The player prop market likely has more profit opportunities. You'd want to model things like expected minutes played and other things to feed into a player prop model. The main issue for getting rich on player props is that you can't bet much into these markets.

It's fine to manually download data. It's not the preferred approach, but if it's all that's viable, fine.

I'm still unclear about your plan. I think, but I have to read between the lines and guess, that you'll model something like HowMuchWillTheHomeTeamWinBy as the y, and X is all these stats about the team and the same stats for the opponent and maybe the difference in those stats.
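If that guess is right, one row of the modeling table might look like the sketch below. The stat names (`elo`, `days_rest`) and the `home_`/`away_`/`diff_` prefixes are placeholders, not something from your proposal.

```python
def game_row(home_stats, away_stats, home_margin):
    """Combine pregame stats for both teams into one model row."""
    row = {}
    for name in home_stats:
        row[f"home_{name}"] = home_stats[name]
        row[f"away_{name}"] = away_stats[name]
        row[f"diff_{name}"] = home_stats[name] - away_stats[name]
    row["y_home_margin"] = home_margin  # the target: how much the home team won by
    return row

# Hypothetical pregame stats for a single game
row = game_row(
    home_stats={"elo": 1600, "days_rest": 2},
    away_stats={"elo": 1500, "days_rest": 1},
    home_margin=7,
)
```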

You'll need stats for all 30 teams, even though you have gambling info only on 4 teams.

> Run Machine learning - X_train is the information given before the game (team stats, team strength, player stats, injuries), and y_train is the results of the games

Ok, good. But there are not enough details about this process. Your splitting method should probably be: drop the last 3 weeks of games (lots of resting and injuries and tanking and shenanigans), and the holdout is the prior month. The months before are training.
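A minimal sketch of that split, assuming each row carries a `date` and treating "3 weeks" as 21 days and "the prior month" as 30 days (both are approximations you can change). The season end date below is a placeholder.

```python
from datetime import date, timedelta

def chronological_split(rows, season_end, drop_days=21, holdout_days=30):
    """Split rows (dicts with a 'date' key) into train and holdout by date.

    Rows dated on or after season_end - drop_days are discarded; the
    holdout_days before that form the holdout; everything earlier trains.
    """
    drop_start = season_end - timedelta(days=drop_days)
    holdout_start = drop_start - timedelta(days=holdout_days)
    train = [r for r in rows if r["date"] < holdout_start]
    holdout = [r for r in rows if holdout_start <= r["date"] < drop_start]
    return train, holdout

# Hypothetical example: one row per day over the last 100 days of a season
season_end = date(2023, 4, 10)  # placeholder date
rows = [{"date": season_end - timedelta(days=d)} for d in range(100)]
train, holdout = chronological_split(rows, season_end)
print(len(train), len(holdout))
```

With multiple seasons, you would apply the same cut within each season (or only hold out the most recent one).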

The CV method will need to be NOT k-fold. You need to use a time-series style of splitting, where each fold trains only on games played before the games it is scored on.
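To illustrate the idea, here is a hand-rolled expanding-window splitter. It is only a sketch: it assumes rows are already sorted by date, and the function name is made up. scikit-learn's `TimeSeriesSplit` does something similar in spirit and plugs into its CV tools directly.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for rows already sorted by date.

    Each fold trains on everything before its test block, so the model is
    never scored on games that precede its training data.
    """
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = k * fold
        # the last fold absorbs any leftover rows
        test_end = n_samples if k == n_splits else train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

In real data, cut the folds at date boundaries rather than row counts, so that two games from the same night never land on opposite sides of a split.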

Scoring... you don't explain this, not rigorously. Maybe you have a gambling rule like, if HowMuchWillTheHomeTeamWinBy > line, bet on the home team, else bet road. Then it's as simple as computing the returns on the chosen bet at the available odds.
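Here is a sketch of that scoring rule, under some assumptions you should check against your data: 1 unit staked on every game, both sides priced at -110, and `line` defined as the number of points the home team must win by to cover (so a home favorite has a positive line here). The example games are invented.

```python
def spread_bet_returns(games, odds=-110):
    """Apply: bet home if predicted margin > line, else bet road.

    Each game is (predicted_margin, line, actual_margin), all from the
    home team's point of view. Returns the profit per game on a 1-unit stake.
    """
    win_profit = 100 / -odds if odds < 0 else odds / 100
    profits = []
    for predicted, line, actual in games:
        bet_home = predicted > line
        if actual == line:
            profits.append(0.0)         # push: stake returned
        elif (actual > line) == bet_home:
            profits.append(win_profit)  # the side we bet covered
        else:
            profits.append(-1.0)        # the side we bet did not cover
    return profits

# Hypothetical (predicted, line, actual) triples
games = [(7.0, 4.5, 10), (1.0, 3.0, 5), (-2.0, -6.5, -3), (4.0, 3.0, 3)]
profits = spread_bet_returns(games)
roi = sum(profits) / len(profits)  # average profit per unit staked
```

Computed on the holdout games, `roi` is the kind of number the proposal should report, alongside the share of bets that covered.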

michaelparker7 / FIN-377-Final-Project

Feedback on proposal #1