kyleskom / NBA-Machine-Learning-Sports-Betting

NBA sports betting using machine learning
1.12k stars 422 forks

Data Leakage #249

Open nova-land opened 1 year ago

nova-land commented 1 year ago

Using tf.keras.utils.normalize gives invalid test results, because it normalises the whole dataset (train and test together) and so leaks test-set information into training.

An evaluation script is required to verify the real accuracy of the model.

kyleskom commented 1 year ago

I don't understand what the issue here is.

chriseling commented 1 year ago

I think the worry is that normalize is applied to the whole data set, which could inflate the model's measured performance because the validation data is normalized together with the training data. Best, Chris


kyleskom commented 1 year ago

I'll take a look when I revisit this next season.

kyleskom commented 9 months ago

Hi, I'm looking for more info on what the potential fix for this would be. Thank you.

nova-land commented 9 months ago

You need to separate train and test data before normalizing; tf.keras.utils.normalize applied to everything at once does not do that. The usual approach is a scikit-learn scaler: fit it on the training data only, then use it to transform both the train and the test data.
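The fit-on-train pattern described above can be sketched with scikit-learn's MinMaxScaler (the variable names and numbers here are illustrative, not from the repo):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrices; the split must happen BEFORE scaling
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[40.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)  # learn min/max from the training data only

X_train_scaled = scaler.transform(X_train)  # [[0.0], [0.5], [1.0]]
X_test_scaled = scaler.transform(X_test)    # [[1.5]], may fall outside [0, 1]
```

A test value landing outside [0, 1] is expected and harmless; fitting on the combined data instead would silently fold test-set statistics into the training distribution.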

STRATZ-Ken commented 8 months ago

I am not sure I agree with @nova-land. The idea of normalize is to put the entire dataset on the same scale. Imagine a dataset with values [3, 1, 0.50] and you normalize it: it would change to [1, 0.33, 0.165]. If your next dataset has a higher value, everything would be rescaled relative to the new highest value in the column.
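The per-column max-scaling this comment describes can be checked in a couple of lines (the numbers are the comment's own example):

```python
import numpy as np

col = np.array([3.0, 1.0, 0.5])
scaled = col / col.max()  # divide every value by the column maximum
# scaled is approximately [1.0, 0.333, 0.167]
```

Which also illustrates the concern: if a later batch contains a higher maximum, every previously scaled value changes meaning.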

There are Keras layers that normalize the data inside the model itself, which would remove the need to call this function at all. Or you can normalize the data as it comes in by setting fixed max values. For example, if a player scores 56 points but your goal is to predict a score from 0 to 50 (you're force-normalizing here), then the max he can score is capped at 50. Just an example.

I am not an expert here, but you have to make sure this code runs in your training pipeline. Then, when you're ready to predict, you load the saved scaler and send the prediction inputs through the same normalization as well.

import os
import joblib

# Save the fitted scaler once, so prediction can apply the identical transform
if not os.path.exists(model_dir + '/scaler.pkl'):
    joblib.dump(min_max_scaler, model_dir + '/scaler.pkl')
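On the prediction side, the counterpart would be to reload that same pickle and transform incoming rows with it. A self-contained sketch (model_dir, min_max_scaler, and X_raw are stand-ins mirroring the snippet above, not names from the repo):

```python
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in training-time setup so the sketch runs on its own
model_dir = tempfile.mkdtemp()
min_max_scaler = MinMaxScaler().fit(np.array([[0.0], [50.0]]))
joblib.dump(min_max_scaler, model_dir + '/scaler.pkl')

# Prediction time: reuse the training-time scaler, never refit it
scaler = joblib.load(model_dir + '/scaler.pkl')
X_raw = np.array([[25.0]])        # hypothetical incoming features
X_pred = scaler.transform(X_raw)  # [[0.5]]
```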
STRATZ-Ken commented 8 months ago

Here is information on the normalization layer. You would add this before your first Dense layer; it normalizes the incoming data and stores its statistics inside the model file itself. Then you would not have to change the data or call MinMax normalize separately.

https://keras.io/api/layers/normalization_layers/batch_normalization/

Also worth noting, this is for the NN model, not XGBoost.

Gxent commented 8 months ago

But then which would be better, the XGBoost or the NN model?

STRATZ-Ken commented 8 months ago

"Better" is not a good word to use for models at all. There are a million factors; that question cannot be answered.

Gxent commented 8 months ago

Okay, put another way: which probability would be closest? I made $2,000 in two weeks via XGBoost with just a $10 stake at the end of the season in May, and I didn't pay attention to the NN model...

Gxent commented 8 months ago

So I always relied on the over/under.

cafeTechne commented 6 months ago

Okay, put another way: which probability would be closest? I made $2,000 in two weeks via XGBoost with just a $10 stake at the end of the season in May, and I didn't pay attention to the NN model...

How's this working out for you now?

Gxent commented 6 months ago

this year wasn't so good

cafeTechne commented 6 months ago

this year wasn't so good

So you're not seeing 55% win rates with this strategy?