skorch-dev / skorch

A scikit-learn compatible neural network library that wraps PyTorch
BSD 3-Clause "New" or "Revised" License

Skorch inference in cross validation #1062

Open faridehm opened 1 month ago

faridehm commented 1 month ago

Hello everybody, I'm using cross validation (CV) for a classification problem. I split my data into train and test sets; I used the train set for the CV model and the test data for the inference step. My code for running CV and early stopping at the same time is:

import torch
from torch import nn
from skorch import NeuralNetClassifier
from skorch.callbacks import EarlyStopping

# SimpleNN, pos_weight and device are defined elsewhere
net = NeuralNetClassifier(
    module=SimpleNN,
    max_epochs=300,
    lr=0.001,
    train_split=False,
    # train_split=predefined_split(valid_ds)
    # module__dropout=0.2,
    iterator_train__batch_size=10,
    iterator_train__shuffle=True,
    iterator_valid__batch_size=10,
    iterator_valid__shuffle=False,
    criterion=nn.BCEWithLogitsLoss(weight=pos_weight),
    optimizer=torch.optim.AdamW,
    optimizer__weight_decay=0.01,
    callbacks=[EarlyStopping(patience=5, monitor='train_loss')],
    device=device,
)
# Train the model
print('Using...', device)
print("Training started...")
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate

scoring = {'prec_macro': 'precision_macro',
           'rec_macro': make_scorer(recall_score, average='macro')}
scores = cross_validate(net, X_train.to(torch.float64), y_train, scoring='accuracy',
                        return_train_score=True, cv=5, error_score='raise')
sorted(scores.keys())
scores
print("Training completed!")

Now when I try to save the model with the following code, it returns the error "NotInitializedError: Cannot save state of an un-initialized model. Please initialize first by calling .initialize() or by fitting the model with .fit(...)".

net.save_params(
    f_params='model.pkl', f_optimizer='opt.pkl', f_history='history.json')
new_net = NeuralNetClassifier(
    module=SimpleNN,
    max_epochs=300,
    lr=0.001,
    train_split=False,
    # train_split=predefined_split(valid_ds)
    # module__dropout=0.2,
    # train_split=predefined_split(dataset_valid),
    iterator_train__batch_size=10,
    iterator_train__shuffle=True,
    iterator_valid__batch_size=10,
    iterator_valid__shuffle=False,
    criterion=nn.BCEWithLogitsLoss(weight=pos_weight),
    optimizer=torch.optim.AdamW,
    optimizer__weight_decay=0.01,
    callbacks=[EarlyStopping(patience=5, monitor='train_loss')],
    device=device,
)

new_net.initialize() # This is important!
new_net.load_params(
    f_params='model.pkl', f_optimizer='opt.pkl', f_history='history.json')

new_net.fit(np.array(X_test, dtype=float), y_test) 

Is this code reliable for CV? I want to save 5 separate models for the 5 CV folds, but I could not find any related documentation. Any advice is appreciated.

BenjaminBossan commented 1 month ago

Okay, so if I understand your question correctly, you have run a grid search and determined the best hyper-parameters. Now you would like to take those best parameters and train a new net with these values.

One way to achieve this is to simply redefine the net and pass the parameters as returned by grid_search.best_params_. You can also do net.set_params(**grid_search.best_params_).

However, it appears that you also try to load the net params (not the hyper-params) of the best model. This should not be done. Instead, create a new net and fit it on the whole training data (never fit on test data!). In simplified code, this would look something like this:

# setup
X_train, y_train, X_test, y_test = ...
hyper_params = {...}
net = NeuralNetClassifier(...)

# perform grid search
grid_search = GridSearchCV(net, hyper_params, scoring=scoring, ...)
grid_search.fit(X_train, y_train)

# apply best hyper-params to the net
print("Applying best hyper-parameters:", grid_search.best_params_)
net.set_params(**grid_search.best_params_)

# train the net with the best hyper-params using training data
net.fit(X_train, y_train)

# now evaluate the model on the test data
y_pred = net.predict(X_test)
from sklearn.metrics import precision_score
prec_macro = precision_score(y_test, y_pred, average="macro")
...
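
Regarding the NotInitializedError from your first snippet: cross_validate (like GridSearchCV) fits clones of the estimator, so the net instance you pass in is never fitted or initialized itself, which is why save_params fails on it. A minimal sketch of saving and restoring a net that you fitted yourself (file names as in your post):

# fit the net yourself, then saving works
net.fit(X_train, y_train)
net.save_params(f_params='model.pkl', f_optimizer='opt.pkl', f_history='history.json')

# to restore, build a net with the same architecture, initialize it, then load
new_net = NeuralNetClassifier(module=SimpleNN, ...)  # same settings as before
new_net.initialize()
new_net.load_params(f_params='model.pkl', f_optimizer='opt.pkl', f_history='history.json')

# use the restored net for inference only; do not call fit on the test data
y_pred = new_net.predict(X_test)
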
faridehm commented 1 month ago

Okay, so if I understand your question correctly, you have run a grid search and determined the best hyper-parameters. [...] Instead, create a new net and fit it on the whole training data (never fit on test data!).

Many thanks for your tips. If I understand correctly, cross validation is mainly used for tuning the best hyper-parameters; its use for preventing overfitting is less common, and how to use the final model on new data is not well addressed, since the model with the best parameters on the train data may not be the best on new data. I tried 5 different models for 5-fold CV in PyTorch; all accuracies on the training data were above 70%, but some of these models had accuracy under 50% and one of them was at 32% on new data. Of course, for the final accuracy we select the best model on new data. So is it enough to save the best model from cross validation on the train data, or do we need to save all the models?

BenjaminBossan commented 1 month ago

Just to be clear, grid search (and similar methods like randomized search) is intended to figure out the best hyper-parameters, which I think is what you intend to do here. It does this by trying out a bunch of different sets of hyper-parameters and, for each set, running a cross validation (which you could do manually with cross_validate). How well a model performs is determined on the cv splits, where the training data is split into train and validation sets multiple times per cross validation.

Preventing overfitting is not necessarily a goal in and of itself. E.g. you could use a super small model with only 1 parameter, which will probably not overfit, but this model will also be very bad overall. What you most likely want is to have a model that works really well on the real world data. This model could be overfitting, but do you really care that much if it still works really well? For many problems, you can basically ignore the training scores and just look at the validation scores that GridSearchCV or cross_validate return. If the model overfits, it's more of an indicator that you could further improve it, say, by lowering the learning rate. I would prefer a model that overfits but has a good validation score to a model that does not overfit but has a bad validation score.
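
To make the train vs. validation comparison concrete, here is a small sketch based on the cross_validate call from your first snippet (return_train_score=True adds the train scores to the result):

# compare mean train and validation accuracy across the 5 folds
scores = cross_validate(net, X_train, y_train, cv=5, scoring='accuracy', return_train_score=True)
print("mean train accuracy:     ", scores['train_score'].mean())
print("mean validation accuracy:", scores['test_score'].mean())  # 'test_score' = held-out validation folds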

I tried 5 different models for 5-fold CV in PyTorch; all accuracies on the training data were above 70%, but some of these models had accuracy under 50% and one of them was at 32% on new data.

Is it the accuracy on the training data? Or on the validation data? It's important to be precise here. Remember that sklearn will generally report scores on validation data when calling GridSearchCV or cross_validate.

Moreover, it sounds like you have a separate test set that you use to check the final model, right? Of course, this is a good thing. However, if you find that the validation score and the test score are very different, this is a bad sign. There could be a couple of reasons why they differ: your dataset size could be too small, the split between train/validation/test might not be random or representative, leading to bias, there could be leakage or duplicates in the data, etc. Before you proceed, you should figure out why the validation scores and test scores are so different.

Of course, for the final accuracy we select the best model on new data.

Just make sure that you don't select the model based on the best score on the test data, as this leads to overfitting on the test data.

So is it enough to save the best model from cross validation on the train data, or do we need to save all the models?

This is a misunderstanding. Don't use any of the models that are trained for cross validation or grid search. These models are only there to get an evaluation of how well the model works. Once you have that, train a model with the best hyper-parameters that you determined earlier using the training data, and finally evaluate it on the test set.

faridehm commented 1 month ago

Is it the accuracy on the training data? Or on the validation data?

Thanks for your reply; yes, it's the validation accuracy.

faridehm commented 1 month ago

However, if you find that the validation score and the test score are very different, this is a bad sign. There could be a couple of reasons why they differ: your dataset size could be too small, the split between train/validation/test might not be random or representative, leading to bias, there could be leakage or duplicates in the data, etc. Before you proceed, you should figure out why the validation scores and test scores are so different.

My sample size is small, a total of 616: 437 for cross validation and 179 for testing the final model. I also have duplicate data, but I split the data based on ID, and of course I tried the model with the duplicates removed; the results are not different. The train and validation accuracy are different because the classes are imbalanced; I handled that through the loss function, but the test accuracy is still different.

faridehm commented 1 month ago

Just make sure that you don't select the model based on the best score on the test data, as this leads to overfitting on the test data.

Why not? Which model should be selected?

faridehm commented 1 month ago

Don't use any of the models that are trained for cross validation or grid search. These models are only there to get an evaluation of how well the model works. Once you have that, train a model with the best hyper-parameters that you determined earlier using the training data, and finally evaluate it on the test set.

If I understand your recommendation correctly, you mean that model selection should be done after testing the models on the final test data?

BenjaminBossan commented 1 month ago

My sample size is small, a total of 616: 437 for cross validation and 179 for testing the final model.

This is indeed very small and I assume you cannot easily get more data. You could try increasing the number of cv splits to make the results more robust.

The train and validation accuracy are different because the classes are imbalanced

If you have control over the splits, you could try using a stratified split to reduce class imbalance.
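
For example, a stratified 5-fold splitter can be passed directly as the cv argument (a sketch, assuming y_train holds the class labels):

from sklearn.model_selection import StratifiedKFold, cross_validate

# keep the class proportions roughly the same in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(net, X_train, y_train, cv=skf, scoring='accuracy')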

Why not? Which model should be selected?

Select the model based on the best validation score (i.e. data split off from the train dataset). If you use the test data for model selection, it effectively becomes validation data and thus your model will not generalize as well as it could.

If I understand your recommendation correctly, you mean that model selection should be done after testing the models on the final test data?

The steps would be as follows:

  1. Split train and test data (if the split is not already predefined)
  2. Use the train data to perform a grid search (or randomized search or other type of hyper-parameter search)
  3. The grid search will split train data into train and validation data and report the validation scores for each set of hyper-parameters and each split
  4. Pick the set of hyper-parameters that has the best validation score averaged across all splits
  5. Train your model on the whole training data using these hyper-parameters
  6. Calculate the test score of this trained model based on the test data

Don't make further adjustments to the model based on the test scores or else you will overfit on the test set.
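
Put together, the whole procedure might look like this minimal sketch (the module, the hyper-parameter grid, and the test-set fraction are placeholders, not a recommendation):

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score
from skorch import NeuralNetClassifier

# 1. split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# 2.-4. grid search with cross validation on the training data only
net = NeuralNetClassifier(SimpleNN, max_epochs=300, lr=0.001, train_split=False, verbose=0)
param_grid = {'lr': [1e-3, 1e-4], 'optimizer__weight_decay': [0.0, 0.01]}  # example grid
gs = GridSearchCV(net, param_grid, cv=StratifiedKFold(n_splits=5), scoring='accuracy')
gs.fit(X_train, y_train)
print("best hyper-parameters:", gs.best_params_)

# 5. train a fresh net on the whole training data with the best hyper-parameters
net.set_params(**gs.best_params_)
net.fit(X_train, y_train)

# 6. evaluate exactly once on the held-out test data
test_acc = accuracy_score(y_test, net.predict(X_test))
print("test accuracy:", test_acc)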

faridehm commented 1 month ago

Thanks again for your tips and clarification.