shaivimalik / medicine_preprocessing-on-entire-dataset

Reproducing "Characterization of Term and Preterm Deliveries using Electrohysterograms Signatures"
MIT License

Hyperparameter values(SVC) for toy example - 02.ipynb / adult income dataset notebook #5

Closed shaivimalik closed 1 month ago

shaivimalik commented 3 months ago

The hyperparameter values for SVC in 02.ipynb were obtained using GridSearchCV.

The code snippets given below can be used to validate the findings:

For Training SVM - with Data Leakage:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define parameters for grid search
gamma_range = np.logspace(-5, 4, 10)
C_range = np.logspace(-5, 4, 10)
parameters = {'C': C_range, 'gamma': gamma_range}

# Initialize SVM model
svc = SVC(kernel='rbf', random_state=15)

# Define GridSearchCV with 10-fold cross-validation, scored by accuracy
clf = GridSearchCV(svc, parameters, cv=10, scoring='accuracy')

# Perform grid search
clf.fit(X_train, y_train)

# Print results
print("Accuracy:", clf.best_score_)
print("Best hyperparameters:", clf.best_params_)
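
The snippet above depends on `X_train` and `y_train` from the notebook. As a minimal sketch of the same search that runs on its own, the example below substitutes synthetic data from `make_classification` and a coarser grid; `X_demo`, `y_demo`, and the 5-fold setting are assumptions made for illustration and are not the notebook's values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=15)

# The full range from the snippet above: ten powers of ten from 1e-5 to 1e4
full_range = np.logspace(-5, 4, 10)

# A coarser grid (every third value) so the sketch runs quickly
parameters = {'C': full_range[::3], 'gamma': full_range[::3]}

clf = GridSearchCV(SVC(kernel='rbf', random_state=15), parameters, cv=5, scoring='accuracy')
clf.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy of the best combination
print("Accuracy:", clf.best_score_)
print("Best hyperparameters:", clf.best_params_)

# cv_results_ holds one entry per (C, gamma) combination that was tried
n_candidates = len(clf.cv_results_['params'])
print("Candidates evaluated:", n_candidates)
```

With the full 10 x 10 grid and `cv=10`, the same call fits 1000 models plus one final refit, which is why the sketch thins the grid.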

For Training SVM - without Data Leakage:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define parameters for grid search
gamma_range = np.logspace(start=-5, stop=4, num=10, base=10)
C_range = np.logspace(start=-5, stop=4, num=10, base=10)
parameters = {'C': C_range, 'gamma': gamma_range}

# Initialize SVM model
svc = SVC(kernel='rbf', random_state=15)

# Define GridSearchCV with 10-fold cross-validation, scored by accuracy
clf = GridSearchCV(svc, parameters, cv=10, scoring='accuracy')

# Perform grid search
clf.fit(X_train_oversamp, y_train_oversamp)

# Print results
print("Accuracy:", clf.best_score_)
print("Best hyperparameters:", clf.best_params_)

shaivimalik commented 1 month ago

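GridSearchCV splits whatever data it is given, so running it on the already oversampled training set lets information from the training folds leak into the validation folds. The modified approach below finds the hyperparameters for Training SVM - without Data Leakage by fitting the scaler and the oversampler inside each fold: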

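import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# SMOTE, X_train and y_train are assumed to come from earlier cells of the notebook

# Stratified 10-fold split, shuffled with a fixed seed for reproducibility
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=15)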

# Define the parameter grid for GridSearch
gamma_range = np.logspace(start=-5, stop=5, num=11, base=10)
C_range = np.logspace(start=-5, stop=5, num=11, base=10)

# Initialize array that accumulates validation accuracy for each (C, gamma) pair
mean_val_score = np.zeros((C_range.shape[0], gamma_range.shape[0]))

# Cross-validate every hyperparameter pair, preprocessing separately inside each fold
for train_index_opt, val_index in kfold.split(X_train, y_train):

    # Split training data
    X_train_opt, y_train_opt = X_train[train_index_opt], y_train[train_index_opt]
    X_val_opt, y_val_opt = X_train[val_index], y_train[val_index]

    # Initialize and fit the MinMaxScaler on training data only
    scaler = MinMaxScaler()
    scaler.fit(X_train_opt)

    # Transform both training and validation sets using the scaler fitted on training data
    X_train_opt = scaler.transform(X_train_opt)
    X_val_opt = scaler.transform(X_val_opt)

    # Create an instance of the SMOTE oversampler
    oversampler = SMOTE()

    # Apply oversampling
    X_train_opt_oversampled, y_train_opt_oversampled, _ = oversampler.sample(X_train_opt, y_train_opt)

    # Grid search over C and gamma parameters
    for i in range(C_range.shape[0]):
        for j in range(gamma_range.shape[0]):
            svc_opt = SVC(kernel='rbf', C=C_range[i], gamma=gamma_range[j], random_state=15)
            svc_opt.fit(X_train_opt_oversampled, y_train_opt_oversampled)
            y_pred_opt = svc_opt.predict(X_val_opt)
            mean_val_score[i,j] += accuracy_score(y_val_opt, y_pred_opt)

# Average the accumulated validation accuracy over all folds
mean_val_score = mean_val_score/kfold.get_n_splits()

# Find best hyperparameters
C_index, gamma_index = np.unravel_index(np.argmax(mean_val_score, axis=None), mean_val_score.shape)
print("C:", C_range[C_index])
print("gamma:", gamma_range[gamma_index])
print("Validation accuracy:", mean_val_score[C_index, gamma_index])
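
To illustrate why oversampling before cross-validation leaks, here is a minimal sketch that counts validation rows which also appear in the training fold. It swaps SMOTE for plain random duplication so the leak can be counted exactly; with SMOTE the leak is subtler, because synthetic points are interpolated from neighbours that may sit in the validation fold. The data and the helpers `random_oversample` and `count_leaked` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(15)

# Synthetic imbalanced data: 90 majority rows (class 0), 10 minority rows (class 1)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 90 + [1] * 10)

def random_oversample(X, y, rng):
    """Duplicate class-1 rows at random until both classes have equal size."""
    minority = np.flatnonzero(y == 1)
    n_extra = int(np.sum(y == 0)) - minority.size
    extra = rng.choice(minority, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def count_leaked(X_tr, X_val):
    """Count validation rows that also appear verbatim in the training fold."""
    train_rows = {row.tobytes() for row in X_tr}
    return sum(row.tobytes() in train_rows for row in X_val)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=15)

# Oversample first, then split: copies of one point can land on both sides
X_over, y_over = random_oversample(X_demo, y_demo, rng)
leaked_before = sum(
    count_leaked(X_over[tr], X_over[val]) for tr, val in kfold.split(X_over, y_over)
)

# Split first, then oversample only the training fold: validation rows stay unseen
leaked_after = 0
for tr, val in kfold.split(X_demo, y_demo):
    X_tr_over, _ = random_oversample(X_demo[tr], y_demo[tr], rng)
    leaked_after += count_leaked(X_tr_over, X_demo[val])

print("Leaked validation rows, oversampling before the split:", leaked_before)
print("Leaked validation rows, oversampling inside each fold:", leaked_after)
```

Oversampling inside each fold leaves the validation rows untouched, which is the same ordering the manual loop above applies with MinMaxScaler and SMOTE.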