Closed shaivimalik closed 1 month ago
Because of data leakage from the training set to the validation set within GridSearchCV, the approach for finding the optimal hyperparameters in "Training SVM - without Data Leakage" was modified as follows:
import numpy as np
from imblearn.over_sampling import SMOTE  # assumption: SMOTE comes from imbalanced-learn
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# X_train and y_train are assumed to be NumPy arrays defined earlier in the notebook
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=15)
# Define the parameter grid for the search
gamma_range = np.logspace(start=-5, stop=5, num=11, base=10)
C_range = np.logspace(start=-5, stop=5, num=11, base=10)
# Initialize array to store the summed (later averaged) validation scores
mean_val_score = np.zeros((C_range.shape[0], gamma_range.shape[0]))
# Cross-validate on the training data, preprocessing inside each fold
for train_index_opt, val_index in kfold.split(X_train, y_train):
    # Split training data
    X_train_opt, y_train_opt = X_train[train_index_opt], y_train[train_index_opt]
    X_val_opt, y_val_opt = X_train[val_index], y_train[val_index]
    # Initialize and fit the MinMaxScaler on training data only
    scaler = MinMaxScaler()
    scaler.fit(X_train_opt)
    # Transform both training and validation sets using the scaler fitted on training data
    X_train_opt = scaler.transform(X_train_opt)
    X_val_opt = scaler.transform(X_val_opt)
    # Create an instance of the SMOTE oversampler
    oversampler = SMOTE()
    # Apply oversampling to the training fold only
    X_train_opt_oversampled, y_train_opt_oversampled = oversampler.fit_resample(X_train_opt, y_train_opt)
    # Grid search over C and gamma parameters
    for i in range(C_range.shape[0]):
        for j in range(gamma_range.shape[0]):
            svc_opt = SVC(kernel='rbf', C=C_range[i], gamma=gamma_range[j], random_state=15)
            svc_opt.fit(X_train_opt_oversampled, y_train_opt_oversampled)
            y_pred_opt = svc_opt.predict(X_val_opt)
            mean_val_score[i, j] += accuracy_score(y_val_opt, y_pred_opt)
# Calculate mean validation score across all folds
mean_val_score = mean_val_score / kfold.get_n_splits()
# Find best hyperparameters
C_index, gamma_index = np.unravel_index(np.argmax(mean_val_score, axis=None), mean_val_score.shape)
print("C:", C_range[C_index])
print("gamma:", gamma_range[gamma_index])
print("Validation accuracy:", mean_val_score[C_index, gamma_index])
The hyperparameter values for SVC in 02.ipynb were obtained using GridSearchCV.
The code snippets given below can be used to validate the findings:
For Training SVM - with Data Leakage:
For Training SVM - without Data Leakage: