marcovirgolin / GP-GOMEA

Genetic Programming version of GOMEA. Also includes standard tree-based GP and Semantic Backpropagation-based GP.
Apache License 2.0

Weird prediction results in cross-validation mode. #16

Closed · hengzhe-zhang closed this issue 1 year ago

hengzhe-zhang commented 1 year ago

I found a very weird phenomenon in cross-validation mode: using scikit-learn's cross-validation function makes GP-GOMEA appear to perform very badly. I believe it is a bug. Would you be willing to take a look at this issue? Thanks!

Here is a minimal example:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
import numpy as np
# GPGR is GP-GOMEA's scikit-learn regressor, imported from the installed package
X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)
X, y = np.array(X), np.array(y)
print(cross_val_score(GPGR(generations=20), X, y))
# [-2854.69069238 -2941.72780871 -3424.92171764 -2762.38271565 -2871.62515949]
marcovirgolin commented 1 year ago

Hey @hengzhe-zhang !

For cross-validation, GP-GOMEA uses the negative mean squared error: neg_mse = - MEAN_i[ (y_i - p_i)**2 ]. This is because sklearn wants a scoring metric where higher is better. If I remember right, this is (or maybe was) a standard choice for scoring regression models in sklearn.

So, to convert the scores to mean squared errors, you can multiply by -1. Do the numbers look right to you if you do that? Or do you think the numbers are still strange for this dataset?
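For reference, here is a minimal sketch of that conversion, reusing the setup from the example above (GPGR is assumed to be GP-GOMEA's scikit-learn regressor; the exact import depends on how the package was installed):

from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
# GPGR: assumed to be GP-GOMEA's scikit-learn regressor from the installed package

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# cross_val_score reports GP-GOMEA's scoring metric: negative MSE (higher is better)
neg_mse_per_fold = cross_val_score(GPGR(generations=20), X, y)

# flip the sign to recover the per-fold mean squared error (lower is better)
mse_per_fold = -neg_mse_per_fold
print(mse_per_fold)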

hengzhe-zhang commented 1 year ago

Wow. Thanks a lot. I have changed the scoring function, and it works well now.