weak performance - Githubissues

MohammadSoltani100 commented 10 months ago

Dear RILS-ROLS Developers,

I trust this message finds you well. I am reaching out to express my interest in your symbolic regression model, RILS-ROLS, and to seek guidance on its optimal usage.

Recently, I conducted an experiment comparing the performance of machine learning models, including XGBoost and Random Forest, on a shared dataset. Unfortunately, the RILS-ROLS regressor's results fell short of my expectations compared to other models.

Recognizing RILS-ROLS as a pioneering symbolic regression technique, I am eager to enhance my utilization of its capabilities. Attached is the code used in the experiment, and I am seeking your guidance to identify potential mistakes or areas for improvement.

I value the innovation behind RILS-ROLS and believe your insights can significantly improve my understanding of this tool. Your assistance in addressing any issues or providing suggestions for improvement would be highly appreciated.

Thank you for your time and consideration. I eagerly await your expert advice.

Best Regards,

from rils_rols.rils_rols import RILSROLSRegressor
from rils_rols.rils_rols_ensemble import RILSROLSEnsembleRegressor
import numpy as np
import hydroeval as he  # `he` is presumably hydroeval (he.evaluator, he.nse, he.kge)
from sklearn.metrics import (r2_score, mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error)

# X_train_sc, X_test_sc, y_train_sc, y_test_sc (standardized), y_train, y_test
# (original scale) and yscaler are defined earlier and are not shown here.

regressors = [
    RILSROLSRegressor(max_fit_calls=100000, max_seconds=250, complexity_penalty=0.1,
                      max_complexity=200, sample_size=0.3, validation_size=0.3,
                      estimator_cnt=10, verbose=False, random_state=0),
    RILSROLSEnsembleRegressor(max_fit_calls_per_estimator=100000,
                              max_seconds_per_estimator=250, complexity_penalty=0.1,
                              max_complexity=200, sample_size=0.3, validation_size=0.3,
                              estimator_cnt=10, verbose=False, random_state=0),
]

for regressor in regressors:
    regressor.fit(X_train_sc, y_train_sc)

    # Predict the standardized train set and reverse the standardization
    srr = regressor.predict(X_train_sc)
    srr = yscaler.inverse_transform(srr.reshape(-1, 1))

    nse = he.evaluator(he.nse, y_train_sc, srr)
    kge, r, alpha, beta = he.evaluator(he.kge, y_train_sc, srr)
    mbe = np.mean(y_train_sc.ravel() - srr.ravel())

    print('Train Set Statistics for {}:'.format(regressor.__class__.__name__))
    print('R2:', r2_score(y_train_sc, srr))
    print('RMSE:', np.sqrt(mean_squared_error(y_train_sc, srr)))
    print('RRMSE', np.sqrt(mean_squared_error(y_train, srr)) / np.mean(y_train))
    print('RMSLE', np.log(np.sqrt(mean_squared_error(y_train, srr))))
    print('MSE', mean_squared_error(y_train, srr))
    print('MAE', mean_absolute_error(y_train, srr))
    print('MAPE', mean_absolute_percentage_error(y_train, srr))
    print('MBE', mbe)
    print('NSE', nse)
    print('KGE', kge)
    print('R', r)
    print('ALPHA', alpha)

    # Predict the standardized test set and reverse the standardization
    srt = regressor.predict(X_test_sc)
    srt = yscaler.inverse_transform(srt.reshape(-1, 1))

    nse = he.evaluator(he.nse, y_test_sc, srt)
    kge, r, alpha, beta = he.evaluator(he.kge, y_test_sc, srt)
    mbe = np.mean(y_test_sc.ravel() - srt.ravel())

    print('\nTest Set Statistics for {}:'.format(regressor.__class__.__name__))
    print('R2:', r2_score(y_test_sc, srt))
    print('RMSE:', np.sqrt(mean_squared_error(y_test_sc, srt)))
    print('RRMSE', np.sqrt(mean_squared_error(y_test, srt)) / np.mean(y_test))
    print('RMSLE', np.log(np.sqrt(mean_squared_error(y_test, srt))))
    print('MSE', mean_squared_error(y_test, srt))
    print('MAE', mean_absolute_error(y_test, srt))
    print('MAPE', mean_absolute_percentage_error(y_test, srt))
    print('MBE', mbe)
    print('NSE', nse)
    print('KGE', kge)
    print('R', r)
    print('ALPHA', alpha)
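
One thing that may be worth checking in the snippet above: the predictions are passed through yscaler.inverse_transform, yet R2, RMSE, NSE, KGE and MBE compare them against the standardized targets (y_train_sc, y_test_sc). If yscaler is a standard mean/variance scaler (an assumption, since its definition is not shown), mixing the two scales can make a good model look very poor. A minimal pure-Python sketch with made-up numbers, where the r2 helper follows the usual definition 1 - SS_res / SS_tot:

```python
from statistics import mean, pstdev


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical target values on their original scale.
y = [100.0, 120.0, 140.0, 160.0, 180.0]
mu, sigma = mean(y), pstdev(y)

# Standardized target, as a mean/variance scaler would produce.
y_sc = [(v - mu) / sigma for v in y]

# Pretend the model predicts the standardized target almost perfectly.
pred_sc = [v + 0.05 for v in y_sc]

# Inverse transform of the predictions back to the original scale.
pred = [p * sigma + mu for p in pred_sc]

r2_mixed = r2(y_sc, pred)  # standardized truth vs. original-scale prediction
r2_same = r2(y, pred)      # truth and prediction both on the original scale

print("R2, mixed scales:", r2_mixed)
print("R2, same scale:  ", r2_same)
```

Evaluating every metric on a single scale (either both standardized or both inverse-transformed) removes this effect, so any remaining gap to XGBoost or Random Forest reflects the model rather than the bookkeeping.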

kartelj commented 10 months ago

Hello Mohammad,

Thank you for your interest in our algorithm.

RILS-ROLS version 1.2 (pip install rils-rols==1.2), which you have probably used, was empirically tested against the SRBench ground-truth instances. We did not experiment with the black-box instances, but it should work on those as well. I suppose the difference is that the small training sample of 1% (0.01) used for the ground-truth instances is not suitable in that case. The drawback is that with a larger sample rate, e.g. 100%, it will run very slowly.

Therefore, I recommend that you use a large sample rate in combination with a new RILS-ROLS version (currently still a work in progress) that is rewritten in C++. This version should be several hundred times faster than the previous pure Python version.
You can install it with the following command, but be sure to uninstall the old version first or use the --force-reinstall option: pip install rils-rols==1.5.4
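
After reinstalling, it can help to confirm which version Python actually picks up, since a leftover old install would hide the speed-up. A small check using the standard library's importlib.metadata; the distribution name rils-rols and the version 1.5.4 are taken from the pip command above, and installed_version is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version("rils-rols")
if found is None:
    print("rils-rols is not installed in this environment")
elif found != "1.5.4":
    print(f"found rils-rols {found}; an older install may still be in place")
else:
    print("rils-rols 1.5.4 is installed")
```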

The instructions for using this version are a bit different: some parameters were removed and others were added. The ensemble regressor has been removed, and we are working on a symbolic classifier. You can find the instructions on this branch (it will be merged to master soon!): https://github.com/kartelj/rils-rols/tree/binary-classifier-new

Please let me know whether you managed to install it and whether the results are better (there is an example there for the diabetes toy set). I suggest using the default parameters, except for the sample size: its default is 10%, but set it to 100% in your case (we will probably make 100% the default in the next version). You might also want to experiment with the max_complexity parameter (default 200). When expressions are too large, they tend to overfit the data, so reducing this parameter might help as well.
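
The suggestion to try smaller max_complexity values can be turned into a simple sweep scored on held-out data. The sketch below is only an illustration: BinnedMeanRegressor is a hypothetical stand-in model (not RILS-ROLS) whose "complexity" is its number of bins, pick_complexity is a made-up helper, and the data are synthetic. With the real library, the factory passed to pick_complexity would instead build the regressor with the chosen max_complexity; the exact constructor arguments of the newer version should be checked against its README.

```python
import random


class BinnedMeanRegressor:
    """Stand-in model (NOT RILS-ROLS): predicts the mean of y within equal-width bins of x.

    max_complexity is simply the number of bins; more bins give a more
    flexible model that can chase noise in the training data.
    """

    def __init__(self, max_complexity):
        self.n_bins = max_complexity

    def _bin(self, x):
        # x is assumed to lie in [0, 1]; clamp so x == 1.0 falls in the last bin.
        return min(int(x * self.n_bins), self.n_bins - 1)

    def fit(self, xs, ys):
        sums = [0.0] * self.n_bins
        counts = [0] * self.n_bins
        for x, y in zip(xs, ys):
            b = self._bin(x)
            sums[b] += y
            counts[b] += 1
        overall = sum(ys) / len(ys)
        # Empty bins fall back to the overall training mean.
        self.means = [s / c if c else overall for s, c in zip(sums, counts)]
        return self

    def predict(self, xs):
        return [self.means[self._bin(x)] for x in xs]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pick_complexity(make_model, grid, train, valid):
    """Fit one model per cap; return (best_cap, {cap: (train_mse, valid_mse)})."""
    (x_tr, y_tr), (x_va, y_va) = train, valid
    scores = {}
    for cap in grid:
        model = make_model(cap).fit(x_tr, y_tr)
        scores[cap] = (mse(y_tr, model.predict(x_tr)), mse(y_va, model.predict(x_va)))
    # Select by validation error, not training error.
    best = min(grid, key=lambda cap: scores[cap][1])
    return best, scores


# Synthetic data: a linear trend plus noise (for illustration only).
rng = random.Random(0)
xs = [rng.random() for _ in range(120)]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
train = (xs[:80], ys[:80])
valid = (xs[80:], ys[80:])

grid = [1, 2, 4, 8, 16, 32, 64]
best, scores = pick_complexity(BinnedMeanRegressor, grid, train, valid)
for cap in grid:
    print(cap, scores[cap])
print("selected cap:", best)
```

Training error can only go down as the cap grows here, so it says nothing about overfitting; the validation column is the one to watch when deciding how far to reduce the cap.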

In case you still have bad results, let us know, so we can test on your data and accordingly improve our method.

Best regards, Aleksandar

kartelj commented 10 months ago

Follow up on this:

I have run version 1.5.4 on the black-box instances from SRBench and the results are quite good. You can find them attached. Information on the parameters used is also inside.

Best regards, Aleksandar

black-box.xlsx

kartelj commented 9 months ago

In the meantime, the new C++ version was merged to master, and a minimal working example Colab notebook is available in the readme.md. I suppose this issue is solved, so I am closing it.

kartelj / rils-rols

weak performance #3