heal-research / pyoperon

Python bindings and scikit-learn interface for the Operon library for symbolic regression.
MIT License

Feature request: Validation data metrics for model selection #19

Open romanovzky opened 3 weeks ago

romanovzky commented 3 weeks ago

Currently, SymbolicRegressor returns the model that best satisfies a given criterion. This criterion, however, is computed on the training set. Machine learning best practice dictates that model selection should be done on a validation set. At the moment this can be "hacked" by selecting the best Pareto-front individual against a validation metric after the SymbolicRegressor completes its run. With callbacks (see https://github.com/heal-research/pyoperon/issues/18), this feature could also enable early-stopping criteria based on the validation set, which is common in machine learning packages with iterative training (see Keras, Lightning, XGBoost, etc. for examples).
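For illustration, the post-hoc "hack" described above might look like the following sketch. The candidate models here are plain callables standing in for Pareto-front individuals; pyoperon's actual front representation and evaluation API are not assumed.

```python
# Sketch of post-hoc model selection on a validation set: after a symbolic
# regression run, re-score every Pareto-front candidate on held-out data and
# keep the one with the lowest validation error. Stand-in models, not pyoperon
# objects.

def mse(y_true, y_pred):
    """Mean squared error over paired sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def select_on_validation(pareto_front, X_val, y_val):
    """Return (validation_mse, model) for the best candidate."""
    scored = [(mse(y_val, [model(x) for x in X_val]), model)
              for model in pareto_front]
    return min(scored, key=lambda s: s[0])

# Hypothetical Pareto front: candidates of increasing complexity.
front = [
    lambda x: 2.0,               # constant model
    lambda x: 1.9 * x,           # near-correct linear model
    lambda x: 2.0 * x + x ** 3,  # over-complex model
]
X_val = [0.0, 1.0, 2.0]
y_val = [0.0, 2.0, 4.0]  # true relation: y = 2x

best_err, best_model = select_on_validation(front, X_val, y_val)
```

With a callback mechanism, the same scoring could run during evolution instead of only after the fact, enabling early stopping when the validation error stops improving.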

gkronber commented 3 weeks ago

I like the idea of using the callback mechanism for this, so that users have different options for model selection. Selecting based on a validation set could be a good default. Other options are selection based on criteria such as Bayesian evidence, AIC, BIC or description length, but these could be easily added by users once the callback mechanism is in place.
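One of the user-supplied criteria mentioned above could be sketched as follows, assuming Gaussian residuals so that AIC and BIC reduce to closed forms in the residual sum of squares; the complexity values and candidate numbers are made up for illustration.

```python
import math

# Sketch of information-criterion model selection (AIC/BIC) as an alternative
# to a validation set. Under a Gaussian error model with n samples, RSS the
# residual sum of squares, and k a complexity measure (e.g. expression size):
#   AIC = n * ln(RSS / n) + 2 * k
#   BIC = n * ln(RSS / n) + k * ln(n)

def aic(rss, n, k):
    return n * math.log(rss / n) + 2 * k

def bic(rss, n, k):
    return n * math.log(rss / n) + k * math.log(n)

# Hypothetical candidates: (training RSS, complexity).
candidates = [
    (5.0, 2),   # simple but poor fit
    (1.2, 5),   # good fit, moderate complexity
    (1.1, 12),  # marginally better fit, much more complex
]
n = 50  # number of training samples

best_aic = min(candidates, key=lambda c: aic(c[0], n, c[1]))
best_bic = min(candidates, key=lambda c: bic(c[0], n, c[1]))
```

Both criteria penalize the most complex candidate despite its slightly lower training error, which is exactly the trade-off a callback-based selection hook would let users plug in.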