heal-research / pyoperon

Python bindings and scikit-learn interface for the Operon library for symbolic regression.
MIT License
34 stars 10 forks source link

How to get the prediction of all models in the final population? #5

Closed hengzhe-zhang closed 1 year ago

hengzhe-zhang commented 1 year ago

Hi! I want to implement an ensemble model based on the final population or an external archive, like [1]. However, I cannot find a way to get the prediction of all models in the final population. Can you help me with this? Thanks a lot!

[1]. Zhang, Hengzhe, Aimin Zhou, and Hu Zhang. "An Evolutionary Forest for Regression." IEEE Transactions on Evolutionary Computation (2021).

foolnotion commented 1 year ago

Hi,

This is currently not supported by the sklearn module, however it is rather straightforward to add this yourself:

$ git diff pyoperon
diff --git a/pyoperon/sklearn.py b/pyoperon/sklearn.py
index 8e9e4c2..d8f5509 100644
--- a/pyoperon/sklearn.py
+++ b/pyoperon/sklearn.py
@@ -320,7 +320,7 @@ class SymbolicRegressor(BaseEstimator, RegressorMixin):

         model_vars = [self.inputs_[h] for h in hashes]
         names_map = {v.Hash : v.Name for v in model_vars} if names is None else {v.Hash : names[v.Index] for v in model_vars}
-        return op.InfixFormatter.Format(self.model_, names_map, precision)
+        return op.InfixFormatter.Format(model, names_map, precision)

     def fit(self, X, y):
@@ -475,6 +475,8 @@ class SymbolicRegressor(BaseEstimator, RegressorMixin):
             'random_state': self.random_state
         }

+        self.individuals_ = gp.Individuals
+
         self.is_fitted_ = True
         # `fit` should always return `self`
         return self

Once you've made the above changes to sklearn.py (you'll find this file in the install location of pyoperon which you can find with pip show pyoperon), you can retrieve the population like this:

reg = SymbolicRegressor(...)
reg.fit(X_train, y_train)
individuals = [reg.get_model_string(x.Genotype) for x in reg.individuals_ ]
parents = individuals[reg.population_size:]
offspring = individuals[:-reg.pool_size]

This will get you the infix strings for the parents and offspring (normally you wouldn't need the offspring, since by the time the algorithm is finished they are already merged back into the parent population).

If you just want the predictions, you can do:

models = [x.Genotype for x in reg.individuals_]
for m in models:
    values = reg.evaluate_model(m, X_train)

Alternatively, you could also directly use the python bindings without the sklearn interface, you can find an example here: https://github.com/heal-research/pyoperon/blob/cpp20/example/operon-bindings.py

foolnotion commented 1 year ago

This functionality will probably be included in the next release (pyoperon-0.3.5), in the meanwhile I will leave this issue open in case you run into any trouble.

hengzhe-zhang commented 1 year ago

Thanks for your prompt answer. However, I find these individuals perform poorly compared to the best model. Could you further check this? I have tried to get the best model based on the training error from the final population. However, the performance is still to be worse than the best individual. It is very weird.

I get predictions by using the following line of code:

prediction = [est.evaluate_model(x.Genotype, X) for x in est.individuals_]
foolnotion commented 1 year ago

The predictions you get are not scaled, while the best model returned by the SymbolicRegressor is scaled to the target using linear scaling: https://github.com/heal-research/pyoperon/blob/cpp20/pyoperon/sklearn.py#L446. Basically, the tree gets extended with two additional nodes at the root which perform the scaling.

If you scale the models you should get the same results (at least for the best model). Here, I am just applying the scaling, but can add the scale and offset as nodes to the model as shown in the snippet above.

models = [x.Genotype for x in reg.individuals_]                                                                                                                                                                                                
for m in models:                                                                                                                                                                                                                               
    y_pred = reg.evaluate_model(m, X_train)                                                                                                                                                                                                    
    r2 = r2_score(y_train, y_pred)                                                                                                                                                                                                             
    scale, offset = FitLeastSquares(y_pred, y_train)                                                                                                                                                                                           
    r2_scaled = r2_score(y_train, scale * y_pred + offset)                                                                                                                                                                                     
    print(r2, r2_scaled)
hengzhe-zhang commented 1 year ago

Thanks! I have tried your code. However, the results are still much worse. I doubt individuals stored in gp.Individuals are not the final individuals?

foolnotion commented 1 year ago

I need to have a look again, but I am pretty sure that the individuals are final. One thing to note is that only the first N=|pop| individuals from the gp.Individuals array represent the final population.

Could you share a piece of code which demonstrates the issue you are seeing? This would make it much easier to debug.

hengzhe-zhang commented 1 year ago

Thanks. Here is a link to the problematic part of my code. Is that convenient for you? Or do more need a cleaner example? https://github.com/hengzhe-zhang/SR-Forest/blob/c664a3cb588e81e149a560a58b59707feb8eb2bf/sr_forest/operon_ensemble.py#L509-L518

foolnotion commented 1 year ago

Thanks, I can run your code, will return with a comment when I figure out what's going on there.

foolnotion commented 1 year ago

Hi,

I think your code had some bugs, mainly https://github.com/hengzhe-zhang/SR-Forest/blob/c664a3cb588e81e149a560a58b59707feb8eb2bf/sr_forest/operon_ensemble.py#L512 seems suspicious - where does that y come from?

With the following changes to operon_ensemble.py, everything seems to work:

461c461
<         self.individuals_ = gp.Individuals
---
>         self.individuals_ = [ get_solution_stats(x)[0] for x in gp.Individuals[:self.population_size] ]
509,517c509
<         for id, x in enumerate(est.individuals_):
<             if not id in self.linear_scaling:
<                 y_pred = est.evaluate_model(x.Genotype, X)
<                 scale, offset = op.FitLeastSquares(y_pred, y)
<                 nodes = x.Genotype.Nodes + [op.Node.Constant(scale), op.Node.Mul(), op.Node.Constant(offset),
<                                             op.Node.Add()]
<                 x.Genotype = op.Tree(nodes).UpdateNodes()
<                 self.linear_scaling.add(id)
<         prediction = [est.evaluate_model(x.Genotype, X) for x in est.individuals_]
---
>         prediction = [est.evaluate_model(x, X) for x in est.individuals_]
527c519
<     e = OperonX(generations=10)
---
>     e = OperonX(generations=10, population_size=100)
529,530c521,522
<     print(r2_score(y_train, e.predict(x_train)))
<     print(r2_score(y_test, e.predict(x_test)))
---
>     print('operon train', r2_score(y_train, e.predict(x_train)))
>     print('operon test', r2_score(y_test, e.predict(x_test)))
532c524
<     e = OperonForest(generations=10)
---
>     e = OperonForest(generations=10, population_size=100)
534,535c526,527
<     print(r2_score(y_train, e.predict(x_train)))
<     print(r2_score(y_test, e.predict(x_test)))
---
>     print('operon forest train', r2_score(y_train, e.predict(x_train)))
>     print('operon forest test', r2_score(y_test, e.predict(x_test)))
537c529
<     e = OperonForest(generations=10, decision_tree=DecisionTreeRegressor(splitter='random'))
---
>     e = OperonForest(generations=10, population_size=100, decision_tree=DecisionTreeRegressor(splitter='random'))
539c531,532
<     print(r2_score(y_test, e.predict(x_test)))
---
>     print('operon forest+DT train', r2_score(y_train, e.predict(x_train)))
>     print('operon forest+DT test', r2_score(y_test, e.predict(x_test)))

This outputs:

$ python operon_ensemble.py 
/.mamba/envs/srcomp-operon/lib/python3.10/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.

    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_housing
        housing = fetch_california_housing()

    for the California housing dataset and::

        from sklearn.datasets import fetch_openml
        housing = fetch_openml(name="house_prices", as_frame=True)

    for the Ames housing dataset.
  warnings.warn(msg, category=FutureWarning)
operon train 0.6883213909269199
operon test 0.5016739531419744
operon forest train 0.6824081991377764
operon forest test 0.5756674827531514
operon forest+DT train 1.0
operon forest+DT test 0.7364281744329407
hengzhe-zhang commented 1 year ago

Excellent! It seems that Operon works very well in SR-Forest. Thanks for your help.