heal-research / pyoperon

Python bindings and scikit-learn interface for the Operon library for symbolic regression.
MIT License
42 stars 12 forks source link

Model length does not coincide with the usual definition #24

Closed Smantii closed 2 weeks ago

Smantii commented 2 weeks ago

Hi,

I am sorry to ask questions (again) through an issue. I take this opportunity to thank all the Operon team because this library is great! I have a doubt related to the definition used for length and max_length. In the Operon paper the length of an individual is defined as the number of nodes in its tree representation. But, when I ran the following example

from sklearn.model_selection import train_test_split

from pyoperon.sklearn import SymbolicRegressor
from pmlb import regression_dataset_names, fetch_data

X, y = fetch_data('1027_ESL', return_X_y=True, local_cache_dir='./datasets')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True)

reg = SymbolicRegressor(
        allowed_symbols='add,sub,mul,aq,sin,constant,variable',
        offspring_generator='basic',
        generations=10,
        optimizer_iterations=10,
        max_length=20,
        initialization_method='btc',
        n_threads=8,
        objectives = ['r2'],
        max_evaluations=int(1e6))

reg.fit(X_train, y_train)
print(reg.get_model_string(reg.model_, precision=10))
print(reg.stats_)

I got the following output

(0.0145737370 + (0.9973683357 * (((2.1865003109 * X3) / (sqrt(1 + (2.9363973141 * X2) ^ 2))) - ((sin(((0.7525053024 * X2) + ((-0.2822726369) * X4))) * ((-0.1257296056) * X4)) - ((((4.4178400040 * X2) + ((-2.0412197113) * X1)) + (2.8924779892 * X4)) * ((0.0185590051 * X1) - ((-0.0078712795) * X3)))))))
{'model_length': 20, 'model_complexity': 40, 'generations': 10, 'evaluation_count': 10991, 'residual_evaluations': 80430, 'jacobian_evaluations': 47094, 'random_state': 5542950052447429511}

and the final expressions has length higher than 20. I am sure that I am missing something, can you help me to understand what to do you mean by model_length?

gkronber commented 2 weeks ago

There are several separate issues that lead to operon producing this expression which is longer than 20 symbols. The main reason is that the internal representation for expressions is different from the representation chosen for the output.

  1. Operon implicitly uses a coefficient for each variable symbol but counts this internally only as a single tree node for length. The solutions also have a model_complexity which also includes the coefficients and corresponding multiplication symbols. Note that both measures do not count the linear scaling terms (see 2.). https://github.com/heal-research/pyoperon/blob/7130a875b5565a28403df1746822107412d455ee/pyoperon/sklearn.py#L565-L573
  2. Operon adds a scaling factor and a constant on the top of the expression. These are four additional symbols that are not counted in length and model_complexity. Linear scaling can be turned off, alternatively reduce the length limit by four nodes.
  3. Your expression uses the analytic quotient (aq) function aq(a,b) = a/sqrt(b²+1) internally. Because this is not a common function available in all programming languages, it is expanded in the final expression. An option is to disable the aq symbol.

The expression: (0.0145737370 + (0.9973683357 * (((2.1865003109 * X3) / (sqrt(1 + (2.9363973141 * X2) ^ 2))) - ((sin(((0.7525053024 * X2) + ((-0.2822726369) * X4))) * ((-0.1257296056) * X4)) - ((((4.4178400040 * X2) + ((-2.0412197113) * X1)) + (2.8924779892 * X4)) * ((0.0185590051 * X1) - ((-0.0078712795) * X3)))))))

is internally represented as: offset + scale * (aq(c1_x3, c2_x2) - (sin(c3_x2 + c4_x4) * c5_x4 - ((c6_x2 + c7_x1 + c8_x4) * c9_x1 - c10_x3)))

Ignoring the 4 symbols for offset and scale this has exactly 20 symbols.

Smantii commented 2 weeks ago

Thank you so much @gkronber, your answer clarifies everything