Model length does not coincide with the usual definition

heal-research / pyoperon

Python bindings and scikit-learn interface for the Operon library for symbolic regression.

MIT License

42 stars 12 forks source link

Hi,

I am sorry to ask questions (again) through an issue. I take this opportunity to thank all the Operon team because this library is great! I have a doubt related to the definition used for length and max_length. In the Operon paper the length of an individual is defined as the number of nodes in its tree representation. But, when I ran the following example

from sklearn.model_selection import train_test_split

from pyoperon.sklearn import SymbolicRegressor
from pmlb import regression_dataset_names, fetch_data

X, y = fetch_data('1027_ESL', return_X_y=True, local_cache_dir='./datasets')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True)

reg = SymbolicRegressor(
        allowed_symbols='add,sub,mul,aq,sin,constant,variable',
        offspring_generator='basic',
        generations=10,
        optimizer_iterations=10,
        max_length=20,
        initialization_method='btc',
        n_threads=8,
        objectives = ['r2'],
        max_evaluations=int(1e6))

reg.fit(X_train, y_train)
print(reg.get_model_string(reg.model_, precision=10))
print(reg.stats_)

I got the following output

(0.0145737370 + (0.9973683357 * (((2.1865003109 * X3) / (sqrt(1 + (2.9363973141 * X2) ^ 2))) - ((sin(((0.7525053024 * X2) + ((-0.2822726369) * X4))) * ((-0.1257296056) * X4)) - ((((4.4178400040 * X2) + ((-2.0412197113) * X1)) + (2.8924779892 * X4)) * ((0.0185590051 * X1) - ((-0.0078712795) * X3)))))))
{'model_length': 20, 'model_complexity': 40, 'generations': 10, 'evaluation_count': 10991, 'residual_evaluations': 80430, 'jacobian_evaluations': 47094, 'random_state': 5542950052447429511}

and the final expressions has length higher than 20. I am sure that I am missing something, can you help me to understand what to do you mean by model_length?

There are several separate issues that lead to operon producing this expression which is longer than 20 symbols. The main reason is that the internal representation for expressions is different from the representation chosen for the output.

Operon implicitly uses a coefficient for each variable symbol but counts this internally only as a single tree node for length. The solutions also have a model_complexity which also includes the coefficients and corresponding multiplication symbols. Note that both measures do not count the linear scaling terms (see 2.). https://github.com/heal-research/pyoperon/blob/7130a875b5565a28403df1746822107412d455ee/pyoperon/sklearn.py#L565-L573
Operon adds a scaling factor and a constant on the top of the expression. These are four additional symbols that are not counted in length and model_complexity. Linear scaling can be turned off, alternatively reduce the length limit by four nodes.
Your expression uses the analytic quotient (aq) function aq(a,b) = a/sqrt(b²+1) internally. Because this is not a common function available in all programming languages, it is expanded in the final expression. An option is to disable the aq symbol.

The expression: (0.0145737370 + (0.9973683357 * (((2.1865003109 * X3) / (sqrt(1 + (2.9363973141 * X2) ^ 2))) - ((sin(((0.7525053024 * X2) + ((-0.2822726369) * X4))) * ((-0.1257296056) * X4)) - ((((4.4178400040 * X2) + ((-2.0412197113) * X1)) + (2.8924779892 * X4)) * ((0.0185590051 * X1) - ((-0.0078712795) * X3)))))))

is internally represented as: offset + scale * (aq(c1_x3, c2_x2) - (sin(c3_x2 + c4_x4) * c5_x4 - ((c6_x2 + c7_x1 + c8_x4) * c9_x1 - c10_x3)))

Ignoring the 4 symbols for offset and scale this has exactly 20 symbols.

heal-research / pyoperon

Model length does not coincide with the usual definition #24