heal-research / pyoperon

Python bindings and scikit-learn interface for the Operon library for symbolic regression.
MIT License

ValueError: Input contains NaN. #6

Closed: hengzhe-zhang closed this issue 1 year ago

hengzhe-zhang commented 1 year ago

I got an error when running Operon multiple times.

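import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sr_forest.operon_forest import OperonX  # assumed location, based on the traceback below

X, y = fetch_openml(data_id=1089, return_X_y=True)
X = StandardScaler().fit_transform(X)
X, y = np.array(X), np.array(y)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
simple_operon = []
for _ in range(20):
    e = OperonX(generations=100, population_size=100)
    e.fit(x_train, y_train)
    print(r2_score(y_train, e.predict(x_train)))
    print(r2_score(y_test, e.predict(x_test)))
    simple_operon.append(r2_score(y_test, e.predict(x_test)))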

The error information is as follows:

Traceback (most recent call last):
  File "/tmp/pycharm_project_44/example/performance_evaluation.py", line 22, in <module>
    e.fit(x_train, y_train)
  File "/tmp/pycharm_project_44/sr_forest/operon_forest.py", line 461, in fit
    self.individuals_ = [get_solution_stats(x)[0] for x in gp.Individuals[:self.population_size]]
  File "/tmp/pycharm_project_44/sr_forest/operon_forest.py", line 461, in <listcomp>
    self.individuals_ = [get_solution_stats(x)[0] for x in gp.Individuals[:self.population_size]]
  File "/tmp/pycharm_project_44/sr_forest/operon_forest.py", line 438, in get_solution_stats
    mse = mean_squared_error(y, y_pred * scale + offset)
  File "/vol/ecrg-solar/zhangheng1/anaconda3/envs/gpgomenv/lib/python3.8/site-packages/sklearn/metrics/_regression.py", line 442, in mean_squared_error
    y_type, y_true, y_pred, multioutput = _check_reg_targets(
  File "/vol/ecrg-solar/zhangheng1/anaconda3/envs/gpgomenv/lib/python3.8/site-packages/sklearn/metrics/_regression.py", line 102, in _check_reg_targets
    y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)
  File "/vol/ecrg-solar/zhangheng1/anaconda3/envs/gpgomenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 899, in check_array
    _assert_all_finite(
  File "/vol/ecrg-solar/zhangheng1/anaconda3/envs/gpgomenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 146, in _assert_all_finite
    raise ValueError(msg_err)
ValueError: Input contains NaN.

Process finished with exit code 1

I guess the reason is that some SR models predict NaN values, which causes scikit-learn to raise this error. However, I don't know how to fix it. Can you help me with this? Thanks.

Here is a reproducible example.

foolnotion commented 1 year ago

Hi,

Thanks for reporting this bug. The models can indeed predict NaN values, leading to that error.
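To illustrate how a symbolic model can end up predicting NaN, here is a minimal sketch. The `toy_model` function is a hypothetical stand-in for an evolved expression and is not taken from Operon; it only assumes that an expression may contain an operation such as a logarithm, which is undefined for the negative inputs that standardized features typically contain.

```python
import numpy as np

def toy_model(x):
    """Hypothetical evolved expression: log(x), undefined for x < 0."""
    # Silence NumPy's floating-point warnings so the NaN shows up only in the result.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.log(x)

x = np.array([-1.0, 0.5, 2.0])   # standardized data usually includes negative values
y_pred = toy_model(x)

has_nan = bool(np.isnan(y_pred).any())      # True if any prediction is NaN
n_bad = int((~np.isfinite(y_pred)).sum())   # number of NaN or infinite predictions
print(has_nan, n_bad)
```

Checking the predictions with `np.isfinite` before handing them to a scikit-learn metric shows which individuals are affected.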

I think the problem is here at line 438: https://github.com/hengzhe-zhang/SR-Forest/blob/master/sr_forest/operon_forest.py#L438

mse = mean_squared_error(y, y_pred * scale + offset)

This code should fix it:

import sys

try:
    mse = mean_squared_error(y, y_pred * scale + offset)
except ValueError:
    # scikit-learn raises ValueError when the prediction contains NaN or infinity
    mse = sys.maxsize  # or whatever you find appropriate
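As an alternative to catching the exception, the finiteness check can be made explicit. The sketch below is only an illustration under assumptions: `safe_mse` and its `penalty` parameter are hypothetical names that do not appear in pyoperon or SR-Forest, and the MSE is computed with NumPy directly rather than with `mean_squared_error`.

```python
import numpy as np

def safe_mse(y_true, y_pred, scale=1.0, offset=0.0,
             penalty=np.finfo(np.float64).max):
    """MSE of the linearly scaled prediction, or `penalty` if it is not finite."""
    y_true = np.asarray(y_true, dtype=float)
    y_scaled = np.asarray(y_pred, dtype=float) * scale + offset
    # Reject NaN or infinite predictions before computing the error.
    if not np.all(np.isfinite(y_scaled)):
        return float(penalty)
    # Very large finite predictions can still overflow when squared.
    with np.errstate(over="ignore"):
        mse = float(np.mean((y_true - y_scaled) ** 2))
    return mse if np.isfinite(mse) else float(penalty)
```

One reason to prefer an explicit check is that `except ValueError` also hides unrelated problems, such as `y` and `y_pred` having different lengths, which scikit-learn reports with the same exception type.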