heal-research / operon

C++ Large Scale Genetic Programming
https://operongp.readthedocs.io
MIT License
144 stars 26 forks

ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). #15

Closed hengzhe-zhang closed 3 years ago

hengzhe-zhang commented 3 years ago

Recently, I found that this package raises an exception when my dataset resembles the following one. I have no idea why this exception happens, since the dataset looks normal. Is there any possible solution to this problem?

import numpy as np
from operon.sklearn import SymbolicRegressor
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 5)
y = np.ones(100)
est = SymbolicRegressor(
    local_iterations=5,
    generations=10,  # just large enough since we have an evaluation budget
    n_threads=1,
    random_state=None,
    time_limit=2 * 60 * 60,  # 2 hours
    max_evaluations=int(5e5),
    population_size=10
)
print(cross_val_score(est, X, y))
print(y)
foolnotion commented 3 years ago

Hi,

You are trying to train a model on random data. It can happen that inside cross-validation, the model prediction on the test fold is NaN, infinity or too large, hence the error. My suggestion would be to just catch this exception when it happens, or implement some logic to handle the invalid values in the model prediction.
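One way to follow this suggestion is to sanitize the model's predictions inside a custom scorer before the metric sees them, so that a NaN or infinite prediction on a test fold does not abort the whole cross-validation. This is only an illustrative workaround sketch (the helper name `nan_safe_r2` is made up here, not part of operon or scikit-learn):

```python
import numpy as np
from sklearn.metrics import r2_score, make_scorer

def nan_safe_r2(y_true, y_pred):
    """Replace non-finite predictions before scoring.

    Illustrative workaround: a model that predicts NaN/inf on a fold
    gets a (bad) finite score instead of raising ValueError.
    """
    y_pred = np.nan_to_num(y_pred, nan=0.0, posinf=0.0, neginf=0.0)
    return r2_score(y_true, y_pred)

# Pass scoring=make_scorer(nan_safe_r2) to cross_val_score
safe_scorer = make_scorer(nan_safe_r2)
```

The clamping values chosen for `np.nan_to_num` are arbitrary; any invalid prediction simply yields a poor but finite score for that fold.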

hengzhe-zhang commented 3 years ago

I believe this problem is related to the implementation of this package. Other machine learning algorithms, such as linear regression, work well in this situation. For example, the following code does not work properly even though the setup is rather common.

import numpy as np
from operon.sklearn import SymbolicRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)
y = np.ones(100)
est = SymbolicRegressor(
    local_iterations=5,
    generations=10,  # just large enough since we have an evaluation budget
    n_threads=1,
    random_state=None,
    time_limit=2 * 60 * 60,  # 2 hours
    max_evaluations=int(5e5),
    population_size=10
)
# est = LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
est.fit(x_train, y_train)
print(est.predict(x_test))
print(np.any(np.isnan(est.predict(x_test))))

In fact, I'm trying to replicate the experiments of the "srbench" benchmark (https://github.com/EpistasisLab/srbench/), and this issue occurs during the run.

foolnotion commented 3 years ago

You were right, there was a bug in the Python wrapper. After a model is fit, we try to apply linear scaling to bring it into the range of the target: https://github.com/heal-research/operon/blob/master/python/operon/sklearn.py#L318 However, this did not work correctly when the variance of the target was zero (y = np.ones(100)). Should be fixed in e7b83262c1163236144d6f59a4c5dbdd0b5f2c7b
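For readers following along, linear scaling fits y ≈ a·pred + b by least squares; the degenerate case is a zero variance on either side of the fit. Below is a minimal sketch of the idea with a guard for constant inputs. This is not operon's actual implementation (the function name and the `eps` threshold are assumptions for illustration):

```python
import numpy as np

def linear_scale(y_pred, y_true, eps=1e-12):
    """Least-squares fit y_true ~ a * y_pred + b, guarding degenerate cases.

    Illustrative sketch only, not operon's code. When the prediction is
    (numerically) constant, fall back to predicting the target mean; when
    the target is constant, cov(pred, target) is zero, so a = 0 and the
    result collapses to the target mean as well, with no NaN/inf produced.
    """
    var_pred = np.var(y_pred)
    if var_pred < eps:
        # constant prediction: slope is undefined, offset to the target mean
        return np.full_like(y_pred, np.mean(y_true))
    a = np.cov(y_pred, y_true, bias=True)[0, 1] / var_pred
    b = np.mean(y_true) - a * np.mean(y_pred)
    return a * y_pred + b
```

With y = np.ones(100) as in the report, the covariance term is zero, so the scaled output is simply the constant target mean rather than NaN.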