natekupp / ffx

Fast Function Extraction
http://trent.st/ffx

FFX sometimes crashes with NaN or infinity with apparently simple data #4

Closed by jmmcd 11 years ago

jmmcd commented 11 years ago

This file crashes:

#!/usr/bin/env python

import numpy as np
import ffx

# This creates a dataset of 1 predictor
train_X = np.array([[0, 1, 2, 3]]).T
train_y = np.array([0, 1, 4, 9])

test_X = np.array([[4, 5, 6, 7]]).T
test_y = np.array([16, 25, 36, 49])

models = ffx.run(train_X, train_y, test_X, test_y, ["x"])
Traceback (most recent call last):
  File "./test2.py", line 13, in <module>
    models = ffx.run(train_X, train_y, test_X, test_y, ["x"])
  File "/Users/jmmcd/Documents/vc/ffx/ffx/api.py", line 4, in run
    return core.MultiFFXModelFactory().build(train_X, train_y, test_X, test_y, varnames, verbose)
  File "/Users/jmmcd/Documents/vc/ffx/ffx/core.py", line 443, in build
    next_models = FFXModelFactory().build(train_X, train_y, ss, varnames, verbose)
  File "/Users/jmmcd/Documents/vc/ffx/ffx/core.py", line 584, in build
    ss, varnames, order1_bases, X, y, max_num_bases, target_train_nmse, verbose)
  File "/Users/jmmcd/Documents/vc/ffx/ffx/core.py", line 683, in _basesToModels
    max_num_bases, target_train_nmse, verbose)
  File "/Users/jmmcd/Documents/vc/ffx/ffx/core.py", line 729, in _pathwiseLearn
    clf.fit(X_unbiased, y_unbiased, coef_init=cur_unbiased_coefs)
  File "/Users/jmmcd/Documents/vc/ffx/ffx/core.py", line 849, in new_f
    result = f(*args, **kwargs)
  File "/Users/jmmcd/Documents/vc/ffx/ffx/core.py", line 861, in fit
    return ElasticNet.fit(self, *args, **kwargs)
  File "/Users/jmmcd/Documents/dev/anaconda/lib/python2.7/site-packages/sklearn/linear_model/coordinate_descent.py", line 179, in fit
    copy=self.copy_X and self.fit_intercept)
  File "/Users/jmmcd/Documents/dev/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 107, in atleast2d_or_csc
    "tocsc")
  File "/Users/jmmcd/Documents/dev/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 96, in _atleast2d_or_sparse
    X = array2d(X, dtype=dtype, order=order, copy=copy)
  File "/Users/jmmcd/Documents/dev/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 81, in array2d
    _assert_all_finite(X_2d)
  File "/Users/jmmcd/Documents/dev/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 18, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

Whereas this one works OK -- the only difference is the data:

#!/usr/bin/env python

import numpy as np
import ffx

# This creates a dataset of 1 predictor
train_X = np.array([[0, 1, 2, 3]]).T
train_y = np.array([1, 2, 3, 4])

test_X = np.array([[4, 5, 6, 7]]).T
test_y = np.array([5, 6, 7, 8])

models = ffx.run(train_X, train_y, test_X, test_y, ["x"])
doccosmos commented 11 years ago

This is true for the real-world test data sets on the ffx home page as well.

jmmcd commented 11 years ago

Thanks for the extra report. I've just had another look. I think it happens when we "unbias" the data, i.e. normalise it to mean 0 and stddev 1. Dividing by the stddev gives NaN if the stddev is 0. In that case the variable is actually constant, so I think it's safe to just replace it with zero. I'm not sure whether any change is needed when rebiasing.
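To see the failure mode in isolation (plain NumPy, no FFX internals; the constant column here is a made-up example, not one of FFX's actual bases):

```python
import numpy as np

# A column with zero variance: every value is the same.
col = np.array([5.0, 5.0, 5.0, 5.0])

std = col.std()                        # 0.0
unbiased = (col - col.mean()) / std    # 0/0 -> NaN for every entry

# These NaNs are what later trip sklearn's finiteness check
# ("Array contains NaN or infinity.") inside ElasticNet.fit.
print(std, unbiased)
```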

Anyway, please try this replacement for _unbiasedXy. It fixes my test above. There's probably a more idiomatic way to express this in numpy.

    def _unbiasedXy(self, Xin, yin):
        """Scale each column of X, and y, to have mean=0 stddev=1 """
        #unbias X
        X_avgs, X_stds = Xin.mean(0), Xin.std(0)
        X_unbiased = (Xin - X_avgs) / X_stds
        #a zero stddev makes a whole column NaN -- if so, use (value - mean)
        bad_cols = numpy.any(~numpy.isfinite(X_unbiased), 0)
        for i, bad in enumerate(bad_cols):
            if bad:
                X_unbiased[:, i] = Xin[:, i] - X_avgs[i]

        #unbias y
        y_avg, y_std = yin.mean(0), yin.std(0)
        y_unbiased = (yin - y_avg) / y_std
        #check whether stddev was 0 -- if so, use (value - mean)
        if numpy.any(~numpy.isfinite(y_unbiased)):
            y_unbiased = yin - y_avg

        assert numpy.all(numpy.isfinite(X_unbiased))
        assert numpy.all(numpy.isfinite(y_unbiased))

        return (X_unbiased, y_unbiased, X_avgs, X_stds, y_avg, y_std)
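On the "more idiomatic" point: one option (a sketch under the same assumptions -- column-wise stats, constant columns fall back to value minus mean -- not necessarily what the eventual commit does) is to substitute 1.0 for any zero stddev before dividing, which avoids the NaNs and the per-column loop entirely:

```python
import numpy as np

def unbias(Xin, yin):
    """Column-wise (x - mean) / std, treating zero-stddev columns as (x - mean)."""
    X_avgs, X_stds = Xin.mean(0), Xin.std(0)
    y_avg, y_std = yin.mean(), yin.std()

    # np.where swaps in 1.0 wherever the stddev is 0, so a constant
    # column becomes (value - mean) == all zeros instead of NaN.
    X_unbiased = (Xin - X_avgs) / np.where(X_stds == 0, 1.0, X_stds)
    y_unbiased = (yin - y_avg) / (y_std if y_std != 0 else 1.0)
    return X_unbiased, y_unbiased

# Second column is constant, which crashed the original code.
X = np.array([[0.0, 7.0], [1.0, 7.0], [2.0, 7.0], [3.0, 7.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
Xu, yu = unbias(X, y)
assert np.all(np.isfinite(Xu)) and np.all(np.isfinite(yu))
```

(`unbias` here is a hypothetical standalone helper, not FFX's method signature.)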
jmmcd commented 11 years ago

Fixed with 4556878031a8ad6b92f3be2b82ad0427b3c5370f