scikit-learn-contrib / sklearn-pandas

Pandas integration with sklearn

Tests failing #2

Closed tdhopper closed 10 years ago

tdhopper commented 11 years ago

I'm seeing a bunch of tests fail. I'm on Windows 7 with Python 2.7.5 via Anaconda 1.6.2 (64-bit).

C:\Anaconda\Lib\site-packages>python -m doctest README.rst
**********************************************************************
File "README.rst", line 75, in README.rst
Failed example:
    mapper.fit_transform(data)
Exception raised:
    Traceback (most recent call last):
      File "C:\Anaconda\lib\doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest README.rst[5]>", line 1, in <module>
        mapper.fit_transform(data)
      File "sklearn\base.py", line 408, in fit_transform
        return self.fit(X, **fit_params).transform(X)
      File "sklearn_pandas\__init__.py", line 46, in fit
        transformer.fit(X[columns])
      File "sklearn\preprocessing\label.py", line 241, in fit
        self.classes_ = unique_labels(y)
      File "sklearn\utils\multiclass.py", line 98, in unique_labels
        raise ValueError("Unknown label type")
    ValueError: Unknown label type
**********************************************************************
File "README.rst", line 89, in README.rst
Failed example:
    mapper.transform({'pet': ['cat'], 'children': [5.]})
Exception raised:
    Traceback (most recent call last):
      File "C:\Anaconda\lib\doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest README.rst[6]>", line 1, in <module>
        mapper.transform({'pet': ['cat'], 'children': [5.]})
      File "sklearn_pandas\__init__.py", line 52, in transform
        fea = transformer.transform(X[columns])
      File "sklearn\preprocessing\label.py", line 261, in transform
        self._check_fitted()
      File "sklearn\preprocessing\label.py", line 221, in _check_fitted
        raise ValueError("LabelBinarizer was not fitted yet.")
    ValueError: LabelBinarizer was not fitted yet.
**********************************************************************
File "README.rst", line 103, in README.rst
Failed example:
    mapper2.fit_transform(data)
Expected:
    array([[ 47.62288153],
           [-18.38596516],
           [  1.62873661],
           [-15.3709553 ],
           [-10.36602451],
           [ 16.62846476],
           [ -6.38116123],
           [-15.37597671]])
Got:
    array([[ 47.62195051],
           [-18.39077736],
           [  1.63037658],
           [-15.36917967],
           [-10.36208485],
           [ 16.62998504],
           [ -6.38386526],
           [-15.376405  ]])
**********************************************************************
File "README.rst", line 123, in README.rst
Failed example:
    cross_val_score(pipe, data, data.salary, sklearn.metrics.mean_squared_error)

Exception raised:
    Traceback (most recent call last):
      File "C:\Anaconda\lib\doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest README.rst[10]>", line 1, in <module>
        cross_val_score(pipe, data, data.salary, sklearn.metrics.mean_squared_error)
      File "sklearn_pandas\__init__.py", line 34, in cross_val_score
        return cross_validation.cross_val_score(df, X_indices, *args, **kwargs)
      File "sklearn\cross_validation.py", line 1152, in cross_val_score
        for train, test in cv)
      File "sklearn\externals\joblib\parallel.py", line 517, in __call__
        self.dispatch(function, args, kwargs)
      File "sklearn\externals\joblib\parallel.py", line 312, in dispatch
        job = ImmediateApply(func, args, kwargs)
      File "sklearn\externals\joblib\parallel.py", line 136, in __init__
        self.results = func(*args, **kwargs)
      File "sklearn\cross_validation.py", line 1060, in _cross_val_score
        estimator.fit(X_train, y_train, **fit_params)
      File "sklearn_pandas\__init__.py", line 19, in fit
        self.estimator.fit(self._get_row_subset(x), y)
      File "sklearn\pipeline.py", line 130, in fit
        Xt, fit_params = self._pre_transform(X, y, **fit_params)
      File "sklearn\pipeline.py", line 120, in _pre_transform
        Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
      File "sklearn\base.py", line 411, in fit_transform
        return self.fit(X, y, **fit_params).transform(X)
      File "sklearn_pandas\__init__.py", line 46, in fit
        transformer.fit(X[columns])
      File "sklearn\preprocessing\label.py", line 241, in fit
        self.classes_ = unique_labels(y)
      File "sklearn\utils\multiclass.py", line 98, in unique_labels
        raise ValueError("Unknown label type")
    ValueError: Unknown label type
**********************************************************************
1 items had failures:
   4 of  11 in README.rst
***Test Failed*** 4 failures.
paulgb commented 11 years ago

Thanks for reporting this, Tim.

I suspected that having floating-point calculations in the tests might lead to some issues -- I've changed the tests to match only the first two digits after the decimal point.

Could you please paste the output of pip freeze for me? In the meantime I'll see if I can reproduce with the latest sklearn and pandas.

tdhopper commented 11 years ago

Flask==0.10.1
Jinja2==2.6
MDP==3.3
PIL==1.1.7
PySAL==1.5.0
PySide==1.1.2
PyYAML==3.10
Pygments==1.6
SQLAlchemy==0.8.1
Sphinx==1.1.3
Werkzeug==0.9.1
astropy==0.2.3
atom==0.2.3
binstar-client==0.1.0
biopython==1.61
bitarray==0.8.1
boto==2.9.6
casuarius==1.1
chaco==4.2.1
conda==1.8.1
cubes==0.10.2
distribute==0.6.45
docutils==0.10
enable==4.2.1
enaml==0.7.6
gevent==0.13.8
gevent-websocket==0.3.6
gevent-zeromq==0.2.2
greenlet==0.4.1
grin==1.2.1
h5py==2.1.1
ipython==0.13.2
itsdangerous==0.21
keyring==1.4
llvmmath==0.1
llvmpy==0.11.3
lxml==3.2.1
matplotlib==1.2.1
menuinst==1.0.1
meta==development
moves==0.1
networkx==1.7
nltk==2.0.4
nose==1.3.0
numba==0.9.0
numexpr==2.0.1
numpy==1.7.1
pandas==0.12.0
pep8==1.4.5
ply==3.4
praw==2.1.4
psutil==0.7.1
py==1.4.14
pycosat==0.6.0
pycparser==2.09.1
pycrypto==2.6
pyface==4.2.1
pyflakes==0.7.2
pyparsing==1.5.6
pyreadline==2.0-dev1
pytest==2.3.5
python-dateutil==2.1
pytz==2013b
pywin32==218.4
pyzmq==2.2.0.1
requests==1.2.3
rope==0.9.4
scikit-image==0.8.2
scikit-learn==0.14.1
scipy==0.12.0
simplejson==3.3.0
six==1.3.0
sklearn-pandas==0.0.3
spyder==2.2.0
statsmodels==0.4.3
sympy==0.7.2
tables==2.4.0
tornado==3.1
traits==4.2.1
traitsui==4.2.1
tweepy==2.1
update-checker==0.5
vincent==0.2
wsgiref==0.1.2
xlrd==0.9.2
xlwt==0.7.5
paulgb commented 11 years ago

I'm able to reproduce this. It seems the interface of sklearn has changed. The following code fails with scikit-learn 0.14.1 but works with scikit-learn 0.13.1:

import pandas as pd
import numpy as np
import sklearn.preprocessing

data = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})

lb = sklearn.preprocessing.LabelBinarizer()
lb.fit(data.pet)
print lb.transform(data.pet)

I need to investigate further to see if this is a sklearn bug or if the tests need to be adjusted appropriately.

dolaameng commented 11 years ago

I second that it is a "bug" in sklearn 0.14. In sklearn/utils/multiclass.py, the function type_of_target contains (around line 293):

if y.ndim > 2 or y.dtype == object:
        return 'unknown'

It will return 'unknown' for any np.array-like of strings. So it works fine with ['cat', 'dog', 'fish'], but no longer with np.asarray(['cat', 'dog', 'fish']).

paulgb commented 11 years ago

Good find. type_of_target will actually return 'multiclass' for np.array(['cat', 'dog', 'fish']), because that array has a fixed-length string dtype ('|S4'). But calling as_matrix() on the pandas DataFrame gives a matrix with dtype 'object' because the columns have different types (or maybe this is always the pandas behaviour, I haven't checked).

Still, it seems a little weird for the behaviour to change based on the data representation of two arrays that numpy considers equivalent.

I'm going to take a better look at the sklearn code to see if there's a good reason behind this, and if I can't find one I'll file a bug report on that project.

In [38]: p = np.array(['a', 'b', 'c'])

In [39]: q = np.array(['a', 'b', 'c'], dtype='object')

In [40]: np.array_equal(p, q)
Out[40]: True

In [41]: type_of_target(p)
Out[41]: 'multiclass'

In [42]: type_of_target(q)
Out[42]: 'unknown'
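
For reference, the pandas side can be checked directly; a small sketch, assuming the DataFrame from the reproduction above (using .values, which should be equivalent to as_matrix() here):

import pandas as pd

data = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish'],
                     'children': [4., 6, 3, 3]})

# The whole-table matrix has dtype object because the columns have mixed types.
print(data.values.dtype)           # object
# A string column on its own is also stored with dtype object by pandas.
print(data['pet'].values.dtype)    # object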
paulgb commented 11 years ago

It seems to be a deeper disconnect between sklearn and pandas than I'd hoped: pandas wants string arrays to have the "object" dtype, while sklearn expects them to have the appropriate numpy string dtype. I've found a few hacky workarounds, but none that I feel good about publishing. I'm continuing to dig into the sklearn code to see if there's a better way.

ogrisel commented 11 years ago

I think we want to support dtype='object' for arrays of variable-length strings in scikit-learn as well: type_of_target(q) == 'unknown' is probably a bug.

ogrisel commented 11 years ago

I wish numpy had a dtype for variable-length strings...

paulgb commented 11 years ago

Yes, it's a shame that an array of sequences looks the same (from a dtype perspective) as an array of variable-length strings.

The least hacky workaround I can think of without changing sklearn is to convert the arrays to fixed-length strings (np.array(X, dtype='|S')) before sending them off to sklearn, but it would be great to fix sklearn instead. It looks to me like it would just be a matter of looping over the array in sklearn to check whether the elements are all str/unicode objects -- thoughts?
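
As a rough sketch of that workaround, applied outside sklearn_pandas and assuming the DataFrame from the reproduction above:

import numpy as np
import pandas as pd
import sklearn.preprocessing

data = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish']})

# data.pet.values has dtype object, which sklearn 0.14 classifies as 'unknown';
# building a fixed-length string array ('|S4' here) sidesteps that.
pet = np.array(data.pet.tolist(), dtype='|S')

lb = sklearn.preprocessing.LabelBinarizer()
lb.fit(pet)
print(lb.transform(pet))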

ogrisel commented 11 years ago

+1 for the temporary workaround in sklearn_pandas to restore sklearn 0.14 compatibility; I will create an issue for the sklearn project.

ogrisel commented 10 years ago

@paulgb wouldn't converting to a list of str or unicode objects before sending to sklearn work even better?

paulgb commented 10 years ago

That would work, but the issue is more with knowing when to convert to strings without having to scan the entire table.

I experimented a little more with pandas internals, and it seems that object is used as the table dtype any time a table has heterogeneous column types, but as a column dtype only when the content is strings. So my workaround is to convert the matrix dtype to string if and only if every column in the mapping has the object dtype.
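
Roughly, the idea looks like this (an illustrative sketch with a hypothetical helper name, not the actual sklearn_pandas code):

import numpy as np

def _extract_columns(df, columns):
    # Hypothetical helper sketching the 0.0.4 approach: pull out the selected
    # columns and cast to a fixed-length string dtype only when every selected
    # column has the object dtype (which, per the above, means string content).
    if not isinstance(columns, list):
        columns = [columns]
    values = df[columns].values
    if all(df[c].dtype == object for c in columns):
        values = np.array(values.tolist(), dtype='|S')
    return values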

I've updated the code here and on PyPI to version 0.0.4, which includes this fix. I'd like to wait for some feedback on whether it solves the problem before closing this issue.

paulgb commented 10 years ago

Closing this optimistically, based on the tests passing and things working for Tim (https://twitter.com/tdhopper/status/381088588739272704).

tdhopper commented 10 years ago

First time one of my tweets has ever been mentioned as a reason for closing a bug report.

linehammer commented 3 years ago

There is a mismatch between what you can pass and what you are actually passing. This means that scikit-learn cannot recognize what type of problem you want to solve (regression or classification). The "Unknown label type: 'unknown'" error is raised because of the y values you pass to scikit-learn.

Solutions:
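
For example, a minimal sketch of a common fix, assuming y holds string class labels (the names here are illustrative, not from the thread above):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

y = pd.Series(['cat', 'dog', 'dog', 'cat'], dtype=object)

# Encoding string labels as integers gives scikit-learn a label type it
# recognizes; for numeric targets stored as objects, y.astype(int) or
# y.astype(float) serves the same purpose.
y_encoded = LabelEncoder().fit_transform(y)
print(y_encoded)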