hyperopt / hyperopt-sklearn

Hyper-parameter optimization for sklearn
hyperopt.github.io/hyperopt-sklearn

Zero-dimensional arrays cannot be concatenated #132

Open ptynecki opened 5 years ago

ptynecki commented 5 years ago

Hey,

I received a ValueError: zero-dimensional arrays cannot be concatenated exception when I tried to use the .fit() method of HyperoptEstimator with SVM.

RANDOM_STATE = 42

# Train/Test split
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=RANDOM_STATE)

for train_index, test_index in split.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

X_train.shape, X_test.shape, y_train.shape, y_test.shape

It returns ((186, 4096), (47, 4096), (186,), (47,)) as expected.

estim = HyperoptEstimator(classifier=svc('mySVC'), seed=RANDOM_STATE)
estim.fit(X_train, y_train, n_folds=10, cv_shuffle=True, random_state=RANDOM_STATE)
estim.score(X_test, y_test)

X_train, y_train, X_test, and y_test are of type scipy.sparse.csr.csr_matrix.

Exception details:

    618             assert fn_rval[0] in ('raise', 'return')
    619             if fn_rval[0] == 'raise':
--> 620                 raise fn_rval[1]
    621 
    622             # -- remove potentially large objects from the rval

ValueError: zero-dimensional arrays cannot be concatenated
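For context, the error message itself comes straight from NumPy: np.concatenate refuses inputs with zero dimensions. A minimal reproduction of just that message:

```python
import numpy as np

# np.concatenate requires inputs of at least one dimension;
# 0-d arrays (plain scalars wrapped in np.array) trigger this ValueError.
try:
    np.concatenate([np.array(1), np.array(2)])
    raise AssertionError("expected ValueError")
except ValueError as e:
    assert "zero-dimensional" in str(e)
```

So somewhere inside the fit loop, hpsklearn ends up handing 0-d pieces to a concatenate call.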
stevacca commented 5 years ago

Hey, did you solve the issue? I'm having the same problem, working with string data.

anjani-dhrangadhariya commented 5 years ago

I face the same issue with a text classification task.

@bjkomer and @adodge seem to have fixed it here, but the issue has reappeared.

adodge commented 5 years ago

I'm encountering a small stack of issues while reproducing this. I'll summarize them here, but they might warrant separate tickets.

hpsklearn pypi versioning

It seems like the version of hpsklearn you get when you pip install hpsklearn is fairly old, compared to the master branch here. It looks like there are two versions in pypi (https://pypi.org/project/hpsklearn/#history): 0.0.3 and 0.1.0. Unfortunately, 0.0.3 is far newer than 0.1.0. Based on the version you report, and the error you get, I think this is what's happened.
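pip resolves "newest" purely by version number, never by upload date, so a release tagged 0.1.0 always shadows 0.0.3 even when the 0.0.3 code is more recent. A quick sketch of the comparison pip effectively performs (simplified; real resolution follows PEP 440):

```python
def parse(version: str) -> tuple:
    """Parse a simple dotted version string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))

# 0.1.0 sorts above 0.0.3, so pip installs it despite its older code.
assert parse("0.1.0") > parse("0.0.3")

# Publishing current master as 0.1.1 would sort above both,
# fixing the default for new installs and showing up as an upgrade.
assert parse("0.1.1") > parse("0.1.0")
```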

I'll also note that the 0.0.3 release in github (https://github.com/hyperopt/hyperopt-sklearn/releases) is not the same as the 0.0.3 release in pypi. It might be the same as the 0.1.0 version in pypi, based on the dates. This should be aligned, but I don't know the consequences of altering these. (Possibly nothing, or is it automatically linked up with pypi somehow?)

I think the easiest way to fix this would be to push the current master branch of hpsklearn to pypi as version 0.1.1 (and subsequent newer versions with higher version numbers), so it shows up as an upgrade for people currently running 0.1.0 and is the default "newest" version for new installs.

I note that the README for hpsklearn says that only installing from github is supported. I still think we should probably do this bit of housekeeping, to avoid problems in the future.

string encoding error when installing from github

git clone https://github.com/hyperopt/hyperopt-sklearn.git
(cd hyperopt-sklearn && pip3 install -e .)
ERROR: Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/data/notebooks/notebooks/hyperopt-sklearn/setup.py", line 24, in <module>
        long_description = open('README.md').read(),
      File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1819: ordinal not in range(128)

This might be a problem in my setup. I'll look at this later and possibly write a ticket.

The workaround that works for me is to wipe the README.md file before installing. Obviously not a good solution.

git clone https://github.com/hyperopt/hyperopt-sklearn.git
echo "" > hyperopt-sklearn/README.md
(cd hyperopt-sklearn && pip3 install -e .)
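The underlying cause is that setup.py reads README.md with the default codec, which under an ASCII locale chokes on the first non-ASCII byte (0xe2 is the UTF-8 lead byte of characters like "…"). A hedged sketch of a more durable fix, assuming the setup.py line shown in the traceback, is to pass an explicit encoding; demonstrated here on a stand-in file:

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    readme = os.path.join(tmp, "README.md")
    with io.open(readme, "w", encoding="utf-8") as f:
        f.write("hyperopt-sklearn …\n")  # non-ASCII, like the real README

    # Reading with the ASCII codec reproduces the install error.
    try:
        io.open(readme, encoding="ascii").read()
        raise AssertionError("expected UnicodeDecodeError")
    except UnicodeDecodeError:
        pass

    # Passing encoding='utf-8' works regardless of locale; setup.py
    # could do the same in its long_description line.
    long_description = io.open(readme, encoding="utf-8").read()
    assert "hyperopt-sklearn" in long_description
```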

PCA doesn't support sparse

After installing hpsklearn directly from github you get a version that includes the fix to #105. The test code then fails later, when the search tries a PCA preprocessor.

I suspect this is a variation of the problem fixed in #105. I'll take a look. You can disable the preprocessor search with:

estim = HyperoptEstimator(classifier=svc('mySVC'),
                          preprocessing=[],
                          seed=RANDOM_STATE)

...which then allows the fitting to complete.

I'll follow up with a fix to the PCA problem.

adodge commented 5 years ago

Here's a notebook illustrating my process for installing hpsklearn from github (with the hack to fix the string encoding problem) and disabling preprocessing to avoid the PCA issue for now:

https://github.com/adodge/notebooks/blob/master/hpsklearn_132.ipynb

adodge commented 5 years ago

So, the problem with PCA isn't on the hyperopt side, it's just that sklearn's PCA doesn't support sparse inputs.
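sklearn's documented sparse-friendly alternative to PCA is TruncatedSVD, which works on CSR input directly without densifying (TruncatedSVD is not part of the PR below, just an illustration of the sparse-input point):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# A small sparse matrix, shaped like the CSR features in this issue.
rng = np.random.RandomState(42)
X = sparse.random(50, 100, density=0.05, format="csr", random_state=rng)

# TruncatedSVD accepts sparse input directly.
svd = TruncatedSVD(n_components=10, random_state=42)
X_reduced = svd.fit_transform(X)
assert X_reduced.shape == (50, 10)
```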

I've made a pull request here https://github.com/hyperopt/hyperopt-sklearn/pull/137 which adds a function any_sparse_preprocessing that includes only the components that support sparse input. You can also just copy it into your own project and use it directly.

from hyperopt import hp
from hpsklearn import HyperoptEstimator, svc, standard_scaler, normalizer

def any_sparse_preprocessing(name):
    """
    Preprocessors that support sparse input.
    * missing pca
    * missing min_max_scaler
    * missing one_hot_encoder (because it's also not in any_preprocessing)
    * standard_scaler has with_mean=False, at the recommendation of the sklearn
      error message
    """
    return hp.choice('%s' % name, [
        [standard_scaler(name + '.standard_scaler', with_mean=False)],
        [normalizer(name + '.normalizer')],
        []
    ])

estim = HyperoptEstimator(classifier=svc('mySVC'),
                          preprocessing=any_sparse_preprocessing('preproc'),
                          seed=RANDOM_STATE)
anjani-dhrangadhariya commented 5 years ago

@adodge Thank you for the workaround. I was able to install the package using this notebook (https://github.com/adodge/notebooks/blob/master/hpsklearn_132.ipynb), but still get the same error (with the same example as you have provided).

adodge commented 5 years ago

Hmm... Odd. Can you verify that python is actually using the version you installed? I've had situations where multiple versions of something were co-existing and the wrong one was getting imported. (And there's some weird versioning stuff going on here, so I wouldn't be surprised if something got confused.)

import hpsklearn
print(hpsklearn.__file__)

If you did the installation from the notebook (with "git clone [...] ; pip install -e [...]"), this should show a path in the git repo you checked out. For me it's /data/notebooks/notebooks/hyperopt-sklearn/hpsklearn/__init__.py, because I ran the install commands from /data/notebooks/notebooks/.

If you got the current version from github, the "estimator.py" in that same directory (for me /data/notebooks/notebooks/hyperopt-sklearn/hpsklearn/estimator.py) should have a function safe_concatenate. This is part of the fix to issue #105, which is in github but not in the version distributed on pypi. (https://github.com/hyperopt/hyperopt-sklearn/blob/master/hpsklearn/estimator.py#L117-L124)
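For intuition, the idea behind safe_concatenate is roughly to dispatch on sparseness instead of calling np.concatenate unconditionally. This is a sketch of the concept under that assumption, not the real code (see the link above for that):

```python
import numpy as np
from scipy import sparse

def sketch_safe_concatenate(blocks):
    """Concatenate row blocks, taking the sparse-aware path when needed.

    A rough sketch of the idea behind hpsklearn's safe_concatenate,
    not a copy of the actual implementation.
    """
    if any(sparse.issparse(b) for b in blocks):
        return sparse.vstack(blocks)
    return np.concatenate(blocks)

# Dense blocks go through np.concatenate as before.
dense = sketch_safe_concatenate([np.ones((2, 3)), np.zeros((1, 3))])
assert dense.shape == (3, 3)

# Sparse blocks are stacked without densifying.
sp = sketch_safe_concatenate([sparse.eye(2, format="csr"),
                              sparse.eye(2, format="csr")])
assert sparse.issparse(sp) and sp.shape == (4, 2)
```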

Here's a check for this:

import os
import hpsklearn

# Locate estimator.py next to the installed hpsklearn package.
fn = os.path.join(os.path.split(hpsklearn.__file__)[0], 'estimator.py')
X = open(fn).read()
assert 'safe_concatenate' in X, "Installed version doesn't have the fix"

If it turns out python is looking at a different version, I think the thing to do would be to uninstall it. Something like pip uninstall hpsklearn might work, or you might need to go find the offending directory and delete it. (Or maybe delete your virtualenv and start again, if you're using that.)

If it is pointing at the right version, or you run into trouble with this, can you please post the specific code you're running and error you're getting?

Thank you!