paris-saclay-cds / ramp-workflow

Toolkit for building predictive workflows on top of pydata (pandas, scikit-learn, pytorch, keras, etc.).
https://paris-saclay-cds.github.io/ramp-docs/
BSD 3-Clause "New" or "Revised" License

drug spectra is failing with new sklearn #152

Closed kegl closed 5 years ago

kegl commented 5 years ago

Please check this. It's weird because I thought we had more than 100 components.

_____ test_submission[/Users/kegl/Research/RAMP/ramp-workflow/rampwf/tests/kits/drug_spectra] ______

path_kit = '/Users/kegl/Research/RAMP/ramp-workflow/rampwf/tests/kits/drug_spectra'

    @pytest.mark.parametrize(
        "path_kit",
        _generate_grid_path_kits()
    )
    def test_submission(path_kit):
        submissions = sorted(glob.glob(os.path.join(path_kit, 'submissions', '*')))
        for sub in submissions:
            # FIXME: to be removed once el-nino tests is fixed.
            if 'el_nino' in sub:
                pytest.xfail('el-nino is failing due to xarray.')
            else:
                assert_submission(
                    ramp_kit_dir=path_kit,
                    ramp_data_dir=path_kit,
                    submission=os.path.basename(sub), is_pickle=True,
>                   save_y_preds=False, retrain=True)

rampwf/tests/test_kits.py:57: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
rampwf/utils/testing.py:114: in assert_submission
    fold, ramp_data_dir)
rampwf/utils/submission.py:154: in run_submission_on_cv_fold
    fold_output_path, train_is=train_is)
rampwf/utils/submission.py:91: in train_test_submission
    module_path, X_train, y_train, train_is=train_is)
rampwf/workflows/drug_spectra.py:25: in train_submission
    train_submission(module_path, X_train_df, y_train_clf_array)
rampwf/workflows/feature_extractor_classifier.py:21: in train_submission
    module_path, X_train_array, y_array[train_is])
rampwf/workflows/classifier.py:14: in train_submission
    clf.fit(X_array[train_is], y_array[train_is])
rampwf/tests/kits/drug_spectra/submissions/starting_kit/classifier.py:19: in fit
    self.clf.fit(X, y)
../../../anaconda/lib/python2.7/site-packages/sklearn/pipeline.py:265: in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
../../../anaconda/lib/python2.7/site-packages/sklearn/pipeline.py:230: in _fit
    **fit_params_steps[name])
../../../anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py:329: in __call__
    return self.func(*args, **kwargs)
../../../anaconda/lib/python2.7/site-packages/sklearn/pipeline.py:614: in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
../../../anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py:359: in fit_transform
    U, S, V = self._fit(X)
../../../anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py:406: in _fit
    return self._fit_full(X, n_components)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = PCA(copy=True, iterated_power='auto', n_components=100, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
X = array([[0.0143867, 0.0143828, 0.0141687, ..., 0.0180102, 0.0178433,
        0.....0155589, 0.0155054, 0.0153493, ..., 0.0174623, 0.0174794,
        0.0172912]])
n_components = 100

    def _fit_full(self, X, n_components):
        """Fit the model by computing full SVD on X"""
        n_samples, n_features = X.shape

        if n_components == 'mle':
            if n_samples < n_features:
                raise ValueError("n_components='mle' is only supported "
                                 "if n_samples >= n_features")
        elif not 0 <= n_components <= min(n_samples, n_features):
            raise ValueError("n_components=%r must be between 0 and "
                             "min(n_samples, n_features)=%r with "
                             "svd_solver='full'"
>                            % (n_components, min(n_samples, n_features)))
E           ValueError: n_components=100 must be between 0 and min(n_samples, n_features)=80 with svd_solver='full'

../../../anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py:425: ValueError

kegl commented 5 years ago

The Travis build of the ramp kit itself passes (https://travis-ci.org/ramp-kits/drug_spectra/jobs/366004310), so this may be a data selection issue in https://travis-ci.org/paris-saclay-cds/rampwf-kits-test-master

jorisvandenbossche commented 5 years ago

Do you have this locally, or is it on travis somewhere? And what version of sklearn?

kegl commented 5 years ago

It's the main test: https://travis-ci.org/paris-saclay-cds/ramp-workflow

jorisvandenbossche commented 5 years ago

Sorry, I now see that your title says "failing with new sklearn".

kegl commented 5 years ago

I'm rerunning drug spectra in https://travis-ci.org/paris-saclay-cds/rampwf-kits-test-master, but I think this is fine.

kegl commented 5 years ago

I'm guessing that travis pulls the latest sklearn that just got released. I upgraded with pip (-U) on my Mac, and pytest fails with the same error (in rampwf).
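
To confirm which release is actually installed in a given environment, something like the following minimal sketch may help. It assumes Python 3.8 or later (for `importlib.metadata`, which the Python 2.7 environment in the traceback does not have), and `installed_version` is a hypothetical helper name, not part of rampwf or sklearn.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


# Prints the installed scikit-learn version, or None if it is absent.
print("scikit-learn:", installed_version("scikit-learn"))
```

Running this both locally and in the CI job would show whether the two environments resolve to the same sklearn release.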

jorisvandenbossche commented 5 years ago

> I'm guessing that travis pulls the latest sklearn that just got released

Yes, that is the case I think.

From http://scikit-learn.org/dev/whats_new.html#sklearn-decomposition

> In decomposition.PCA selecting a n_components parameter greater than the number of samples now raises an error. Similarly, the n_components=None case now selects the minimum of n_samples and n_features

So we may already have had that problem, but sklearn has now started raising an error for it.
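
One possible way to stay within the bound reported in the traceback (n_components must be at most min(n_samples, n_features) with svd_solver='full') is to cap the requested value against the shape of the training data. The sketch below is only an illustration: `clamp_n_components` is a hypothetical helper, the shapes are made up, and it is not necessarily what the fix in this repository does.

```python
def clamp_n_components(requested, n_samples, n_features):
    """Return the largest valid component count not exceeding `requested`.

    Mirrors the bound from the traceback for svd_solver='full':
    0 <= n_components <= min(n_samples, n_features).
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    return min(requested, n_samples, n_features)


# Hypothetical shapes: a training fold with 80 rows and 1000 columns.
n_components = clamp_n_components(100, n_samples=80, n_features=1000)
print(n_components)  # 80

# With scikit-learn installed, the capped value could then be passed on,
# e.g. PCA(n_components=n_components), computed inside fit() from X.shape
# so that it follows the size of each cross-validation fold.
```

Because the workflow fits on a subset of rows per fold, the cap would need to be computed at fit time rather than hard-coded when the pipeline is constructed.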

kegl commented 5 years ago

Maybe. In that case, please just modify the command in the main test.

jorisvandenbossche commented 5 years ago

See https://github.com/paris-saclay-cds/ramp-workflow/pull/153