RandomUnderSampler should permit datasets with NaN values

InterferencePattern commented 4 years ago

Description

RandomUnderSampler should permit undersampling a dataset containing NaNs. I don't see a reason this should be blocked by the _check_X_y() function.
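To illustrate why NaNs need not block random undersampling, here is a minimal plain-Python sketch. It is not imbalanced-learn's implementation; the function name `random_undersample` and the toy data are hypothetical. The point it shows is that selecting rows at random only needs the labels in y, so the feature values in X are never inspected.

```python
import random
from collections import defaultdict


def random_undersample(X, y, n_per_class, seed=0):
    """Keep n_per_class[label] randomly chosen rows per class, without replacement.

    Hypothetical helper for illustration only; not the imbalanced-learn API.
    """
    rng = random.Random(seed)

    # Group row indices by class label; only y is read here.
    rows_by_class = defaultdict(list)
    for index, label in enumerate(y):
        rows_by_class[label].append(index)

    kept = []
    for label, indices in rows_by_class.items():
        # Labels missing from n_per_class keep all of their rows.
        n_keep = n_per_class.get(label, len(indices))
        kept.extend(rng.sample(indices, n_keep))
    kept.sort()  # preserve the original row order

    # Rows are passed through untouched, NaN values included.
    return [X[i] for i in kept], [y[i] for i in kept]


nan = float("nan")
X = [[1.0, nan], [2.0, 0.5], [nan, 1.5], [4.0, 2.5], [5.0, 3.5], [6.0, nan]]
y = [0, 0, 0, 0, 1, 1]

X_res, y_res = random_undersample(X, y, n_per_class={0: 2, 1: 2})
print(y_res)
```

The `n_per_class` dict plays the same role as the `sampling_strategy={0: num_negs, 1: num_pos}` argument visible in the traceback below.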

Steps/Code to Reproduce

Run make_imbalance() with a dataset containing NaN.

Expected Results

The function should return the resampled X and y datasets.

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-c5130ea920d6> in <module>
      7                                             df_y,
      8                                             sampling_strategy={0: num_negs, 1: num_pos},
----> 9                                             random_state=20)
     10 # Convert back to df
     11 X_unbalanced = pd.DataFrame(X_unbalanced, columns=df_X.columns).astype(df_X.dtypes)

(redacted)/make_imbalance.py in make_imbalance(X, y, sampling_strategy, ratio, random_state, verbose, **kwargs)
    107         replacement=False,
    108         random_state=random_state)
--> 109     X_resampled, y_resampled = rus.fit_resample(X, y)
    110     if verbose:
    111         print('Make the dataset imbalanced: %s', Counter(y_resampled))

(redacted)/lib/python3.6/site-packages/imblearn/base.py in fit_resample(self, X, y)
     77 
     78         check_classification_targets(y)
---> 79         X, y, binarize_y = self._check_X_y(X, y)
     80 
     81         self.sampling_strategy_ = check_sampling_strategy(

(redacted)/lib/python3.6/site-packages/imblearn/under_sampling/_prototype_selection/_random_under_sampler.py in _check_X_y(X, y)
     99     def _check_X_y(X, y):
    100         y, binarize_y = check_target_type(y, indicate_one_vs_all=True)
--> 101         X = check_array(X, accept_sparse=['csr', 'csc'], dtype=None)
    102         y = check_array(y, accept_sparse=['csr', 'csc'], dtype=None,
    103                         ensure_2d=False)

(redacted)/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    576         if force_all_finite:
    577             _assert_all_finite(array,
--> 578                                allow_nan=force_all_finite == 'allow-nan')
    579 
    580     if ensure_min_samples > 0:

(redacted)/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     63     elif X.dtype == np.dtype('object') and not allow_nan:
     64         if _object_dtype_isnan(X).any():
---> 65             raise ValueError("Input contains NaN")
     66 
     67

ValueError: Input contains NaN

Versions

Python 3.6.9 (default, Sep 11 2019, 16:40:19) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
NumPy 1.18.1
SciPy 1.4.1
Scikit-Learn 0.22.2.post1
Imbalanced-Learn 0.5.0

chkoar commented 4 years ago

Since estimator tags have been merged into scikit-learn master, #167 could be continued.

glemaitre commented 4 years ago

This has been solved in master.

davidfstein commented 4 years ago

This is working on master; however, the ratio argument appears to have been removed. Is there any way to specify the sampled ratio now?

hayesall commented 4 years ago

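@davidfstein I think the sampling_strategy parameter is what you're looking for. See the float case in the RandomUnderSampler docs (https://imbalanced-learn.org/stable/generated/imblearn.under_sampling.RandomUnderSampler.html), or this sampling_strategy tutorial (https://imbalanced-learn.org/stable/auto_examples/plot_sampling_strategy_usage.html).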

glemaitre commented 4 years ago

Just adding that we deprecated ratio because it was not an intuitive name once we started to extend the way resampling should work. However, as mentioned in the previous message, sampling_strategy supports the same use case as ratio.
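As a rough sketch of how a float sampling_strategy can stand in for the old ratio argument: for under-sampling a binary problem, my understanding of the docs is that the float is the desired ratio of minority to majority samples after resampling. The helper names `majority_size_for_ratio` and `undersample_to_ratio` below are hypothetical, and the arithmetic is an assumption about how the target count is derived, so check it against the version you have installed.

```python
from collections import Counter


def majority_size_for_ratio(y, ratio):
    """Majority-class size that gives n_minority / n_majority of about `ratio`.

    Hypothetical helper; the rounding is an assumption, not a documented guarantee.
    """
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("a float ratio is only meaningful for two classes")
    n_minority = min(counts.values())
    return int(n_minority / ratio)


def undersample_to_ratio(X, y, ratio, random_state=0):
    """Under-sample the majority class with a float sampling_strategy."""
    # Imported lazily so the helper above can be used without imbalanced-learn.
    from imblearn.under_sampling import RandomUnderSampler

    sampler = RandomUnderSampler(sampling_strategy=ratio, random_state=random_state)
    return sampler.fit_resample(X, y)


y = [0] * 90 + [1] * 10
print(majority_size_for_ratio(y, 0.5))
```

When exact per-class counts are needed, a dict such as `sampling_strategy={0: num_negs, 1: num_pos}` (as in the original report) is the more explicit choice.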


scikit-learn-contrib / imbalanced-learn