kearnz / autoimpute

Python package for Imputation Methods
MIT License

ValueError: Found array with 0 sample(s) (shape=(0, 1904)) while a minimum of 1 is required. #65

Closed: apavlo89 closed this issue 3 years ago

apavlo89 commented 3 years ago

Hello everyone,

I'm trying to run autoimpute, but as far as I can tell the error is saying that I have an empty column or row. I checked, and that is not the case. Any idea what might be causing this? Here is the code I use:

from autoimpute.imputations import SingleImputer, MultipleImputer, MiceImputer

import pandas as pd

dataset = pd.read_csv('C:/Users/apavl/Dropbox/A+T/REEG/AQ_Database - test.csv')
dataset_list = list(dataset.columns)

imp = MiceImputer()
imp.fit_transform(dataset)

kearnz commented 3 years ago

Hi @apavlo89

Can you please post the full error traceback you see when you try to execute this code? It's hard to say from the code above; it could be any number of things.

Thanks, Joe

apavlo89 commented 3 years ago
runfile('C:/Users/apavl/Dropbox/A+T/REEG/miceforest.py', wdir='C:/Users/apavl/Dropbox/A+T/REEG')
WARNING (theano.configdefaults): g++ not available, if using conda: `conda install m2w64-toolchain`
C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\theano\configdefaults.py:697: UserWarning: DeprecationWarning: there is no c++ compiler.This is deprecated and with Theano 0.11 a c++ compiler will be mandatory
  "DeprecationWarning: there is no c++ compiler."
WARNING (theano.configdefaults): g++ not detected ! Theano will be unable to execute optimized C-implementations (for both CPU and GPU) and will default to Python implementations. Performance will be severely degraded. To remove this warning, set Theano flags cxx to an empty string.
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\sklearn\utils\validation.py:70: FutureWarning: Pass fit_intercept=True, normalize=False, copy_X=True, n_jobs=None as keyword args. From version 0.25 passing these as positional arguments will result in an error
  FutureWarning)
Traceback (most recent call last):

  File "C:\Users\apavl\Dropbox\A+T\REEG\miceforest.py", line 14, in <module>
    imp.fit_transform(dataset)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\imputations\dataframe\multiple_imputer.py", line 231, in fit_transform
    return self.fit(X, y).transform(X)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\utils\checks.py", line 61, in wrapper
    return func(d, *args, **kwargs)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\utils\checks.py", line 126, in wrapper
    return func(d, *args, **kwargs)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\utils\checks.py", line 173, in wrapper
    return func(d, *args, **kwargs)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\imputations\dataframe\multiple_imputer.py", line 188, in fit
    imputer.fit(X)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\utils\checks.py", line 61, in wrapper
    return func(d, *args, **kwargs)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\utils\checks.py", line 126, in wrapper
    return func(d, *args, **kwargs)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\utils\checks.py", line 173, in wrapper
    return func(d, *args, **kwargs)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\imputations\dataframe\single_imputer.py", line 190, in fit
    imputer.fit(x_, y_)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\imputations\series\default.py", line 395, in fit
    super().fit(X, y)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\imputations\series\default.py", line 186, in fit
    stats = {"param": self.num_imputer.fit(X, y),

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\autoimpute\imputations\series\pmm.py", line 117, in fit
    y_pred = self.lm.fit(X, y).predict(X)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\sklearn\linear_model\_base.py", line 506, in fit
    y_numeric=True, multi_output=True)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\sklearn\base.py", line 432, in _validate_data
    X, y = check_X_y(X, y, **check_params)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
    return f(**kwargs)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\sklearn\utils\validation.py", line 802, in check_X_y
    estimator=estimator)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
    return f(**kwargs)

  File "C:\Users\apavl\anaconda3\envs\neuroscience\lib\site-packages\sklearn\utils\validation.py", line 653, in check_array
    context))

ValueError: Found array with 0 sample(s) (shape=(0, 1895)) while a minimum of 1 is required.

apavlo89 commented 3 years ago

I'm uploading the database with some features removed so you can see for yourself. Thank you so much for your swift response!

database.zip

kearnz commented 3 years ago

@apavlo89

This is a bit tricky to explain. To see what's going on, try the same code using only the first four columns of your dataset, then five, then six, and so on; you'll find that those all work just fine.
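
Roughly something like this (untested sketch, assuming dataset is loaded from your CSV as in your snippet):

from autoimpute.imputations import MiceImputer

# grow the column subset until the imputer breaks; you only need to go
# far enough to see where it stops working
for n_cols in range(4, len(dataset.columns) + 1):
    subset = dataset.iloc[:, :n_cols]
    try:
        MiceImputer().fit_transform(subset)
        print(f"{n_cols} columns: ok")
    except ValueError as err:
        print(f"{n_cols} columns: failed -> {err}")
        break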

So why doesn't the full dataset work? When you call fit_transform with all default values, a fair amount happens under the hood. In your example, you don't specify a strategy, so the default becomes pmm (predictive mean matching). Part of pmm involves fitting linear regressions on subsets of each column, using some or all of the other columns as predictors, provided those columns meet specific criteria.

Now, the error you see is actually an sklearn error, thrown when a linear regression's X or y input is malformed. While no single column in your dataset is malformed on its own, some combination of columns, after filtering, is in fact malformed, hence the error. I'd bet it results from this line: https://github.com/kearnz/autoimpute/blob/master/autoimpute/imputations/dataframe/single_imputer.py#L196
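
To make that concrete (not exactly what autoimpute does internally, just an illustration using the dataset DataFrame from your snippet):

# with ~1900 columns and fewer than 40 rows, requiring complete cases
# across a large set of predictor columns can easily leave zero rows,
# which is the empty (0, n_features) array sklearn complains about
complete_rows = dataset.dropna(how="any")
print(dataset.shape, "->", complete_rows.shape)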

I'm not surprised you ran into this issue. You're using >1900 columns with <40 rows. Try refining your sample. Do you have more data? Can you use fewer columns? I doubt you need 1899 columns to predict the missing values in each other column, so you'll have to play around with the MiceImputer inputs.
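
For example (rough sketch; the column names below are placeholders for a handful of well-populated columns in your data):

from autoimpute.imputations import MiceImputer

# impute a modest set of columns, keeping the column count well below the
# row count; the imputer also takes a predictors argument if you want
# finer control over which columns feed each regression
cols = ["col_a", "col_b", "col_c", "col_d"]
imp = MiceImputer(n=3, return_list=True, seed=101)
imputations = imp.fit_transform(dataset[cols])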

In the meantime, I can look into making this error more informative, but the error thrown is happening for a reason.

Let me know if that makes sense.

apavlo89 commented 3 years ago

Yes, reducing the number of features makes it work, thank you. My problem now is: how do I get back the dataset with the missing values filled in using MICE?

When I use MultipleImputer with mean, it works: I get back a dataset variable with imputed values in place of the missing values.

# MCAR mean imputation
from autoimpute.imputations import MultipleImputer

# create the mean imputer
mi_mean_mcar = MultipleImputer(
    strategy="mean", n=5, return_list=True, seed=101
)

# print the mean imputer to console
print(mi_mean_mcar)

# perform mean imputation (mcar is the DataFrame being imputed)
imp_mean_mcar = mi_mean_mcar.fit_transform(mcar)

But just filling missing values with the mean is not what I'm after. Anyone can do that easily with sklearn's SimpleImputer.

kearnz commented 3 years ago

Hi @apavlo89,

Mean imputation is one of many strategies that autoimpute offers. To understand the different strategies, I'd suggest reading the docs here.

If you're new to imputation, it's best to leave the strategy blank; the MiceImputer (or MultipleImputer) will then pick up the default strategy. For numerical data, the default is pmm, or predictive mean matching, which will be your best bet in most scenarios.

The original code you sent already used pmm and the MiceImputer; the problem is that you were using too many features for each column you want to impute. This will never be a problem for mean imputation, because mean imputation is univariate and independent (i.e. each column depends only on itself, and no other columns are used). That's why you can impute the full dataset with mean but run into issues with pmm.
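
So something like the sketch below should get you what you're after (from memory, so double-check against the docs; the column names are placeholders again):

from autoimpute.imputations import MiceImputer

# leave strategy blank so numeric columns default to pmm, and keep the
# column count well below the row count
imp = MiceImputer(n=5, return_list=True, seed=101)
imputations = imp.fit_transform(dataset[["col_a", "col_b", "col_c"]])

# each element should be an (imputation number, completed DataFrame) pair
for num, completed_df in imputations:
    print(num, completed_df.isnull().sum().sum())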

kearnz commented 3 years ago

Closing this. If you have further questions, feel free to let me know.