kearnz / autoimpute

Python package for Imputation Methods
MIT License

cannot split between train and test set, scientific integrity not possible #50

Closed: xuancong84 closed this issue 4 years ago

xuancong84 commented 4 years ago

In research, scientific integrity plays a very important part. One can publish very good papers by playing tricks between the train and test sets to get good results, but such results can never be applied in real life, because those tricks simply do not work in real-world applications.

Your framework provides fit, transform and fit_transform; the architecture is very sound. However, every strategy should have an option specifying whether to take statistics from the training set or the test set. For example, for mean imputation, should it use the train-set mean or the test-set mean? These discussions give a very good explanation of that: 1 2
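To make the distinction concrete, here is a minimal sketch of the two behaviours using scikit-learn's SimpleImputer rather than autoimpute (the data is made up for illustration):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({'x': [1.0, 2.0, np.nan, 5.0]})  # observed train mean = (1+2+5)/3 ~ 2.67
test = pd.DataFrame({'x': [10.0, np.nan, 20.0]})      # observed test mean = 15.0

# Kind 1: statistics from the train set (no leakage)
imp = SimpleImputer(strategy='mean').fit(train)
print(imp.transform(test))  # NaN filled with ~2.67, the train-set mean

# Kind 2: statistics from the test set (leaks test information)
print(SimpleImputer(strategy='mean').fit_transform(test))  # NaN filled with 15.0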

[image: screenshot of code demonstrating the inconsistency]

The code in the screenshot above shows one major issue with your framework. If you have 3 datasets and you train an imputation model on dataset 1 (using any of the strategies), then you get different results depending on whether you test on dataset 2 alone or on datasets 2 and 3 combined. Somehow, some of the statistics still come from the test set. This results in systematic inconsistency. Thanks!

kearnz commented 4 years ago

Hi @xuancong84

Thanks for your concern. A couple of things to note first about raising issues:

  1. Please provide code inline as opposed to a picture; that makes it easy to copy and paste.
  2. If you use random numbers, please provide a seed of some sort to make the numbers reproducible.
  3. Please label issues as a bug, new feature, question, etc.

Now, to address your concerns.

Here's some working code to demonstrate, using mean imputation. It shows that the training estimates are the ones used at transform time.

Please let me know if this makes sense, and feel free to re-open the issue if I've missed your point or you have a counter-example. Happy to take another look!

import numpy as np
import pandas as pd
from autoimpute.imputations import SingleImputer

np.random.seed(3)
n = 5
arr = np.random.uniform(high=5, size=(n, n))
for _ in range(3):
    arr[np.random.randint(n), np.random.randint(n)] = np.nan
df = pd.DataFrame(arr)
print(df)
print('-'*70)

# Build the model on the training data
imp = SingleImputer(strategy='mean')
model = imp.fit(df)

# Show the "training mean", which is then imputed in test
for ix, col in model.statistics_.items():
    print(f'col {ix}: {col.statistics_}')

# Two working examples where the training mean is used to impute (i.e. no leakage)
df1 = pd.DataFrame(np.array([[1, 2, 3, 4, np.nan], [1, 2, 3, 4, 5]]))
print(df1)
print(imp.transform(df1))
df2 = pd.DataFrame(np.array([[1, 2, 3, 4, 5], [1, 2, 3, 5, np.nan], [1, 2, 3, 9, 10], [1, 2, 3, 4, np.nan]]))
print(df2)
print(imp.transform(df2))

# The case that raises a ValueError: a column that is entirely NaN in the test set
df3 = pd.DataFrame(np.array([[1, 2, 3, 4, np.nan]]))
print(df3)
print(imp.transform(df3))
xuancong84 commented 4 years ago

Hi @kearnz, thanks for your prompt reply. Apparently you have not fully understood my question.

I am saying that, for example, mean imputation should be split into two kinds:

  1. use the mean of the train set
  2. use the mean of the test set

In case 1, it should not raise any exception when an entire column in the test set is NaN, because it should just use the train-set mean. Only in case 2 should it raise such an error.
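In terms of the fit/transform/fit_transform API the package already exposes, the two kinds would presumably map as follows (a sketch; train_df and test_df are placeholder DataFrames):

from autoimpute.imputations import SingleImputer

imp = SingleImputer(strategy='mean')

# Kind 1: take statistics from the train set, then only transform the test set.
# An all-NaN column in test_df should fall back to the train-set mean, not raise.
imp.fit(train_df)
out1 = imp.transform(test_df)

# Kind 2: take statistics from the test set itself.
# Here an all-NaN column genuinely has no mean, so raising an error is correct.
out2 = SingleImputer(strategy='mean').fit_transform(test_df)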

kearnz commented 4 years ago

In case 1, it only raises an error if the ENTIRE column in the test set is NaN (this is a bug; it should be able to fall back to the train-set mean). If the column is not entirely NaN, the imputation proceeds using the train-set mean. That's what's happening in your picture: the last df has 1 row, and the 5th column (index 4) is NaN. Because it's NaN and there's only one row, the entire column is considered NaN.

Also, regarding this quote from your original comment:

  The code in the screenshot above shows one major issue with your framework. If you have 3 datasets and you train an imputation model on dataset 1 (using any of the strategies), then you get different results depending on whether you test on dataset 2 alone or on datasets 2 and 3 combined. Somehow, some of the statistics still come from the test set. This results in systematic inconsistency. Thanks!

I tested that out with least squares, and I got the same results from dataset 2 alone and from datasets 2 and 3 combined. Can you show an example where they aren't the same?
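A minimal version of that consistency check might look like the sketch below (the data is made up, and it assumes the 'least squares' strategy name and a DataFrame return from transform):

import numpy as np
import pandas as pd
from autoimpute.imputations import SingleImputer

np.random.seed(0)
cols = ['a', 'b', 'c']
train = pd.DataFrame(np.random.uniform(size=(50, 3)), columns=cols)
d2 = pd.DataFrame(np.random.uniform(size=(10, 3)), columns=cols)
d3 = pd.DataFrame(np.random.uniform(size=(10, 3)), columns=cols)
d2.loc[0, 'c'] = np.nan  # one missing cell to impute

imp = SingleImputer(strategy='least squares').fit(train)

# If no statistics leak from the test set, the imputed value for d2's
# missing cell is the same whether d2 is transformed alone or with d3.
alone = imp.transform(d2)
combined = imp.transform(pd.concat([d2, d3], ignore_index=True))
print(alone.loc[0, 'c'], combined.loc[0, 'c'])  # expected to be equal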

Overall, autoimpute always uses models built on training data to perform future imputations. I haven't seen a use case where you'd use test data unless you were retraining.

xuancong84 commented 4 years ago

The defect can be worked around with a one-line modification in autoimpute/utils/checks.py

Change from:

        if nc:
            err = f"All values missing in column(s) {nc}. Should be removed."
            raise ValueError(err)

to:

        if not kwargs.pop('allow_nan_column', False) and nc:
            err = f"All values missing in column(s) {nc}. Should be removed."
            raise ValueError(err)
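With that patch in place, a caller could presumably opt in per call, along these lines (hypothetical usage; it assumes the keyword actually propagates through to the check):

# hypothetical: opt in to the proposed allow_nan_column keyword
df3 = pd.DataFrame(np.array([[1, 2, 3, 4, np.nan]]))
imp.transform(df3, allow_nan_column=True)  # use the train-set mean instead of raising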