Closed xuancong84 closed 4 years ago
Hi @xuancong84
Thanks for your concern. A couple things to note first about raising issues:
Now, to address your concerns.
`df1` working and `df2` failing may have led you to believe that the issue has to do with the test data being used, but I can assure you that's not the case! `autoimpute` uses train estimates only. There is no data leakage or usage of test data to compute estimates. That being said, I can see why you think so. It's because of an unrelated bug.

`autoimpute` uses decorators to validate data before performing imputations (see here). One of the decorators, `check_nan_columns`, raises the error you're seeing above because one of the columns is fully missing. In this case, that's column 4. It only has one value: `np.nan`. Here's some working code to demonstrate, using mean imputation. It shows that training estimates are used.
```python
import numpy as np
import pandas as pd
from autoimpute.imputations import SingleImputer

np.random.seed(3)
n = 5
arr = np.random.uniform(high=5, size=(n, n))
for _ in range(3):
    arr[np.random.randint(n), np.random.randint(n)] = np.nan
df = pd.DataFrame(arr)
print(df)
print('-' * 70)

# Build the model
imp = SingleImputer(strategy='mean')
model = imp.fit(df)

# Show the "training mean", which is then imputed in test.
for ix, col in model.statistics_.items():
    print(f'col {ix}: {col.statistics_}')

# Demonstrate two working examples, where the training mean is used to impute (i.e. no leakage)
df1 = pd.DataFrame(np.array([[1, 2, 3, 4, np.nan], [1, 2, 3, 4, 5]]))
print(df1)
print(imp.transform(df1))

df2 = pd.DataFrame(np.array([[1, 2, 3, 4, 5], [1, 2, 3, 5, np.nan], [1, 2, 3, 9, 10], [1, 2, 3, 4, np.nan]]))
print(df2)
print(imp.transform(df2))

# Demonstrate the case with a ValueError
df3 = pd.DataFrame(np.array([[1, 2, 3, 4, np.nan]]))
print(df3)
print(imp.transform(df3))
```

Please let me know if this makes sense, and feel free to re-open the issue if I've missed your point or you have a counter-example. Happy to take another look!
Hi @kearnz, thanks for your prompt reply. Apparently you have not fully understood my question.
I am saying that, for example, mean imputation should split into two kinds:

1. imputation using the train-set statistic (e.g. the train-set mean), and
2. imputation using the test-set statistic (e.g. the test-set mean).

In case 1, it should not raise any exception even if an entire column in the test set is NaN, because it should just use the train-set mean. Only in case 2 should it raise such an error.
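For concreteness, the two kinds can be sketched in plain pandas (this is an illustration, not autoimpute's code):

```python
import numpy as np
import pandas as pd

train = pd.Series([1.0, 3.0, np.nan])   # train-set mean = 2.0
test = pd.Series([10.0, np.nan])        # test-set mean = 10.0

train_mean = train.mean()
# kind 1: fill test NaNs with the train-set mean
imputed_train_stat = test.fillna(train_mean)
# kind 2: fill test NaNs with the test-set mean
imputed_test_stat = test.fillna(test.mean())
print(imputed_train_stat.tolist())  # [10.0, 2.0]
print(imputed_test_stat.tolist())   # [10.0, 10.0]
```

Note that kind 1 works even when the test column is entirely NaN, since the statistic was fixed at fit time.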
In case 1, it only raises an error if the ENTIRE column in the test set is NaN (this is a bug; it should be able to fall back to the train-set mean). If the column is not entirely NaN, the imputation proceeds using the train-set mean. That's what's happening in your picture: the last df has 1 row, and the 5th column (index 4) is NaN. Because it's NaN and there's only one row, the entire column is considered NaN.
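A minimal sketch of why the single-row frame trips the check: with one row, a single `np.nan` makes the whole column count as fully missing.

```python
import numpy as np
import pandas as pd

# a single-row frame: column 4 holds exactly one value, np.nan
df3 = pd.DataFrame(np.array([[1, 2, 3, 4, np.nan]]))
# the same "fully missing" test the validation decorator performs
fully_missing = [c for c in df3.columns if df3[c].isnull().all()]
print(fully_missing)  # [4]
```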
Also, to this quote you provided:

> The above code shows one major issue with your framework. If you have 3 datasets and you train an imputation model on dataset 1 (using any of the strategies), then, whether you test on dataset 2 or on datasets 2 & 3 combined, you will get different results. Somehow, some of the statistics still come from the test set. This will result in systematic inconsistency. Thanks!
I tested that out with least squares, and I got the same results from dataset 2 and dataset 2&3 combined... can you show an example where they aren't the same?
Overall, `autoimpute` always uses models built on training data to perform future imputations. I haven't seen a use case where you'd use test data unless you were retraining.
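To illustrate that point with plain pandas (a sketch, not autoimpute itself): when the statistic is frozen at fit time, imputing dataset 2 alone versus datasets 2 & 3 combined yields identical values for dataset 2's rows.

```python
import numpy as np
import pandas as pd

# mean fixed at fit time on dataset 1 ("train")
d1 = pd.DataFrame({'x': [1.0, 3.0, np.nan]})
train_mean = d1['x'].mean()  # 2.0

d2 = pd.DataFrame({'x': [5.0, np.nan]})
d3 = pd.DataFrame({'x': [9.0, np.nan]})

# impute d2 alone vs d2 & d3 combined: both use the same frozen train mean,
# so d2's imputed rows are identical either way
alone = d2['x'].fillna(train_mean).tolist()
combined = pd.concat([d2, d3], ignore_index=True)['x'].fillna(train_mean).tolist()
print(alone)         # [5.0, 2.0]
print(combined[:2])  # [5.0, 2.0]
```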
The defect can be worked around with a one-line modification in `autoimpute/utils/checks.py`.

Change from:

```python
if nc:
    err = f"All values missing in column(s) {nc}. Should be removed."
    raise ValueError(err)
```

to:

```python
if not kwargs.pop('allow_nan_column', False) and nc:
    err = f"All values missing in column(s) {nc}. Should be removed."
    raise ValueError(err)
```
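The intent of the patch can be sketched in isolation with a toy validation decorator. This is not autoimpute's actual implementation; `check_nan_columns`, `impute_mean`, and the `allow_nan_column` flag are reconstructed here for illustration.

```python
import functools
import numpy as np
import pandas as pd

def check_nan_columns(func):
    """Toy sketch of a validation decorator with the proposed opt-out flag."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        nc = [c for c in df.columns if df[c].isnull().all()]
        # with allow_nan_column=True, a fully missing column no longer raises
        if not kwargs.pop('allow_nan_column', False) and nc:
            raise ValueError(f"All values missing in column(s) {nc}. Should be removed.")
        return func(df, *args, **kwargs)
    return wrapper

@check_nan_columns
def impute_mean(df, means):
    # fill NaNs from a precomputed (train-set) mean per column
    return df.fillna(means)

means = pd.Series({0: 1.0, 1: 2.5})  # pretend these came from the train set
df = pd.DataFrame({0: [np.nan, np.nan], 1: [2.0, 3.0]})  # column 0 fully missing
out = impute_mean(df, means, allow_nan_column=True)  # no error; train mean fills column 0
print(out[0].tolist())  # [1.0, 1.0]
```

Without the flag, `impute_mean(df, means)` still raises the original `ValueError`, so existing behavior is preserved by default.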
In research, scientific integrity plays a very important part. One can publish very good papers by playing tricks between the train and test sets in order to get good results, but such results can never be applied in real life, because those tricks simply do not work in real-life applications.
Your framework provides `fit`, `transform`, and `fit_transform`; the architecture is very sound. However, for every strategy there should be an option specifying whether to take statistics from the training set or the test set. For example, for mean imputation, should it use the train-set mean or the test-set mean? These discussions give a very good explanation of that: 1 2
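The suggested option could look something like the following. This is a hypothetical sketch of the proposed API, not autoimpute's actual interface; the class name and the `stats_from` parameter are invented for illustration.

```python
import numpy as np
import pandas as pd

class MeanImputer:
    """Hypothetical sketch of the proposed option; not autoimpute's API."""
    def __init__(self, stats_from='train'):
        self.stats_from = stats_from  # 'train' or 'test'

    def fit(self, df):
        self.means_ = df.mean()  # statistics frozen at fit time
        return self

    def transform(self, df):
        # choose where the imputation statistic comes from
        means = df.mean() if self.stats_from == 'test' else self.means_
        return df.fillna(means)

train = pd.DataFrame({'x': [1.0, 3.0, np.nan]})  # train mean = 2.0
test = pd.DataFrame({'x': [10.0, np.nan]})       # test mean = 10.0

from_train = MeanImputer('train').fit(train).transform(test)
from_test = MeanImputer('test').fit(train).transform(test)
print(from_train['x'].tolist())  # [10.0, 2.0]
print(from_test['x'].tolist())   # [10.0, 10.0]
```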
The above code shows one major issue with your framework. If you have 3 datasets and you train an imputation model on dataset 1 (using any of the strategies), then, whether you test on dataset 2 or on datasets 2 & 3 combined, you will get different results. Somehow, some of the statistics still come from the test set. This will result in systematic inconsistency. Thanks!