Baselines calculated only on test set (?)

jsyoon0823 / GAIN

Codebase for Generative Adversarial Imputation Networks (GAIN) - ICML 2018


Baselines calculated only on test set (?) #7

Closed: sdimi closed this 4 years ago

sdimi commented 5 years ago

Hi,

From reading the paper, I think that the baselines (such as MICE and MissForest) are fitted and evaluated only on the test set. GAIN, on the other hand, learns a model from the larger training set and then imputes the test set.

What are your thoughts on that subtle difference?

zaythedatascientist commented 5 years ago

I'm also getting better results with the baselines (MissForest and sklearn's MICE) than with GAIN. I'm using the default configuration for the baselines and the code and parameters provided in this implementation for GAIN, on both the Letter and Spam datasets.

jsyoon0823 commented 4 years ago

For the paper, we explicitly divided the data into train and test sets and trained all the models (including GAIN, MICE, and MissForest) on the training data only. We then used the trained models to impute the test data. In this repository, however, I assume that people usually impute before dividing the data for further model development, so I do not split the data into train/test here.
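
To make the two protocols discussed in this thread concrete, here is a minimal sketch. It is not the code from the paper or from this repository. It assumes synthetic Gaussian data, values missing completely at random at a 20% rate, and a plain column-mean imputer (the hypothetical `MeanImputer` below) standing in for GAIN, MICE, or MissForest. Protocol A fits on the training split and imputes the test split. Protocol B fits and imputes on the test split only.

```python
import numpy as np


def mask_completely_at_random(data, miss_rate, rng):
    """Return a copy of `data` with NaNs inserted, plus the observed mask (True = observed)."""
    mask = rng.uniform(size=data.shape) > miss_rate
    data_miss = data.copy()
    data_miss[~mask] = np.nan
    return data_miss, mask


class MeanImputer:
    """Hypothetical stand-in imputer: fills each column with the mean learned in fit()."""

    def fit(self, x):
        # Column means computed over observed (non-NaN) entries only.
        self.means_ = np.nanmean(x, axis=0)
        return self

    def transform(self, x):
        # Observed entries are kept; missing entries get the fitted column mean.
        return np.where(np.isnan(x), self.means_, x)


def rmse_on_missing(truth, imputed, mask):
    """RMSE computed only over the entries that were missing."""
    missing = ~mask
    return float(np.sqrt(np.mean((truth[missing] - imputed[missing]) ** 2)))


rng = np.random.default_rng(0)
# Assumed toy data: 1000 rows, 3 features with different means.
data = rng.normal(loc=[0.0, 5.0, -3.0], scale=1.0, size=(1000, 3))
train, test = data[:800], data[800:]

train_miss, _ = mask_completely_at_random(train, 0.2, rng)
test_miss, test_mask = mask_completely_at_random(test, 0.2, rng)

# Protocol A: fit on the training split, impute the test split.
imputed_a = MeanImputer().fit(train_miss).transform(test_miss)
rmse_a = rmse_on_missing(test, imputed_a, test_mask)

# Protocol B: fit and impute on the test split only.
imputed_b = MeanImputer().fit(test_miss).transform(test_miss)
rmse_b = rmse_on_missing(test, imputed_b, test_mask)

print(f"train-fit RMSE: {rmse_a:.3f}, test-only-fit RMSE: {rmse_b:.3f}")
```

With a learned model in place of the mean imputer, the two protocols can give noticeably different numbers, so a comparison between GAIN and the baselines is only fair when every method follows the same protocol.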