Closed lihp11 closed 4 years ago
In the training stage, we compute the reconstruction loss; therefore, we compute the MSE over the known features. In the testing stage, we compute the imputation loss; therefore, we compute the MSE over the unknown features.
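The distinction above can be sketched with NumPy (a hedged illustration, not the repository's exact code; the names `X`, `G_sample`, and `M` follow the conventions in the GAIN code):

```python
import numpy as np

# Illustrative sketch of the two masked MSE losses described above.
# M is the mask matrix: 1 where a value is observed, 0 where it is missing.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])          # ground-truth data
G_sample = np.array([[1.0, 1.0],
                     [4.0, 4.0]])   # generator output (imputations)
M = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # observed-entry mask

# Training: reconstruction loss on the KNOWN entries (weighted by M)
mse_train = np.sum((M * X - M * G_sample) ** 2) / np.sum(M)

# Testing: imputation loss on the UNKNOWN entries (weighted by 1 - M)
mse_test = np.sum(((1 - M) * X - (1 - M) * G_sample) ** 2) / np.sum(1 - M)

print(mse_train, mse_test)  # -> 0.0 1.0 (perfect on observed, off on missing)
```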
What is "tqdm"? I don't have this module.
Please see the following link. https://github.com/tqdm/tqdm
Suppose the matrix I want to fill, say Mr, has a high missing rate; in some places an entire row or an entire column is missing (rows correspond to time slices, columns to feature information). However, Mr has three associated matrices, for example X, Y, and Z, which are themselves missing values to different degrees. I need to combine the information in Mr, X, Y, and Z to fill the target matrix Mr. How should I modify your GAIN model?
I think the easiest way is just to concatenate X, Y, and Z to make a training set and train the GAIN algorithm. Then use Mr as the testing set and see the results. Thanks.
Thanks a lot for your instant reply. Maybe I didn't make it clear: the 2D matrices X, Y, and Z here have different dimensions. They are only supplementary information for the matrix Mr, containing some of its properties, so it is impossible to impute the missing values of Mr based on X, Y, and Z alone. I have tried matrix decomposition, tensor decomposition, SVR, and other methods, but the RMSE they achieve is not satisfactory, so I want to try again, this time using a GAN to impute the missing values. However, papers that use GANs for data imputation are very rare, so I read your GAIN paper carefully. Since the problem I need to solve is somewhat different from your data, I cannot apply it directly and need to adapt your GAIN model. My idea is to modify equations (2) and (3) in your GAIN paper to obtain a generator based on Mr, X, Y, Z, and M. What do you think? Maybe I need to make a big change to your model?
If X, Y, and Z share some features with Mr, you can concatenate X, Y, Z, and Mr row-wise (so that each column represents the same feature). The non-overlapping features are then treated as entirely missing, and you can run GAIN. If there are no overlapping features, I think GAIN is not an appropriate approach.
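One way to build that row-wise concatenation is with pandas, which aligns columns by name and fills non-overlapping features with NaN (a sketch under the assumption that X, Y, and Mr share some named features; the feature names `f1`–`f3` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical datasets with partially overlapping feature columns.
X  = pd.DataFrame({"f1": [1.0, 2.0], "f2": [3.0, 4.0]})
Y  = pd.DataFrame({"f2": [5.0],      "f3": [6.0]})
Mr = pd.DataFrame({"f1": [np.nan],   "f3": [7.0]})

# pd.concat aligns columns by name; features absent from a dataset become
# NaN, i.e. "entirely missing" from GAIN's point of view.
train = pd.concat([X, Y], ignore_index=True, sort=False)   # training set
full  = pd.concat([train, Mr], ignore_index=True, sort=False)
print(full)
```

GAIN would then be trained on the `train` rows and asked to impute the `Mr` rows, with its mask matrix derived from the NaN pattern.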
Thanks a lot, I get it. I have tried it on some data. Now I have another question. As shown in the GAIN paper and code:
```python
G_loss1 = -tf.reduce_mean((1-M) * tf.log(D_prob + 1e-8)) / tf.reduce_mean(1-M)
MSE_train_loss = tf.reduce_mean((M * X - M * G_sample)**2) / tf.reduce_mean(M)
G_loss = G_loss1 + alpha * MSE_train_loss
```
Here G_loss1 is a probability-based term and is very small, but MSE_train_loss is very large (e.g. 30-60 on the original X), and the two terms are simply added together with the weight alpha. That doesn't seem reasonable to me. When I tried it anyway, some values in my imputed data came out as 0s or 1s, which is not what I want. What do you think?
G_loss1 is usually between 0 and 2. Therefore, we first normalize the features so that the MSE is around 0 to 1. Then we adjust alpha for each dataset and task. Note that the MSE loss is only applied to the observed variables and G_loss1 is only applied to the unobserved variables.
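A minimal sketch of that normalization step (column-wise min-max scaling to [0, 1], skipping missing entries; this is an illustration of the idea, not the repository's exact preprocessing):

```python
import numpy as np

def minmax_normalize(data):
    """Scale each column to [0, 1], ignoring NaNs (missing values)."""
    col_min = np.nanmin(data, axis=0)
    col_max = np.nanmax(data, axis=0)
    # Guard against constant columns to avoid division by zero.
    denom = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (data - col_min) / denom

# Hypothetical raw data where the first feature spans 30-60,
# so an unnormalized MSE could easily land in the 30-60 range.
data = np.array([[30.0, 0.5],
                 [60.0, np.nan],
                 [45.0, 1.5]])
norm = minmax_normalize(data)
print(norm)
```

After scaling, squared errors per feature are bounded by 1, so MSE_train_loss sits on a scale comparable to G_loss1 before alpha is applied.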
Thanks very much.
Hi, I noticed that in your code:

```python
MSE_train_loss = tf.reduce_mean((M * X - M * G_sample)**2) / tf.reduce_mean(M)
MSE_test_loss = tf.reduce_mean(((1-M) * X - (1-M) * G_sample)**2) / tf.reduce_mean(1-M)
```

My question is: why is there "1-M" in the test loss but "M" in the train loss?
Thanks!