iffsid / mmvae

Multimodal Mixture-of-Experts VAE

Reproduce results in Table 2 and Table 4 #2

Faye3321 closed this issue 4 years ago

Faye3321 commented 4 years ago

Hi! Thanks for sharing this great project! I trained the model with your suggested settings and also evaluated your provided trained model, but in neither case could I reproduce the results reported in Table 2 and Table 4, especially joint coherence. Can you give any hints? Thanks.

YugeTen commented 4 years ago

Hi, thank you for the interest in our work!

Could you possibly share the args you ran the experiments on and the results you are getting? Thanks!

Faye3321 commented 4 years ago

For training, I used your suggested settings. For your provided trained model, the joint coherence I got on CUB is around 0.1 (vs. 0.263 in Table 4).

YugeTen commented 4 years ago

Updates

Thanks for bringing the issues to our attention. We've now tracked down the reason for the discrepancies between the released code and the reported results -- in the effort to clean up and publish our code, there were a couple of minor things we missed:

  1. The (fixed) scales for the individual-modality likelihoods were transferred incorrectly -- they should have been 0.75 instead of 0.1 (a minimal sketch of what this looks like follows the list);
  2. We didn't include the trained FastText embeddings used to compute CCA and cross/joint coherence scores.
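For context on (1), here is a minimal, illustrative sketch of a fixed-scale decoder likelihood, assuming a Laplace output distribution as an example -- the names here are not the repo's actual classes, the point is only that the scale is a fixed constant:

```python
# Illustrative sketch (not the repo's model code): a per-modality decoder
# likelihood with a fixed scale, using a Laplace output as an example.
import torch
import torch.distributions as dist

LIKELIHOOD_SCALE = 0.75  # corrected value; 0.1 was the value shipped by mistake

def decoder_likelihood(px_loc: torch.Tensor) -> dist.Laplace:
    """Wrap the decoder's output means in a fixed-scale Laplace likelihood."""
    scale = torch.full_like(px_loc, LIKELIHOOD_SCALE)
    return dist.Laplace(px_loc, scale)
```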

We have fixed the above in the most recent commit. Along with the code update, we have also uploaded new pretrained models for both the MNIST-SVHN and CUB datasets that reproduce results similar to those reported in Table 2 and Table 4 of our paper -- see the README for more details.

We do apologise for the confusion this inconsistency caused, and thank you again for bringing this forward.

Explanation for CUB discrepancy

To expand a bit on (2): for measuring cross/joint coherence on CUB, we use off-the-shelf ResNets for the cub-image data, but train FastText embeddings for the cub-sentence data -- since its vocabulary is quite different from what FastText is typically trained on.

We then use these embeddings (ResNet, trained FastText) to compute CCA on the ground-truth image and sentence training data, and use the learnt CCA projections to compute the correlation for samples generated from our model.
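To make this concrete, below is a rough sketch of this kind of CCA-based coherence scoring, using scikit-learn's CCA as a stand-in -- the repo's `analyse_cub.py` has its own implementation and stores its fitted projections (`emb_proj.pt`, `im_proj.pt`, etc.), so this is only illustrative:

```python
# Rough sketch of CCA-based coherence scoring between image and sentence
# feature spaces; scikit-learn's CCA stands in for the repo's own projections.
import numpy as np
from sklearn.cross_decomposition import CCA

def fit_cca(img_feats, sent_embs, n_components=40):
    """Fit linear CCA on ground-truth (image, sentence) feature pairs.

    img_feats: (N, d_img) array, e.g. off-the-shelf ResNet features
    sent_embs: (N, d_txt) array, e.g. averaged FastText word vectors per caption
    """
    cca = CCA(n_components=n_components, max_iter=1000)
    cca.fit(img_feats, sent_embs)
    return cca

def coherence_score(cca, gen_img_feats, gen_sent_embs):
    """Average canonical correlation of generated (image, sentence) pairs:
    higher means the two modalities of a generated pair agree more."""
    u, v = cca.transform(gen_img_feats, gen_sent_embs)
    corrs = [np.corrcoef(u[:, i], v[:, i])[0, 1] for i in range(u.shape[1])]
    return float(np.mean(corrs))
```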

The learnt embeddings, however, can vary quite a bit due to the limited dataset size. The embeddings we used to report results in the paper were not saved with our models, so re-computing them as part of the analyses can result in different numeric values, including for the baseline. Note that the relative performance of our model against the baseline remains the same; only the absolute numbers can differ.

Reproducing CUB results (Table 4)

We have done a quick search for the FastText embeddings that produce the same results on the baseline as reported in the paper, and re-computed the CCA and cross/joint coherence scores on our model with these. To reproduce results similar to those reported in our paper, download the zip file here and do the following:

  1. Move cub.all, cub.emb, cub.pc to data/cub/oc:3_sl:32_s:300_w:3/;
  2. Move the rest of the files, i.e. emb_mean.pt, emb_proj.pt, images_mean.pt, im_proj.pt, to path/to/trained/model/folder/ (steps 1 and 2 are also sketched as a script after this list);
  3. Set the RESET variable in src/report/analyse_cub.py (line 21) to False.
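If it helps, steps 1 and 2 can be scripted; a rough sketch using only the standard library, assuming the zip contents were extracted into the repo root and keeping the model directory as a placeholder:

```python
# Rough sketch of steps 1 and 2 above; adjust the placeholder paths to your setup.
import shutil
from pathlib import Path

DATA_DIR = Path("data/cub/oc:3_sl:32_s:300_w:3")
MODEL_DIR = Path("path/to/trained/model/folder")  # placeholder: your trained model dir

DATA_DIR.mkdir(parents=True, exist_ok=True)
for name in ("cub.all", "cub.emb", "cub.pc"):
    shutil.move(name, str(DATA_DIR / name))

for name in ("emb_mean.pt", "emb_proj.pt", "images_mean.pt", "im_proj.pt"):
    shutil.move(name, str(MODEL_DIR / name))
```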

With these fixes in place, the results from the code match those in the paper (with even improved cross-coherence scores on CUB).

Faye3321 commented 4 years ago

Thanks for your clarifications!