IcurasLW opened 4 months ago
In addition, the provided repository trains the encoders of both modalities in the full-modality setting. When it comes to modality fusion, it then loads those pre-trained encoder weights (trained on FULL data) in the missing-modality setting on the same dataset. This is cheating: the encoder has already seen all available data during the pre-training phase.
Hi IcurasLW,
Thanks for your question. The sound encoder is trained using only PARTIAL data, not the full dataset.
The AV-MNIST dataset contains 1,500 samples across 10 classes (1,050 for training and 450 for testing). We use the parameter "per_class_num" to control the number of samples used for training. For example, "per_class_num=21" means that 21 samples per class are used, totaling 210 samples (20% of the training samples). In our experiment, we assume the image data is complete (per_class_num=105) while the sound data is incomplete (e.g., per_class_num=21).
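For clarity, here is a minimal sketch of how a per_class_num subsampling step could be implemented. The parameter name per_class_num comes from the repo, but the helper function and the assumption that the dataset yields (image, sound, label) tuples are illustrative, not the repo's actual API.

```python
import random
from collections import defaultdict
from torch.utils.data import Subset

def subsample_per_class(dataset, per_class_num, seed=0):
    """Keep at most `per_class_num` samples from each class.

    Assumes `dataset[i]` returns a tuple whose last element is the label;
    the tuple layout is illustrative, not necessarily the repo's format.
    """
    by_class = defaultdict(list)
    for idx in range(len(dataset)):
        *_, label = dataset[idx]
        by_class[int(label)].append(idx)

    rng = random.Random(seed)
    keep = []
    for indices in by_class.values():
        rng.shuffle(indices)
        keep.extend(indices[:per_class_num])
    return Subset(dataset, keep)

# AV-MNIST training split: per_class_num=105 keeps all 1,050 training
# samples (complete image modality); per_class_num=21 keeps 210 samples
# (~20%), simulating the incomplete sound modality.
# image_train = subsample_per_class(avmnist_train, per_class_num=105)
# sound_train = subsample_per_class(avmnist_train, per_class_num=21)
```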
Please leave a comment if you have any further questions.
No one in the issues has been able to find the code that handles the missing modality. In the provided scripts, the full modality is available in the test data. None of the results are reproducible.