haihuangcode / CMG

The official implementation of Achieving Cross Modal Generalization with Multimodal Unified Representation (NeurIPS '23)

encoders design question #3

makimon123 closed 8 months ago

makimon123 commented 8 months ago

Hello, your work has inspired me a lot! I have a question about the semantic encoders and the modality-specific encoders: what do you need to consider when designing them, and do complex encoders help the experimental results?

Looking forward to hearing from you!

haihuangcode commented 8 months ago

Thank you for your recognition of our work.

Semantic encoders and modality-specific encoders: what do you need to consider when designing them?

Semantic encoders need to ensure that the output shapes from different modalities are the same (because they use the same codebook for quantization), whereas modality-specific encoders do not have this constraint. As for how the encoders are designed, it depends on what information you want to extract from the modality.
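As a rough illustration of the shape constraint (a minimal NumPy sketch, not taken from the CMG code): the toy linear encoders, the dimensions, and the names `make_linear_encoder` and `quantize` are all hypothetical. The only point is that both semantic encoders must emit vectors of the codebook's dimension, while the modality-specific encoders are free to differ.

```python
import numpy as np

rng = np.random.default_rng(0)

CODE_DIM = 8    # feature size shared by all semantic encoders (hypothetical value)
NUM_CODES = 16  # number of codebook entries (hypothetical value)

# A single codebook shared by every modality.
codebook = rng.normal(size=(NUM_CODES, CODE_DIM))


def make_linear_encoder(in_dim, out_dim):
    """Return a toy linear 'encoder' mapping (T, in_dim) -> (T, out_dim)."""
    weight = rng.normal(size=(in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ weight


def quantize(features, codebook):
    """Nearest-neighbour lookup: return code indices and quantized vectors."""
    # Squared Euclidean distance between every feature and every code.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


# Raw inputs differ in size per modality (hypothetical dimensions).
video = rng.normal(size=(10, 512))
audio = rng.normal(size=(10, 128))

# Semantic encoders: different input sizes, identical output size.
video_semantic = make_linear_encoder(512, CODE_DIM)
audio_semantic = make_linear_encoder(128, CODE_DIM)

# Modality-specific encoders: output sizes are unconstrained.
video_specific = make_linear_encoder(512, 64)
audio_specific = make_linear_encoder(128, 32)

video_idx, video_q = quantize(video_semantic(video), codebook)
audio_idx, audio_q = quantize(audio_semantic(audio), codebook)

print(video_q.shape, audio_q.shape)                               # (10, 8) (10, 8)
print(video_specific(video).shape, audio_specific(audio).shape)   # (10, 64) (10, 32)
```

If a semantic encoder produced a different feature size than `CODE_DIM`, the distance computation in `quantize` would fail to broadcast, which is the practical form of the constraint.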

Are complex encoders helpful for the experimental results?

Complex encoders certainly help within a single modality (for example, video-to-video), but not necessarily for cross-modal generalization (for example, video-to-audio). A more complex encoder can model each modality in greater depth, but to do so it may come to rely on modality-specific information, which can hurt cross-modal generalization to some extent. To improve cross-modal generalization, the focus should be more on the design of the DCID part. That said, these expectations would still need to be verified experimentally.

makimon123 commented 8 months ago

Thank you very much for your answers!