lyuchenyang / Macaw-LLM

Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Apache License 2.0

Questions about Model #1

Closed. RitchieAlpha closed this issue 1 year ago.

RitchieAlpha commented 1 year ago

Dear Author,

I would like to express my sincere gratitude for your open-source contributions. Your model has left a deep impression on me. It seems that the model is anchored on text (CLIP aligns images with text, while Whisper aligns audio with text), and its ultimate goal appears to lean towards multi-modal QA and multi-modal captioning. However, I have the following questions:

  1. The dimensions of different modalities are vastly different. How do you balance the information from different modalities in your network?
  2. In real-world scenarios, some modalities may be missing. Do you need to input all three modalities during training/inference, or can only some of them be provided?

I am looking forward to your work and hope to see your article soon. Thank you.

Best regards, RitchieAlpha

lyuchenyang commented 1 year ago

Hi,

Thank you for your kind words and appreciation for our open-source project. We're glad that our neural network model has made a positive impression on you. I'll be happy to address your questions:

1. We employ linear layers to project the features of each modality into the same dimension, and then use attention to align the multi-modal features to the embeddings of the LLM.
2. Good question. During training we use all modalities; for examples that are missing a modality we simply use a placeholder (e.g., a tensor of zeros for image/video/audio) for that modality. A rough sketch of both points is shown below.
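
To make the two points above concrete, here is a minimal, hypothetical PyTorch sketch (the class name, dimensions, sequence lengths, and the single cross-attention step are illustrative assumptions, not the actual Macaw-LLM code): each modality is projected to the LLM hidden size with a linear layer, attention aligns the projected features to the LLM's text embeddings, and a missing modality is passed in as a zero-tensor placeholder.

```python
import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Toy sketch: project modality features to the LLM dimension, then
    cross-attend from the text embeddings to the multi-modal features."""
    def __init__(self, clip_dim=768, whisper_dim=512, llm_dim=4096, n_heads=8):
        super().__init__()
        self.image_proj = nn.Linear(clip_dim, llm_dim)     # CLIP image features -> llm_dim
        self.video_proj = nn.Linear(clip_dim, llm_dim)     # CLIP video-frame features -> llm_dim
        self.audio_proj = nn.Linear(whisper_dim, llm_dim)  # Whisper audio features -> llm_dim
        self.align_attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)

    def forward(self, text_emb, image_feat, video_feat, audio_feat):
        # All three features are expected; a missing modality is passed in
        # as a zero tensor of the matching shape (the placeholder strategy).
        mm = torch.cat([self.image_proj(image_feat),
                        self.video_proj(video_feat),
                        self.audio_proj(audio_feat)], dim=1)
        # Query with the text embeddings so the aligned output lives in the
        # LLM embedding space and can be combined with the prompt embeddings.
        aligned, _ = self.align_attn(query=text_emb, key=mm, value=mm)
        return aligned

# Usage with a missing audio modality: pass zeros as the placeholder.
text_emb   = torch.randn(2, 32, 4096)
image_feat = torch.randn(2, 1, 768)
video_feat = torch.randn(2, 8, 768)
audio_feat = torch.zeros(2, 10, 512)   # placeholder for the absent audio input
aligner = ModalityAligner()
out = aligner(text_emb, image_feat, video_feat, audio_feat)  # (2, 32, 4096)
```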

Thanks again for your interest and attention.

Chenyang