lyuchenyang / Macaw-LLM

Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Apache License 2.0

Questions about Model #1

Closed. RitchieAlpha closed this issue 1 year ago.

RitchieAlpha commented 1 year ago

Dear Author,

I would like to express my sincere gratitude for your open-source contributions. Your model has left a deep impression on me. It seems that the model is anchored on text (CLIP aligns images with text, while Whisper aligns audio with text), and its ultimate goal appears to lean towards multi-modal QA and multi-modal captioning. However, I have the following questions:

  1. The dimensions of different modalities are vastly different. How do you balance the information from different modalities in your network?
  2. In real-world scenarios, some modalities may be missing. Do you need to input all three modalities during training/inference, or can only some of them be provided?

I am looking forward to your work and hope to see your article soon. Thank you.

Best regards, RitchieAlpha

lyuchenyang commented 1 year ago

Hi,

Thank you for your kind words and appreciation for our open-source project. We're glad that our neural network model has made a positive impression on you. I'll be happy to address your questions:

1. We employ linear layers to project the features of each modality into the same dimension, and then use attention to align the multi-modal features to the embeddings of the LLM.
2. Good question. During training we use all modalities; for examples that are missing a modality we simply use a placeholder (e.g., a tensor of zeros for image/video/audio) for that modality. A rough sketch of both points is shown below.
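
To make the two points above concrete, here is a minimal, hypothetical PyTorch sketch (the class name, dimensions, sequence lengths, and the single cross-attention step are illustrative assumptions, not the actual Macaw-LLM code): each modality is projected to the LLM hidden size with a linear layer, attention aligns the projected features to the LLM's text embeddings, and a missing modality is passed in as a zero-tensor placeholder.

```python
import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Toy sketch: project modality features to the LLM dimension, then
    cross-attend from the text embeddings to the multi-modal features."""
    def __init__(self, clip_dim=768, whisper_dim=512, llm_dim=4096, n_heads=8):
        super().__init__()
        self.image_proj = nn.Linear(clip_dim, llm_dim)     # CLIP image features -> llm_dim
        self.video_proj = nn.Linear(clip_dim, llm_dim)     # CLIP video-frame features -> llm_dim
        self.audio_proj = nn.Linear(whisper_dim, llm_dim)  # Whisper audio features -> llm_dim
        self.align_attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)

    def forward(self, text_emb, image_feat, video_feat, audio_feat):
        # All three features are expected; a missing modality is passed in
        # as a zero tensor of the matching shape (the placeholder strategy).
        mm = torch.cat([self.image_proj(image_feat),
                        self.video_proj(video_feat),
                        self.audio_proj(audio_feat)], dim=1)
        # Query with the text embeddings so the aligned output lives in the
        # LLM embedding space and can be combined with the prompt embeddings.
        aligned, _ = self.align_attn(query=text_emb, key=mm, value=mm)
        return aligned

# Usage with a missing audio modality: pass zeros as the placeholder.
text_emb   = torch.randn(2, 32, 4096)
image_feat = torch.randn(2, 1, 768)
video_feat = torch.randn(2, 8, 768)
audio_feat = torch.zeros(2, 10, 512)   # placeholder for the absent audio input
aligner = ModalityAligner()
out = aligner(text_emb, image_feat, video_feat, audio_feat)  # (2, 32, 4096)
```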

Thanks again for your interest and attention.

Chenyang