Closed · RitchieAlpha closed this issue 1 year ago
Dear Author,

I would like to express my sincere gratitude for your open-source contributions. Your neural network model has left a deep impression on me. It appears that your model is driven by text information (CLIP aligns images with text, while Whisper aligns audio with text), and its ultimate goal seems to lean toward multimodal QA and multimodal captioning. However, I have the following questions:

1. How do you align multi-modal features of different dimensions with the embedding space of the LLM?
2. During training, how do you handle examples that are missing some modalities?

I am looking forward to your work and hope to read your article soon. Thank you.

Best regards, RitchieAlpha
Hi,

Thank you for your kind words and appreciation for our open-source project. We're glad that our neural network model has made a positive impression on you. I'll be happy to address your questions:

1. We employ linear layers to project the multi-modal features into a common dimension, and then use attention functions to align them with the embeddings of the LLM; a rough sketch of this idea follows below.
2. Good question. During training we use all modalities; for examples that are missing some modality, we simply use a placeholder for it (e.g., a tensor of zeros for image/video/audio), as illustrated in the second sketch below.
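To make point 1 concrete, here is a minimal PyTorch sketch of the idea: the module name, feature dimensions, and exact attention setup are illustrative only, not our actual implementation.

```python
import torch
import torch.nn as nn

class MultiModalAligner(nn.Module):
    """Hypothetical sketch: project per-modality features into the LLM's
    embedding dimension, then align them to the LLM token embeddings with
    attention. All names and dimensions are illustrative."""

    def __init__(self, image_dim=768, audio_dim=512, llm_dim=4096, num_heads=8):
        super().__init__()
        # Linear layers map each modality into the same (LLM) dimension.
        self.image_proj = nn.Linear(image_dim, llm_dim)
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        # Attention aligns the projected features to the LLM embeddings:
        # the LLM token embeddings attend (as queries) to the modality features.
        self.align_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, text_emb, image_feat, audio_feat):
        # text_emb:   (B, T, llm_dim)   -- from the LLM's embedding table
        # image_feat: (B, Ni, image_dim); audio_feat: (B, Na, audio_dim)
        mm = torch.cat([self.image_proj(image_feat),
                        self.audio_proj(audio_feat)], dim=1)  # (B, Ni+Na, llm_dim)
        aligned, _ = self.align_attn(query=text_emb, key=mm, value=mm)
        return aligned  # (B, T, llm_dim), combined with text_emb downstream
```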
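And for point 2, a minimal sketch of the placeholder idea; the helper name and feature shapes are made up for illustration.

```python
import torch

def fill_missing_modalities(sample, image_shape=(49, 768), audio_shape=(100, 512)):
    """Hypothetical helper: replace absent image/audio features with zero
    tensors of the expected shape, so every training example carries all
    modalities. Shapes are illustrative, not the project's values."""
    if sample.get("image") is None:
        sample["image"] = torch.zeros(image_shape)
    if sample.get("audio") is None:
        sample["audio"] = torch.zeros(audio_shape)
    return sample

# Example: a text+image sample without audio gets a zero audio placeholder.
sample = {"text": "a dog playing fetch", "image": torch.randn(49, 768), "audio": None}
sample = fill_missing_modalities(sample)
assert sample["audio"].abs().sum() == 0
```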
Thanks again for your interest and attention.

Chenyang