roudimit / whisper-flamingo

[Interspeech 2024] Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
https://arxiv.org/abs/2406.10082

similarity to whispering-llama #1


huckiyang commented 4 months ago

Thanks for the nice work @roudimit. I just took a deeper look at how the decoder's k,v are derived from the visual features.

Is this similar to whispering-llama's cross-modal feature merging in the decoder?

Just wondering, for educational purposes, about the differences in the fusion architecture design.

roudimit commented 4 months ago

Hi @huckiyang, thanks for the interest! We inserted new cross-attention layers into Whisper's decoder to attend to the visual features from AV-HuBERT. The new cross-attention layers are randomly initialized. We also apply tanh gating in the cross-attention layers so that they initially bypass the attention to the visual features; a rough sketch of the idea is below.
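
For anyone reading along, here is a minimal sketch of the tanh-gated cross-attention idea (not our exact code; module names and shapes are made up for illustration):

```python
import torch
import torch.nn as nn

class GatedVisualCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention: a new, randomly initialized
    cross-attention layer whose output is scaled by tanh(gate).
    With the gate initialized to 0, tanh(0) = 0, so the layer acts as an
    identity at the start of training and the pretrained decoder is
    undisturbed."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(d_model)
        # Learnable scalar gate, initialized to zero.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, text_len, d_model) decoder hidden states (queries)
        # visual: (batch, vis_len, d_model) e.g. AV-HuBERT features (keys/values)
        attn_out, _ = self.attn(self.ln(x), visual, visual)
        return x + torch.tanh(self.gate) * attn_out
```

At initialization the residual path dominates, so the model behaves exactly like the pretrained audio-only Whisper; the gate learns to open as training progresses.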

It looks similar to whispering-llama, except that whispering-llama injects the Whisper encoder features into LLaMA via new cross-attention layers that are initialized from Whisper's decoder cross-attention layers (rather than from random). It seems like we could combine Whisper-Flamingo and whispering-llama!
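
So the key architectural difference is how the new layers are initialized. Roughly (hypothetical helper, just to contrast the two strategies):

```python
import copy
import torch.nn as nn

def init_new_cross_attention(pretrained_xattn: nn.Module,
                             from_pretrained: bool) -> nn.Module:
    """Contrasts the two initialization strategies discussed above.

    - whispering-llama style: clone an existing Whisper decoder
      cross-attention layer, so the new layer starts from pretrained weights.
    - Whisper-Flamingo style: random init (combined with tanh gating, as in
      the earlier sketch) so the new layer starts as a no-op.
    """
    new_xattn = copy.deepcopy(pretrained_xattn)
    if from_pretrained:
        # Keep the copied pretrained weights as-is.
        return new_xattn
    # Re-initialize: random weights for matrices, zeros for biases.
    for p in new_xattn.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
        else:
            nn.init.zeros_(p)
    return new_xattn
```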

Feel free to discuss further :)