roudimit / whisper-flamingo

[Interspeech 2024] Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
https://arxiv.org/abs/2406.10082

similarity to whispering-llama #1


huckiyang commented 4 months ago

Thanks for the nice work @roudimit. I just took a deeper look at how the decoder's k,v are derived from the visual features.

Is this similar to whispering-llama's cross-modal feature merging in the decoder?

Just wondering, for educational purposes, about the differences in the fusion architecture design.

roudimit commented 4 months ago

Hi @huckiyang, thanks for the interest! We inserted new cross-attention layers into Whisper's decoder to attend to the visual features from AV-HuBERT. The new cross-attention layers are randomly initialized. We also apply tanh gating in the cross-attention layers so that they initially bypass the attention to the visual features; a rough sketch of the idea is below.
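
For anyone reading along, here is a minimal sketch of the tanh-gated cross-attention idea (not our exact code; module names and shapes are made up for illustration):

```python
import torch
import torch.nn as nn

class GatedVisualCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention: a new, randomly initialized
    cross-attention layer whose output is scaled by tanh(gate).
    With the gate initialized to 0, tanh(0) = 0, so the layer acts as an
    identity at the start of training and the pretrained decoder is
    undisturbed."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(d_model)
        # Learnable scalar gate, initialized to zero.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, text_len, d_model) decoder hidden states (queries)
        # visual: (batch, vis_len, d_model) e.g. AV-HuBERT features (keys/values)
        attn_out, _ = self.attn(self.ln(x), visual, visual)
        return x + torch.tanh(self.gate) * attn_out
```

At initialization the residual path dominates, so the model behaves exactly like the pretrained audio-only Whisper; the gate learns to open as training progresses.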

It looks similar to whispering-llama, except that whispering-llama injects the Whisper encoder features into LLaMA via new cross-attention layers that are initialized from Whisper's decoder cross-attention layers (rather than from random). It seems like we could combine Whisper-Flamingo and whispering-llama!
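
So the key architectural difference is how the new layers are initialized. Roughly (hypothetical helper, just to contrast the two strategies):

```python
import copy
import torch.nn as nn

def init_new_cross_attention(pretrained_xattn: nn.Module,
                             from_pretrained: bool) -> nn.Module:
    """Contrasts the two initialization strategies discussed above.

    - whispering-llama style: clone an existing Whisper decoder
      cross-attention layer, so the new layer starts from pretrained weights.
    - Whisper-Flamingo style: random init (combined with tanh gating, as in
      the earlier sketch) so the new layer starts as a no-op.
    """
    new_xattn = copy.deepcopy(pretrained_xattn)
    if from_pretrained:
        # Keep the copied pretrained weights as-is.
        return new_xattn
    # Re-initialize: random weights for matrices, zeros for biases.
    for p in new_xattn.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
        else:
            nn.init.zeros_(p)
    return new_xattn
```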

Feel free to discuss further :)