huckiyang opened 4 months ago
Hi @huckiyang, thanks for the interest! We inserted new cross-attention layers in Whisper's decoder to attend to the visual features from AV-HuBERT. The new cross-attention layers are randomly initialized. We also apply tanh gating in the cross-attention layers so that they initially bypass the attention to the visual features.
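For anyone following along, here is a minimal sketch (not the actual Whisper-Flamingo code) of what such a tanh-gated cross-attention layer could look like in PyTorch; the module and argument names (`GatedVisualCrossAttention`, `visual_features`) are hypothetical:

```python
import torch
import torch.nn as nn


class GatedVisualCrossAttention(nn.Module):
    """Sketch of a gated cross-attention block added to a decoder layer."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Gate starts at 0, so tanh(gate) = 0 and the layer is a no-op at the
        # beginning of training, i.e. the decoder initially ignores the video.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, visual_features: torch.Tensor) -> torch.Tensor:
        # x: (batch, text_len, d_model) decoder hidden states (queries)
        # visual_features: (batch, video_len, d_model) visual encoder outputs (keys/values)
        attn_out, _ = self.attn(self.norm(x), visual_features, visual_features)
        return x + torch.tanh(self.gate) * attn_out
```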
It looks similar to whispering-llama, except whispering-llama injects the Whisper encoder features into LLaMA via new cross-attention layers that are initialized from Whisper's decoder cross-attention layers. It seems like we could combine Whisper-Flamingo and whispering-llama!
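To make the contrast concrete, a hedged sketch of the init-from-existing-layers idea described above, assuming both layers are plain `nn.MultiheadAttention` modules of matching shape; the function name is illustrative and not either repo's actual API:

```python
import torch.nn as nn


def init_from_whisper_cross_attn(new_layer: nn.MultiheadAttention,
                                 whisper_cross_attn: nn.MultiheadAttention) -> None:
    # Seed a newly inserted cross-attention layer with the weights of an
    # existing Whisper decoder cross-attention layer, rather than random init.
    new_layer.load_state_dict(whisper_cross_attn.state_dict())
```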
Feel free to discuss further :)
Thanks for the nice work @roudimit. I just took a deeper look at the decoder, where the k, v come from the visual features. Is it similar to whispering-llama in terms of cross-modal feature merging in the decoder? Just wondering, for educational purposes, about the differences in the fusion architecture design.