Open Haoqing-Wang opened 1 year ago
As we all know, the driven-audio feature a and the landmark representation l are each just a single vector, not a batch of vectors, so how can they be used as Key and Value in the cross-attention module?
Did you understand how this works? I'm totally confused right now.
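For what it's worth, here is a minimal sketch of one way this can work mechanically. It is not confirmed to be what the authors do. The assumption is that a single conditioning vector is treated as a context sequence of length 1 (or that a and l are stacked into a length-2 sequence). All names here (`cross_attention`, `Wq`, `Wk`, `Wv`, the dimension `d`, the random inputs) are hypothetical and only for illustration.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context, Wq, Wk, Wv):
    """Single-head cross-attention: N query vectors attend to M context vectors."""
    Q = [matvec(Wq, q) for q in queries]
    K = [matvec(Wk, c) for c in context]
    V = [matvec(Wv, c) for c in context]
    scale = math.sqrt(len(K[0]))
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)  # one weight per context vector
        outputs.append([sum(wj * v[i] for wj, v in zip(w, V))
                        for i in range(len(V[0]))])
        weights.append(w)
    return outputs, weights

random.seed(0)
d = 4  # hypothetical feature dimension

def rand_vec():
    return [random.uniform(-1, 1) for _ in range(d)]

def rand_mat():
    return [rand_vec() for _ in range(d)]

Wq, Wk, Wv = rand_mat(), rand_mat(), rand_mat()
queries = [rand_vec() for _ in range(3)]  # stand-in for 3 image/latent tokens
a = rand_vec()                            # stand-in for the audio feature
l = rand_vec()                            # stand-in for the landmark representation

# Case 1: a single vector used as a length-1 context sequence.
out, weights = cross_attention(queries, [a], Wq, Wk, Wv)

# Case 2: a and l stacked into a length-2 context sequence.
out2, weights2 = cross_attention(queries, [a, l], Wq, Wk, Wv)
```

With a length-1 context the softmax runs over a single score, so every attention weight is exactly 1 and each query simply receives the projected value `Wv @ a`. The attention is well defined but acts as a learned projection of the condition injected at every position. With two context vectors, each query gets its own mix of the two projected values.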