How can the driven-audio feature a and the landmark representation l be used in the cross-attention module?

sstzal / DiffTalk

[CVPR2023] The implementation for "DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation"


Issue #21

Open Haoqing-Wang opened 1 year ago

Haoqing-Wang commented 1 year ago

As far as I understand, the driven-audio feature a and the landmark representation l are each a single vector, not a sequence of vectors. So how can they be used as the Key and Value in the cross-attention module?
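
To make the shape question concrete, here is a minimal NumPy sketch. It is not the DiffTalk code: the dimensions, the random features, and the idea of stacking a and l as two tokens are all assumptions made only for illustration, and the learned Q/K/V projections are omitted. It shows that with a single key/value vector the softmax over keys is trivially 1, so every query gets the same output, whereas a short sequence of condition tokens gives the attention something to weight.

```python
import numpy as np

def cross_attention(q, k, v):
    """Scaled dot-product attention. q: (Nq, d); k and v: (Nk, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])                   # (Nq, Nk)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8                                   # hypothetical feature dimension
queries = rng.normal(size=(16, d))      # hypothetical: 16 flattened image-feature tokens

a = rng.normal(size=(d,))               # stand-in for the audio feature (one vector)
l = rng.normal(size=(d,))               # stand-in for the landmark representation (one vector)

# Case 1: one condition vector treated as a length-1 key/value sequence.
kv_single = a[None, :]                  # shape (1, d)
out_single, w_single = cross_attention(queries, kv_single, kv_single)

# Case 2 (assumption): stack the two conditions as a length-2 token sequence.
kv_stacked = np.stack([a, l], axis=0)   # shape (2, d)
out_stacked, w_stacked = cross_attention(queries, kv_stacked, kv_stacked)

print(w_single.shape, w_stacked.shape)
```

In case 1 every attention weight is exactly 1, so the output is just a copied to every query position. In case 2 each query mixes a and l with its own weights. Whether the repository does something like case 2, or reshapes the conditions into more tokens, is exactly what I cannot tell.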

WoofGH commented 5 months ago

Did you understand how this works? I'm totally confused right now😭.