harlanhong / CVPR2022-DaGAN

Official code for CVPR2022 paper: Depth-Aware Generative Adversarial Network for Talking Head Video Generation
https://harlanhong.github.io/publications/dagan.html

Question about the Eqn.(9) and Fig.10 #1

Closed JialeTao closed 2 years ago

JialeTao commented 2 years ago

Hi, thanks for sharing the good work. After reading the paper, I have some confusion in understanding the attention process in equation (9).

  1. How should we understand the physical meaning of the attention? The query feature comes from the source depth map, while the key and value features come from the warped source feature. Since the depth map has a different pose from the warped feature, and, according to QKV attention, the re-represented feature should have a spatial structure similar to the query (the depth map here), how is it guaranteed that the refined feature $F_g$ has the pose of the driving image?
  2. Intuitively, features at different positions may have different relations with features at other positions; yet in Fig. 10, the attention maps from different positions look similar (i.e., both attend to the mouth and eyes). How should we understand this?
harlanhong commented 2 years ago

Thanks for your questions.

We treat the depth map as dense guidance for human face generation. During training, we do not use the depth map of the driving image because it would introduce the shape information of the driving face (which is fatal for cross-identity reenactment). Therefore, we adopt the depth map of the source image in the attention module. We regard F_w as a fusion of the source image and the motion flow. In the attention module, the source depth map should also be able to capture the information of the motion flow, so that their spatial information can be automatically aligned (we did not verify this in the paper). The reason we do not directly use the motion flow to warp the depth map is that the occluded area would become larger. We are further optimizing depth-map estimation and investigating how depth maps can further assist face-generation tasks.
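For readers following along, the QKV structure being discussed can be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: the query comes from source depth-map features and the key/value come from the warped feature F_w, with a residual connection so the output keeps F_w's driving pose. The function name and the omission of the learned projection layers are my own simplifications.

```python
import numpy as np

def depth_guided_attention(depth_feat, f_w):
    """Illustrative cross-attention: query from depth features,
    key/value from the warped source feature F_w.
    Inputs are (c, h, w) arrays; learned 1x1 projections are omitted."""
    c, h, w = f_w.shape
    q = depth_feat.reshape(c, h * w).T            # (hw, c) queries from depth
    k = f_w.reshape(c, h * w)                     # (c, hw) keys from F_w
    v = f_w.reshape(c, h * w).T                   # (hw, c) values from F_w
    scores = q @ k / np.sqrt(c)                   # (hw, hw) scaled dot product
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over key positions
    out = (attn @ v).T.reshape(c, h, w)
    # Residual connection: the refined feature stays anchored to F_w,
    # which already carries the driving pose via the motion flow.
    return f_w + out
```

Since each output position is a convex combination of F_w's values plus F_w itself, the driving pose is inherited from F_w; the depth query only modulates which positions are emphasized.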

Here, the attention module aims to capture the expression-related micro facial movements of human faces, so these results are in line with our design purpose.

Your question inspired me a lot; I could further design the attention module to make the depth map and F_w explicitly spatially aligned in some way. If you have more questions, please email me without hesitation.

JialeTao commented 2 years ago

Thanks for the reply. Although I'm still a little confused about how the depth map is implicitly aligned with the motion flow, I learned a lot from your explanation.