JialeTao closed 2 years ago
Thanks for your questions.
We treat the depth map as dense guidance for human face generation. During training, we do not use the depth map of the driving image because it would introduce the shape information of the driving face, which is fatal in the cross-identity setting. Therefore, we use the depth map of the source image in the attention module. We regard F_w as a fusion of the source image and the motion flow. In the attention module, the source depth map should also be able to capture the motion flow information, so that the two are spatially aligned automatically (we did not verify this in the paper). We do not warp the depth map directly with the motion flow because that would enlarge the occluded area. We are still improving depth map estimation and investigating how depth maps can further assist face generation tasks.
Here, the attention module aims to capture the expression-related micro-movements of human faces. Therefore, these results are in line with our design purpose.
Your question inspired me a lot. I could redesign the attention module so that the depth map and F_w are explicitly aligned in space. If you have more questions, please feel free to email me.
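For readers following the thread, here is a minimal NumPy sketch of the kind of attention described above, assuming the queries come from features of the source depth map and the keys and values come from F_w. The function name, the projection matrices `w_q`, `w_k`, `w_v`, and the `1/sqrt(d)` scaling are assumptions made for illustration. They are not taken from the paper or the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depth_guided_attention(depth_feat, warped_feat, w_q, w_k, w_v):
    """Cross-attention sketch (hypothetical helper, not the authors' code).

    depth_feat:  (C_d, H, W) features of the SOURCE depth map -> queries
    warped_feat: (C, H, W)   warped source features F_w       -> keys, values
    w_q: (C_d, D), w_k: (C, D), w_v: (C, C_out) projection matrices
    """
    c_d, h, w = depth_feat.shape
    c = warped_feat.shape[0]
    d_tokens = depth_feat.reshape(c_d, h * w).T   # (HW, C_d)
    f_tokens = warped_feat.reshape(c, h * w).T    # (HW, C)

    q = d_tokens @ w_q                            # (HW, D)
    k = f_tokens @ w_k                            # (HW, D)
    v = f_tokens @ w_v                            # (HW, C_out)

    # Assumption: scaled dot-product; the scaling factor is added here
    # only for numerical stability.
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)  # (HW, HW)
    out = attn @ v                                # (HW, C_out)
    return out.T.reshape(-1, h, w), attn

rng = np.random.default_rng(0)
depth_feat = rng.normal(size=(4, 3, 3))
warped_feat = rng.normal(size=(8, 3, 3))
w_q = rng.normal(size=(4, 5))
w_k = rng.normal(size=(8, 5))
w_v = rng.normal(size=(8, 8))
refined, attn = depth_guided_attention(depth_feat, warped_feat, w_q, w_k, w_v)
```

Each row of `attn` weights every position of F_w for one depth-map position. In this sketch, that is the only place where the unwarped source depth and the warped features can become aligned, and the alignment is implicit.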
Thanks for the reply. I'm still a little confused about how the depth map is implicitly aligned with the motion flow, but I learned a lot from your explanation.
Hi, thanks for sharing the good work. After reading the paper, I am somewhat confused about the attention process in equation (9).