imlixinyang / HiSD

Official pytorch implementation of paper "Image-to-image Translation via Hierarchical Style Disentanglement" (CVPR 2021 Oral).

Questions about the attention map #19

HelenMao commented 3 years ago

Hi, I am trying your model on the AFHQ dataset and find that it preserves the background of the source image very well. I think this is thanks to the attention map: when I visualize it, I find that it does learn the mask.
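
For reference, here is a minimal sketch of how I visualize the mask (the mask shape and how it is taken from the translator are assumptions on my side, not HiSD's actual API):

```python
import torch
from torchvision.utils import save_image

def visualize_mask(mask: torch.Tensor, path: str = "mask.png") -> None:
    # mask: [B, C, H, W] attention map grabbed from the translator (assumed shape).
    # Collapse the channel dimension into a single spatial heatmap.
    heatmap = mask.mean(dim=1, keepdim=True)          # [B, 1, H, W]
    # Normalize each image to [0, 1] for display.
    flat = heatmap.flatten(1)
    lo = flat.min(dim=1).values.view(-1, 1, 1, 1)
    hi = flat.max(dim=1).values.view(-1, 1, 1, 1)
    heatmap = (heatmap - lo) / (hi - lo + 1e-8)
    save_image(heatmap, path)
```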

However, when I copy the attention module into my own framework (this paper), it does not work at all and fails to learn the mask. The main differences between my framework and yours are the use of the mapping network and the absence of a KL/MMD-related loss between the random noise distribution and the reference encoder's embedding distribution (I also tried plugging your generator directly into my framework, and it fails to learn the mask too).

I am wondering whether you have any insights from designing the attention map. Under what conditions do you think it can learn the mask? It would be really great if you could share your experience with me. Thanks a lot!

Looking forward to your reply!

imlixinyang commented 3 years ago

I've also tried the AFHQ dataset and found that HiSD focuses only on manipulating the shape while maintaining the background and color, which will be presented in the camera-ready supplementary material.

I think there are several key reasons why HiSD succeeds in learning the mask without any extra objective:

1. a separate translator for each tag or semantic;
2. no diversification loss;
3. applying the mask to the features rather than the image, which means that both the channel-wise and spatial-wise dimensions are important (see the sketch below).
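
A minimal sketch of point 3, i.e. gating features instead of pixels. The names and shapes here are illustrative, not HiSD's actual code:

```python
import torch
import torch.nn as nn

class FeatureMasking(nn.Module):
    """Blend translated features into source features with a learned mask."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict a full [C, H, W] mask, so the gate is both channel-wise
        # and spatial-wise (a [1, H, W] mask would be spatial-only).
        self.to_mask = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor, translated: torch.Tensor) -> torch.Tensor:
        mask = self.to_mask(translated)               # [B, C, H, W]
        # Regions where mask is near 0 keep the source features,
        # e.g. the background.
        return mask * translated + (1 - mask) * feat

# Usage with dummy features:
blend = FeatureMasking(channels=256)
f_src = torch.randn(1, 256, 32, 32)    # encoder features of the source
f_trans = torch.randn(1, 256, 32, 32)  # features after the translator
out = blend(f_src, f_trans)            # [1, 256, 32, 32]
```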

In previous works, an extra regularization objective is always needed; I think the reason is that a spatial-wise-only mask is hard for the generator to learn.

HelenMao commented 3 years ago

I think the separate-translator point may not matter here, since I use only one tag when running the AFHQ dataset.

The diversification loss may have some influence, and I need to do more experiments.
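
For context, the kind of diversification loss in question usually takes a mode-seeking form (e.g. as in MSGAN); a hedged sketch, not HiSD's or my exact code:

```python
import torch

def diversification_loss(fake1: torch.Tensor, fake2: torch.Tensor,
                         z1: torch.Tensor, z2: torch.Tensor,
                         eps: float = 1e-8) -> torch.Tensor:
    # Mode-seeking term: push the two outputs apart in proportion to the
    # distance between the latent codes that produced them.
    num = torch.mean(torch.abs(fake1 - fake2))
    den = torch.mean(torch.abs(z1 - z2)) + eps
    return -num / den  # minimized during training, so diversity is maximized
```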

I directly copied your generator (including both the translator and the decoder) into my own framework, and it does use both channel-wise and spatial-wise attention maps. However, it still cannot learn the mask, so I think that may not be the main reason either.