The design of learning an unsupervised mask is not new in the image-to-image translation area. This design encourages the translation to focus on specific regions of the image rather than the whole image. You can find other works that use this design in Sec. 3.3 of the paper.

After encoding, the image feature has size c×h×w, where c is the number of channels. In our experiments, we found that using only a spatial-wise mask (of size h×w) failed to help our disentanglement (see the ablation study in Sec. 4.3). Therefore, we make the mask channel-wise as well (of size c×h×w). We think this is because the encoder has not encoded the image into a highly disentangled feature space (one where the feature of each tag would be separated into specific channels). You can also change the code (class Translator in core/networks.py) if you want to try other settings.
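If it helps, here is a minimal, hypothetical PyTorch sketch of the idea (this is not the exact code of class Translator in core/networks.py, just an illustration): a mask m is predicted from the encoded feature and used to blend an edited feature with the original one. Setting the mask head's output to a single channel gives a spatial-wise-only mask of size h×w, while matching the feature channels gives the combined channel-and-spatial-wise mask of size c×h×w. The class and argument names below are made up for the example.

```python
import torch
import torch.nn as nn

class MaskedBlend(nn.Module):
    """Hypothetical sketch of an unsupervised attention mask applied to
    encoded features of shape (B, C, H, W)."""

    def __init__(self, channels, spatial_only=False):
        super().__init__()
        # 1 output channel -> spatial-wise mask (H x W, broadcast over channels)
        # C output channels -> channel-and-spatial-wise mask (C x H x W)
        out_ch = 1 if spatial_only else channels
        self.mask_head = nn.Sequential(
            nn.Conv2d(channels, out_ch, kernel_size=3, padding=1),
            nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, feat, edited_feat):
        m = self.mask_head(feat)
        # keep the original feature where m is small, use the edited one where m is large
        return m * edited_feat + (1 - m) * feat

# usage: blend = MaskedBlend(channels=256, spatial_only=False)
# out = blend(feat, edited_feat)   # feat, edited_feat: (B, 256, H, W)
```

With spatial_only=True this reduces to the h×w mask that, in our ablation, was not enough for disentanglement.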
Thank you
Hi, thanks for your beautiful work. I want to know the reasoning behind the design of the m and f in the translator. Is there any reference work, or did you design this by experiment? Also, your paper mentions "The attention mask in our translator is both spatial-wise and channel-wise." Could you explain this in more detail?