Open hubert0527 opened 3 years ago
Thanks for the question -- I'll speak to (1) a bit. The act of subsampling the feature map (not blurring!) already commits you to either removing or misrepresenting high-frequency information. Without blurring, you are misrepresenting it. That's what aliasing is: high-frequency information gets entangled into the low frequencies. Blurring says you would rather not represent the information at all than actively misrepresent it.
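This is easy to see in a toy 1-D sketch (my own illustration, not from the SwappingAutoencoder code): the highest representable frequency, `cos(pi * n)`, alternates +1, -1, ... ; naive stride-2 subsampling makes it look like a constant signal (aliasing), while blurring first removes it cleanly.

```python
import numpy as np

# Highest representable frequency: alternates +1, -1, +1, -1, ...
n = np.arange(16)
x = np.cos(np.pi * n)

# Naive stride-2 subsampling keeps every other sample, so the alternating
# signal is misrepresented as a constant (DC) signal -- classic aliasing.
naive = x[::2]                   # all ones

# Blurring first (a [0.5, 0.5] box filter) removes the frequency that the
# subsampled grid cannot represent: the information is removed, not faked.
blurred = 0.5 * (x[:-1] + x[1:])
antialiased = blurred[::2]       # all (near) zeros

print(naive)
print(antialiased)
```

The point: the subsampled outputs differ completely depending on whether you blur, even though both throw away the same amount of information in a counting sense.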
Thanks for the comments! Now I understand the intuition! However, regarding the subsampling part, SwappingAutoencoder uses strided convolutions to reduce the spatial dimension in the encoder. Since these are learnable and can represent a mix of high-pass and low-pass filters, I'm not sure it is safe to say that the high-frequency information is guaranteed to be either removed or misrepresented[1]. In contrast, a blur operation is a deterministic low-pass filter that guarantees some information is eliminated.
[1] Consider that the input/output images are discrete and finite (i.e., 0-255 in uint8, then normalized to [-1, 1] in float32), while the intermediate features are discrete (but much finer-grained than image colors) and effectively unbounded in float32; the cardinality of the intermediate feature space is therefore much larger than that of the input/output images. It is hard to rule out the possibility that the encoder still preserves high-frequency information in some form, although it is also empirically known that existing autoencoders are still far from perfect reconstruction.
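The footnote's point can be made concrete with a toy 1-D example (my own, hypothetical): a stride-2 "convolution" whose two learned taps happen to be delta filters implements space-to-depth, which loses no information at all, high frequencies included.

```python
import numpy as np

# A stride-2 conv with 2 output channels and delta-filter taps [1, 0] and
# [0, 1] is exactly space-to-depth: spatial resolution halves, but every
# sample survives in some channel.
x = np.random.default_rng(0).normal(size=16)

even = x[::2]   # channel 0: tap [1, 0] applied with stride 2
odd  = x[1::2]  # channel 1: tap [0, 1] applied with stride 2

# The original signal is exactly recoverable by interleaving the channels,
# so a learnable strided conv need not discard high-frequency content.
recon = np.empty_like(x)
recon[::2], recon[1::2] = even, odd
assert np.array_equal(recon, x)
```

Whether a trained encoder actually learns such an information-preserving filter bank is a separate, empirical question.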
Note that the blur happens after the conv-ReLU feature extractor (which is free to learn high- or low-frequency filters), immediately before the subsampling (which would otherwise cause aliasing).
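That ordering can be sketched as follows (a toy 1-D NumPy version with a binomial filter standing in for StyleGAN2's [1, 2, 1] blur; the function names are mine, not from the codebase):

```python
import numpy as np

def blur1d(x, kernel=(0.25, 0.5, 0.25)):
    """Binomial blur, a 1-D stand-in for StyleGAN2's [1, 2, 1] filter."""
    k0, k1, k2 = kernel
    xp = np.pad(x, 1, mode="wrap")          # circular padding for simplicity
    return k0 * xp[:-2] + k1 * xp[1:-1] + k2 * xp[2:]

def blurred_downsample(features):
    # Order matters: blur AFTER the learned conv-ReLU features are computed,
    # but BEFORE the stride-2 subsampling that would otherwise alias them.
    return blur1d(features)[::2]

# Worst-case "feature map": an alternating (Nyquist-frequency) response that
# a learned filter might well produce; blurring first removes it instead of
# letting it alias into low frequencies.
feat = np.cos(np.pi * np.arange(16))
print(blurred_downsample(feat))
```

So the blur never prevents the conv layers from responding to high frequencies; it only controls what happens to those responses at the moment of subsampling.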
Really impressive work and a high-quality code release! I found several intriguing design choices while digging into the codebase, and I'm looking for some clarification or explanation of them:
1. The encoder architecture seems to partially borrow StyleGAN2's design, with blur operations in the conv layers (I suppose for anti-aliasing). However, the blur operations also wipe out some of the high-frequency information, which should be crucial for reconstructing details. Although high-frequency content is later infused via randomized noise injection in the decoder, the result can never be a faithful reconstruction of the input. It seems to me that reconstruction should matter more than anti-aliasing here. Could you clarify this design choice a bit?
2. Similar to 1., the randomized noise injection in the decoder carries no information from the input image, so it should negatively affect reconstruction quality. It seems a bit counter-intuitive to me in terms of image reconstruction.
Sincerely sorry for the excessively long questions, and looking forward to your answers!