Closed. woctezuma closed this issue 4 years ago.
Limitations of W(18,*) compared to W(1,*) are mentioned in:
Encoder output can have high visual quality, but bad semantics. The W(18, 512) projector, for example, explores W-space so well that its output, beyond a certain number of iterations, becomes meaningless. It is so far from w_avg that the usual applications -- interpolate with dlatents, apply direction vectors obtained from samples, etc. -- won't work as expected. For comparison: the mean semantic quality of Z -> W(1, 512) dlatents is 0.44. 2 is okay (1000 iterations with the W(18, 512) projector). 4 is bad (5000 iterations with the W(18, 512) projector).
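The "semantic quality" score is not defined in the thread itself; a crude stand-in consistent with the discussion is the distance of a dlatent from w_avg. A minimal NumPy sketch, where w_avg, w_sampled, and w_projected are all hypothetical random stand-ins rather than values from a real network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: w_avg is the mapping network's mean dlatent,
# w_sampled is a typical Z -> W(1, 512) mapping, and w_projected is a
# W(18, 512) dlatent after many projector iterations (drifted far away).
w_avg = rng.normal(size=(1, 512))
w_sampled = w_avg + 0.5 * rng.normal(size=(1, 512))
w_projected = np.tile(w_avg, (18, 1)) + 3.0 * rng.normal(size=(18, 512))

def distance_from_w_avg(dlatent, w_avg):
    """Mean Euclidean distance of each 512-dim row from w_avg.

    A rough proxy for the thread's notion of bad semantics: the further
    a dlatent drifts from w_avg, the less the usual edits work.
    """
    return float(np.linalg.norm(dlatent - w_avg, axis=1).mean())

d_sampled = distance_from_w_avg(w_sampled, w_avg)
d_projected = distance_from_w_avg(w_projected, w_avg)
```

With these stand-ins, the long-run projection ends up much further from w_avg than a sampled dlatent, mirroring the 0.44 vs. 4 gap described above.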
On the left is a Z -> W(1, 512) face, ψ=0.75, with a semantics score of 0.28. On the right is the same face projected into W(18, 512), it=5000, with a score of 3.36. They both transition along the same "surprise" vector. On the left, this looks gimmicky, but visually okay. On the right, you have to multiply the vector by 10 to achieve a comparable amount of change, which leads to obvious artifacts. As long as you obtain your vectors from annotated Z -> W(1, 512) samples, you're going to run into this problem.
Should you just try to adjust your vectors more cleverly, or find better ones? My understanding is that this won't work, and that there is no outer W-space where you can smoothly interpolate between all the cool projections that are missing from the regular inner W-space mappings. (Simplified: Z is a unit vector, a point on a 512-D sphere. Ideally, young-old would be the north-south pole, male-female the east-west pole, smile-unsmile the front-back pole, and so on. W(1, 512) is a learned deformation of that surface that accounts for the uneven distribution of features in FFHQ. W(18, 512) is a synthesizer option that allows for style mixing and can be abused for projection. But all the semantics of StyleGAN reside in W(1, 512). W(18, 512) vectors filled with 18 Z -> W(1, 512) mappings already belong to a different species. High-quality projections are paintings of faces.)
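The direction-vector edits discussed above can be sketched in NumPy. Here w and the "surprise" direction are random placeholders, not vectors obtained from annotated samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values: w is a Z -> W(1, 512) dlatent and "surprise" is a
# semantic direction vector (both made up here for illustration).
w = rng.normal(size=(1, 512))
surprise = rng.normal(size=(1, 512))
surprise /= np.linalg.norm(surprise)  # unit-length direction

def apply_direction(dlatent, direction, strength):
    """Move a dlatent along a semantic direction.

    Works on [1, 512] dlatents and broadcasts over [18, 512] ones, but
    (as the thread notes) projected [18, 512] dlatents may need a much
    larger strength for a comparable change, which causes artifacts.
    """
    return dlatent + strength * direction

w_edited = apply_direction(w, surprise, 2.0)

w18 = np.tile(w, (18, 1))                         # same face, per-layer copies
w18_edited = apply_direction(w18, surprise, 2.0)  # broadcasts over all rows
```

The edit itself is only a vector addition; the thread's point is that the same strength produces a much smaller visual change on far-from-w_avg projections.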
Single-layer mixing of a face with the projected "Mona Lisa" face using W(18,*). Artifacts appear when using the projection result after 5000 iterations, compared to the result after 1000 iterations.
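Single-layer mixing as described can be sketched as follows; the face and mona_lisa dlatents here are random placeholders standing in for a sampled face and a projection result:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dlatents: a sampled face (18 identical rows) and a
# projected "Mona Lisa" (18 freely optimized rows).
face = np.tile(rng.normal(size=(1, 512)), (18, 1))
mona_lisa = rng.normal(size=(18, 512))

def mix_single_layer(base, donor, layer):
    """Copy one synthesis layer's style from donor into base.

    Low layer indices control coarse structure, high indices fine
    detail; mixing a single layer transfers only that scale.
    """
    mixed = base.copy()
    mixed[layer] = donor[layer]
    return mixed

mixed = mix_single_layer(face, mona_lisa, layer=4)
```

All other rows of the base face are untouched, which is what makes per-layer mixing a useful probe of what each projection actually encodes.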
An interesting paper (Image2StyleGAN) is mentioned here.
Related issues: https://github.com/pbaylies/stylegan-encoder/issues/1 https://github.com/pbaylies/stylegan-encoder/issues/2 https://github.com/Puzer/stylegan-encoder/issues/6
Limitations mentioned there:
When experimenting with the net, I've noticed StyleGAN behaves much better when it comes to interpolation & mixing if you "play by the rules", e.g., use a single 1x512 dlatent vector to represent your target image. With 18x512, we're kind of cheating. In fact, Image2StyleGAN shows that you can encode images this way on a completely randomly initialized net! (Although interpolation is pretty meaningless in this instance.)
Why limit the encoded latent vectors to shape [1, 512] rather than [18, 512]?
- The mapping network of the original StyleGAN outputs [1, 512] latent vectors, suggesting that the reconstructed images may better resemble the natural outputs of the StyleGAN network.
- Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? (Abdal, Qin & Wonka 2019) demonstrated that use of the full [18, 512] latent space allows all manner of images to be reproduced by the pretrained StyleGAN network, even images highly dissimilar to training data, perhaps suggesting that the accuracy of the encoded images more reflects the amount of freedom afforded by the expanded latent vector than the domain expertise of the network.
My goal with this encoder is to be able to encode faces well into the latent space of the model; by constraining the total output to [1, 512], it's tougher to get a very realistic face without artifacts. Because the dlatents run from coarse to fine, it's possible to mix them for more variation and finer control over details, which NVIDIA does in the original paper. In my experience, an encoder trained like this does a good job of generating smooth and realistic faces, with fewer artifacts than the original generator.
I am open to having [1, 512] as an option for building a model, but not as the only option, because I don't believe it will ultimately perform as well for encoding as using the entire latent space -- but it will surely train faster!
It is actually discussed in the StyleGAN2 paper, since this trick to increase image quality had already been used by some people with StyleGAN1.
Also, in the StyleGAN1 repository:
The dlatents array stores a separate copy of the same w vector for each layer of the synthesis network to facilitate style mixing.
https://github.com/NVlabs/stylegan#using-pre-trained-networks
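That per-layer copying can be checked directly. A small sketch (with random stand-in arrays) of what a "played by the rules" dlatents array looks like, versus what a free W(18, 512) projector can produce:

```python
import numpy as np

def is_vanilla(dlatents, atol=1e-6):
    """True if every synthesis layer shares the same w vector, i.e. a
    tiled Z -> W(1, 512) dlatent as stored by the StyleGAN repository."""
    return bool(np.allclose(dlatents, dlatents[:1], atol=atol))

# Hypothetical examples: a tiled dlatent vs. a freely optimized one.
rng = np.random.default_rng(3)
w = rng.normal(size=(1, 512))
tiled = np.tile(w, (18, 1))             # separate copy of the same w per layer
projected = rng.normal(size=(18, 512))  # rows differ: a W(18, 512) projection
```

A check like this is a quick way to tell whether a saved dlatents file came from the mapping network or from an unconstrained projector.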
Relevant paper discussing the trade-off between visual quality and semantic quality:
Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., & Cohen-Or, D. (2021). Designing an Encoder for StyleGAN Image Manipulation. arXiv preprint arXiv:2102.02766. https://arxiv.org/abs/2102.02766
It is written by the people behind https://github.com/orpatashnik/StyleCLIP
Semantic quality is called "editability". Visual quality is divided into two elements: distortion (how faithfully the input is reconstructed) and perceptual quality.
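One way to picture that trade-off, loosely following the paper's idea of keeping the 18 style vectors close to a single w (this is only a sketch of the principle with made-up arrays, not the paper's actual loss):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical encoder output: a base style plus small per-layer offsets.
base = rng.normal(size=(1, 512))
offsets = 0.1 * rng.normal(size=(18, 512))
dlatents = base + offsets

def delta_regularizer(dlatents):
    """L2 penalty on per-layer deviations from the first style vector.

    Keeping the 18 rows close to a single w preserves editability,
    at some cost in distortion (the fit to the target image).
    """
    deltas = dlatents - dlatents[:1]
    return float((deltas ** 2).sum())
```

A perfectly tiled dlatent pays zero penalty; the larger the per-layer offsets, the better the reconstruction can get, and the worse the edits tend to behave.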
Tiled mode for the projector seems to lead to better fits to real images than Nvidia's original projector. It consists of optimizing all 18 rows of the latent independently instead of just the first one, and I think you were the first to introduce this change.
Would you mind explaining the idea behind it and why it works?
And whether it has limitations depending on what we want to do with the latent? For instance, is it fitting noise?
It is mentioned in this pull request: https://github.com/rolux/stylegan2encoder/pull/9
And this short commit which showcases the vanilla mode and the tiled mode: https://github.com/kreativai/stylegan2encoder/commit/2036fb85035aaa828bc762058fd495076eb6554f
And this longer commit of yours: https://github.com/rolux/stylegan2encoder/commit/bc3face1b10a4c31340e32be439be6159bd12620#diff-bc58d315f42a097b984deff88b4698b5
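A toy least-squares analogy for why tiled mode fits better: the real projector minimizes an image-space perceptual loss, but the extra degrees of freedom can be illustrated with a direct fit to a random target (everything here is a made-up stand-in):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy stand-in for projection: fit a target [18, 512] dlatent directly.
target = rng.normal(size=(18, 512))

# Vanilla mode: a single [1, 512] variable, tiled across all 18 layers.
# Its least-squares optimum is the mean of the target's rows.
w_vanilla = target.mean(axis=0, keepdims=True)
fit_vanilla = np.tile(w_vanilla, (18, 1))

# Tiled mode: a free [18, 512] variable can match the target exactly.
fit_tiled = target.copy()

err_vanilla = float(((fit_vanilla - target) ** 2).mean())
err_tiled = float(((fit_tiled - target) ** 2).mean())
```

The tiled fit drives the error to zero while the vanilla fit cannot; in the real setting that same freedom is what lets the projector match arbitrary images, and also what lets it drift away from w_avg and fit noise.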