Closed. woctezuma closed this issue 4 years ago.
Limitations of W(18,*) compared to W(1,*) are mentioned in:
Encoder output can have high visual quality, but bad semantics. The W(18, 512) projector, for example, explores W-space so well that its output, beyond a certain number of iterations, becomes meaningless. It is so far from w_avg that the usual applications -- interpolate with dlatents, apply direction vectors obtained from samples, etc. -- won't work as expected. For comparison: the mean semantic quality of Z -> W(1, 512) dlatents is 0.44. 2 is okay (1000 iterations with the W(18, 512) projector). 4 is bad (5000 iterations with the W(18, 512) projector).
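The "semantic quality" score is not defined in the thread itself; a crude stand-in consistent with the discussion is the distance of a dlatent from w_avg. A minimal NumPy sketch, where w_avg, w_sampled, and w_projected are all hypothetical random stand-ins rather than values from a real network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: w_avg is the mapping network's mean dlatent,
# w_sampled is a typical Z -> W(1, 512) mapping, and w_projected is a
# W(18, 512) dlatent after many projector iterations (drifted far away).
w_avg = rng.normal(size=(1, 512))
w_sampled = w_avg + 0.5 * rng.normal(size=(1, 512))
w_projected = np.tile(w_avg, (18, 1)) + 3.0 * rng.normal(size=(18, 512))

def distance_from_w_avg(dlatent, w_avg):
    """Mean Euclidean distance of each 512-dim row from w_avg.

    A rough proxy for the thread's notion of bad semantics: the further
    a dlatent drifts from w_avg, the less the usual edits work.
    """
    return float(np.linalg.norm(dlatent - w_avg, axis=1).mean())

d_sampled = distance_from_w_avg(w_sampled, w_avg)
d_projected = distance_from_w_avg(w_projected, w_avg)
```

With these stand-ins, the long-run projection ends up much further from w_avg than a sampled dlatent, mirroring the 0.44 vs. 4 gap described above.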
On the left is a Z -> W(1, 512) face, ψ=0.75, with a semantics score of 0.28. On the right is the same face projected into W(18, 512), it=5000, with a score of 3.36. They both transition along the same "surprise" vector. On the left, this looks gimmicky, but visually okay. On the right, you have to multiply the vector by 10 to achieve a comparable amount of change, which leads to obvious artifacts. As long as you obtain your vectors from annotated Z -> W(1, 512) samples, you're going to run into this problem.
Should you just try to adjust your vectors more cleverly, or find better ones? My understanding is that this won't work, and that there is no outer W-space where you can smoothly interpolate between all the cool projections that are missing from the regular inner W-space mappings. (Simplified: Z is a unit vector, a point on a 512-D sphere. Ideally, young-old would be the north-south pole, male-female the east-west pole, smile-unsmile the front-back pole, and so on. W(1, 512) is a learned deformation of that surface that accounts for the uneven distribution of features in FFHQ. W(18, 512) is a synthesizer option that allows for style mixing and can be abused for projection. But all the semantics of StyleGAN reside in W(1, 512). W(18, 512) vectors filled with 18 Z -> W(1, 512) mappings already belong to a different species. High-quality projections are paintings of faces.)
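The direction-vector edits discussed above can be sketched in NumPy. Here w and the "surprise" direction are random placeholders, not vectors obtained from annotated samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values: w is a Z -> W(1, 512) dlatent and "surprise" is a
# semantic direction vector (both made up here for illustration).
w = rng.normal(size=(1, 512))
surprise = rng.normal(size=(1, 512))
surprise /= np.linalg.norm(surprise)  # unit-length direction

def apply_direction(dlatent, direction, strength):
    """Move a dlatent along a semantic direction.

    Works on [1, 512] dlatents and broadcasts over [18, 512] ones, but
    (as the thread notes) projected [18, 512] dlatents may need a much
    larger strength for a comparable change, which causes artifacts.
    """
    return dlatent + strength * direction

w_edited = apply_direction(w, surprise, 2.0)

w18 = np.tile(w, (18, 1))                         # same face, per-layer copies
w18_edited = apply_direction(w18, surprise, 2.0)  # broadcasts over all rows
```

The edit itself is only a vector addition; the thread's point is that the same strength produces a much smaller visual change on far-from-w_avg projections.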
Single-layer mixing of a face with the projected "Mona Lisa" face using W(18,*). Artifacts appear when using the projection result after 5000 iterations, compared to the result after 1000 iterations.
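Single-layer mixing as described can be sketched as follows; the face and mona_lisa dlatents here are random placeholders standing in for a sampled face and a projection result:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dlatents: a sampled face (18 identical rows) and a
# projected "Mona Lisa" (18 freely optimized rows).
face = np.tile(rng.normal(size=(1, 512)), (18, 1))
mona_lisa = rng.normal(size=(18, 512))

def mix_single_layer(base, donor, layer):
    """Copy one synthesis layer's style from donor into base.

    Low layer indices control coarse structure, high indices fine
    detail; mixing a single layer transfers only that scale.
    """
    mixed = base.copy()
    mixed[layer] = donor[layer]
    return mixed

mixed = mix_single_layer(face, mona_lisa, layer=4)
```

All other rows of the base face are untouched, which is what makes per-layer mixing a useful probe of what each projection actually encodes.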
An interesting paper (Image2StyleGAN) is mentioned here.
Related issues: https://github.com/pbaylies/stylegan-encoder/issues/1 https://github.com/pbaylies/stylegan-encoder/issues/2 https://github.com/Puzer/stylegan-encoder/issues/6
Limitations mentioned there:
When experimenting with the net, I've noticed StyleGAN behaves much better when it comes to interpolation & mixing if you "play by the rules", e.g., use a single 1x512 dlatent vector to represent your target image. With 18x512, we're kind of cheating. In fact, Image2StyleGAN shows that you can encode images this way on a completely randomly initialized net! (Although interpolation is pretty meaningless in this instance.)
Why limit the encoded latent vectors to shape [1, 512] rather than [18, 512]?
- The mapping network of the original StyleGAN outputs [1, 512] latent vectors, suggesting that the reconstructed images may better resemble the natural outputs of the StyleGAN network.
- Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? (Abdal, Qin & Wonka 2019) demonstrated that use of the full [18, 512] latent space allows all manner of images to be reproduced by the pretrained StyleGAN network, even images highly dissimilar to training data, perhaps suggesting that the accuracy of the encoded images more reflects the amount of freedom afforded by the expanded latent vector than the domain expertise of the network.
My goal with this encoder is to be able to encode faces well into the latent space of the model; by constraining the total output to [1, 512], it's tougher to get a very realistic face without artifacts. Because the dlatents run from coarse to fine, it's possible to mix them for more variation and finer control over details, which NVIDIA does in the original paper. In my experience, an encoder trained like this does a good job of generating smooth and realistic faces, with fewer artifacts than the original generator.
I am open to having [1, 512] as an option for building a model, but not as the only option, because I don't believe it will ultimately perform as well for encoding as using the entire latent space -- but it will surely train faster!
It is actually discussed in the StyleGAN2 paper, since this trick to increase image quality had already been used by some people with StyleGAN1.
Also, in the StyleGAN1 repository:
The dlatents array stores a separate copy of the same w vector for each layer of the synthesis network to facilitate style mixing.
https://github.com/NVlabs/stylegan#using-pre-trained-networks
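That per-layer copying can be checked directly. A small sketch (with random stand-in arrays) of what a "played by the rules" dlatents array looks like, versus what a free W(18, 512) projector can produce:

```python
import numpy as np

def is_vanilla(dlatents, atol=1e-6):
    """True if every synthesis layer shares the same w vector, i.e. a
    tiled Z -> W(1, 512) dlatent as stored by the StyleGAN repository."""
    return bool(np.allclose(dlatents, dlatents[:1], atol=atol))

# Hypothetical examples: a tiled dlatent vs. a freely optimized one.
rng = np.random.default_rng(3)
w = rng.normal(size=(1, 512))
tiled = np.tile(w, (18, 1))             # separate copy of the same w per layer
projected = rng.normal(size=(18, 512))  # rows differ: a W(18, 512) projection
```

A check like this is a quick way to tell whether a saved dlatents file came from the mapping network or from an unconstrained projector.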
Relevant paper discussing the trade-off between visual quality and semantic quality:
Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., & Cohen-Or, D. (2021). Designing an Encoder for StyleGAN Image Manipulation. arXiv preprint arXiv:2102.02766. https://arxiv.org/abs/2102.02766
It is written by the people behind https://github.com/orpatashnik/StyleCLIP
Semantic quality is called "editability". Visual quality is divided into two elements: distortion (how faithfully the input is reconstructed) and perceptual quality.
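One way to picture that trade-off, loosely following the paper's idea of keeping the 18 style vectors close to a single w (this is only a sketch of the principle with made-up arrays, not the paper's actual loss):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical encoder output: a base style plus small per-layer offsets.
base = rng.normal(size=(1, 512))
offsets = 0.1 * rng.normal(size=(18, 512))
dlatents = base + offsets

def delta_regularizer(dlatents):
    """L2 penalty on per-layer deviations from the first style vector.

    Keeping the 18 rows close to a single w preserves editability,
    at some cost in distortion (the fit to the target image).
    """
    deltas = dlatents - dlatents[:1]
    return float((deltas ** 2).sum())
```

A perfectly tiled dlatent pays zero penalty; the larger the per-layer offsets, the better the reconstruction can get, and the worse the edits tend to behave.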
Tiled mode for the projector seems to lead to better fits to real images than Nvidia's original projector. It consists of optimizing all 18 rows of the latent independently instead of just the first one, and I think you were the first to introduce this change.
Would you mind explaining the idea behind it and why it works?
And whether it has limitations depending on what we want to do with the latent? For instance, is it fitting noise?
It is mentioned in this pull request: https://github.com/rolux/stylegan2encoder/pull/9
And this short commit which showcases the vanilla mode and the tiled mode: https://github.com/kreativai/stylegan2encoder/commit/2036fb85035aaa828bc762058fd495076eb6554f
And this longer commit of yours: https://github.com/rolux/stylegan2encoder/commit/bc3face1b10a4c31340e32be439be6159bd12620#diff-bc58d315f42a097b984deff88b4698b5
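A toy least-squares analogy for why tiled mode fits better: the real projector minimizes an image-space perceptual loss, but the extra degrees of freedom can be illustrated with a direct fit to a random target (everything here is a made-up stand-in):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy stand-in for projection: fit a target [18, 512] dlatent directly.
target = rng.normal(size=(18, 512))

# Vanilla mode: a single [1, 512] variable, tiled across all 18 layers.
# Its least-squares optimum is the mean of the target's rows.
w_vanilla = target.mean(axis=0, keepdims=True)
fit_vanilla = np.tile(w_vanilla, (18, 1))

# Tiled mode: a free [18, 512] variable can match the target exactly.
fit_tiled = target.copy()

err_vanilla = float(((fit_vanilla - target) ** 2).mean())
err_tiled = float(((fit_tiled - target) ** 2).mean())
```

The tiled fit drives the error to zero while the vanilla fit cannot; in the real setting that same freedom is what lets the projector match arbitrary images, and also what lets it drift away from w_avg and fit noise.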