omertov / encoder4editing

Official implementation of "Designing an Encoder for StyleGAN Image Manipulation" (SIGGRAPH 2021) https://arxiv.org/abs/2102.02766
MIT License

Applying encoder for StyleSpace #41

Closed markduon closed 3 years ago

markduon commented 3 years ago

Hi sir,

I recently read a paper called StyleSpace. 1) Can we apply the GAN inversion technique to invert an image into the StyleSpace latent space? It seems this is the S space (not W or W+). 2) Should I retrain the pSp model with the StyleSpace generator, since the StyleSpace generator seems a little different from the StyleGAN2 generator?

omertov commented 3 years ago

Hi @duongquangvinh!

  1. I believe it is possible to train an encoder into the StyleSpace S, but it will take some modifications to the encoder architecture, as the S latent space is larger than the (18, 512) W+ space.
  2. I think it will be better to use the StyleSpace generator, which may be an exact copy of the official FFHQ (for example) SG2 generator with only small API changes. As for the pSp model (whether it contains the e4e encoder or not), retraining will be needed since you are changing the target latent space (from W+ to S).

Another option to obtain a representation in the style space is to first obtain an (18, 512) W+ code using any pretrained encoder (ReStyle, e4e, pSp, etc.) or optimization, and then extract the StyleSpace representation during a forward pass of the generator with the obtained W+ code (these are the relevant activations, as described in the StyleSpace paper).
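To illustrate the second option: each StyleSpace code s_i is the output of the generator's learned per-layer affine ("modulation") transform applied to the corresponding w_i in the W+ code. The sketch below mimics this with plain NumPy and random affine weights; the function name, layer sizes, and weights are all hypothetical (a real extraction would read the generator's actual modulation layers, e.g. via forward hooks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-layer affine transforms A_i = (weight, bias).
# In a real StyleGAN2 these are the learned modulation FC layers;
# the output sizes here are illustrative, not the real channel counts.
layer_dims = [512, 512, 256]
affines = [(rng.standard_normal((d, 512)) * 0.01, np.ones(d)) for d in layer_dims]

def w_plus_to_stylespace(w_plus, affines):
    """Map a (num_layers, 512) W+ code to a list of StyleSpace codes s_i.

    Each s_i = A_i(w_i) = W_i @ w_i + b_i, i.e. the activation that the
    generator's i-th modulation layer would produce on a forward pass.
    """
    return [W @ w + b for (W, b), w in zip(affines, w_plus)]

w_plus = rng.standard_normal((len(layer_dims), 512))
styles = w_plus_to_stylespace(w_plus, affines)
print([s.shape for s in styles])  # [(512,), (512,), (256,)]
```

The point is that no retraining is needed for this route: the S code is fully determined by the W+ code and the generator's frozen affine layers.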

Hope it helps, Best, Omer

markduon commented 3 years ago

Thank you so much!

I will read more about this before the implementation.

omertov commented 3 years ago

Cool! After looking further into the appendix I saw that:

In total, for a 1024x1024 generator, there are 6048 style channels that control feature maps, and 3040 additional channels that control the tRGB blocks.

So in total the size of the StyleSpace S is 6048 + 3040 = 9088, which is slightly less than the 18 x 512 = 9216 parameters of the "W+" space.
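These totals can be recovered from the per-resolution feature-map widths of the official 1024x1024 StyleGAN2 (config-f) generator, since each conv's style vector has one channel per *input* feature map. A quick sanity check:

```python
# Feature-map widths per resolution in the 1024x1024 StyleGAN2 generator.
channels = {4: 512, 8: 512, 16: 512, 32: 512, 64: 512,
            128: 256, 256: 128, 512: 64, 1024: 32}

# conv1 operates on the 4x4 constant input; every later resolution has an
# upsampling conv (fed from the previous resolution) plus a regular conv.
conv_styles = [channels[4]]
for res in [8, 16, 32, 64, 128, 256, 512, 1024]:
    conv_styles += [channels[res // 2], channels[res]]

# One tRGB block per resolution, modulated by that resolution's width.
trgb_styles = [channels[res] for res in channels]

print(sum(conv_styles), sum(trgb_styles), sum(conv_styles) + sum(trgb_styles))
# 6048 3040 9088
```

This also gives 17 conv style vectors plus 9 tRGB style vectors, i.e. the 26 codes mentioned below.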

This means that dividing the codes obtained from the current network architecture might work, although modifying the number and shape of the "map2style" layers' outputs is also possible :).

The division of the layer codes should match the channel counts described in the following table from the StyleSpace paper (meaning there are 26 codes instead of 18, and the first code is of size 512 while the last is of size 32, etc.): [table of per-layer style channel counts from the StyleSpace paper]
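One way to sketch that division: split a flat 9088-dim encoder output into the 26 codes. The ordering below follows one plausible reading of the generator's forward pass (conv1 and tRGB at 4x4, then upsampling conv, conv, and tRGB per resolution); the exact split should be checked against the table in the paper:

```python
import numpy as np

channels = {4: 512, 8: 512, 16: 512, 32: 512, 64: 512,
            128: 256, 256: 128, 512: 64, 1024: 32}

# Assumed ordering: conv1 + tRGB at 4x4, then
# (upsampling conv, conv, tRGB) for each higher resolution.
sizes = [channels[4], channels[4]]
for res in [8, 16, 32, 64, 128, 256, 512, 1024]:
    sizes += [channels[res // 2], channels[res], channels[res]]

def split_flat_code(flat, sizes):
    """Split a flat style vector into per-layer StyleSpace codes."""
    out, offset = [], 0
    for n in sizes:
        out.append(flat[offset:offset + n])
        offset += n
    return out

codes = split_flat_code(np.zeros(9088), sizes)
print(len(codes), codes[0].shape, codes[-1].shape)  # 26 (512,) (32,)
```

This matches the totals above: 26 codes, the first of size 512 and the last of size 32, summing to 9088 channels.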

Best, Omer