pbaylies / stylegan-encoder

StyleGAN Encoder - converts real images to latent space

Inverse network output shape #1

Closed pender closed 5 years ago

pender commented 5 years ago

Picking up from here...

My understanding of the code in train_effnet.py is that you generate a training set in which the targets are the dlatent outputs of the StyleGAN mapping network, and the inputs are the images synthesized from those dlatents with the StyleGAN synthesis network.

The thing that confuses me is that the StyleGAN mapping network outputs a single [1, 512] vector that is then tiled up to [18, 512], so that all 18 layers are identical. But the effnet's architecture doesn't constrain its output the same way: it outputs an [18, 512] tensor whose layers aren't constrained to be identical to one another, and in practice it doesn't learn to make them identical. (Example: target image, the composite image it generates, and each of the 18 layers synthesized individually)

Am I understanding it correctly? If so, wouldn't you normally constrain the architecture of a network to the same rough domain as the targets in the training set? For example, if you were training a GAN with a 512x512 grayscale training set, wouldn't you set its output to 512x512, and not 512x512x3?
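To make sure I'm describing the same thing you are, here's a minimal NumPy sketch of the tiling as I understand it (illustrative only, not code from this repo):

```python
import numpy as np

# Illustrative only: the mapping network emits one [1, 512] dlatent,
# which is tiled across all 18 synthesis layers, so every row is identical.
w = np.random.randn(1, 512)        # stand-in for the mapping-network output
dlatents = np.tile(w, (18, 1))     # shape (18, 512), all rows equal
```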

pbaylies commented 5 years ago

Hi @pender - my goal with this encoder is to encode faces well into the latent space of the model; by constraining the total output to [1, 512], it's tougher to get a very realistic face without artifacts. Because the dlatents run from coarse to fine, it's possible to mix them for more variation and finer control over details, which NVIDIA does in the original paper. In my experience, an encoder trained like this does a good job of generating smooth and realistic faces, with fewer artifacts than the original generator.

I am open to having [1, 512] as an option for building a model, but not as the only option, because I don't believe it will ultimately perform as well for encoding as using the entire latent space -- but it will surely train faster!
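The coarse-to-fine mixing from the paper can be sketched roughly like this (illustrative NumPy; the crossover point 8 is made up):

```python
import numpy as np

# Illustrative style mixing as in the StyleGAN paper: coarse layers
# from identity A, fine layers from identity B. Crossover 8 is arbitrary.
rng = np.random.default_rng(0)
w_a = np.tile(rng.standard_normal((1, 512)), (18, 1))  # identity A
w_b = np.tile(rng.standard_normal((1, 512)), (18, 1))  # identity B

crossover = 8
mixed = np.concatenate([w_a[:crossover], w_b[crossover:]], axis=0)
```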

pender commented 5 years ago

@pbaylies - Right, I totally get that point with respect to the encoder, which optimizes all 18 layers of the dlatent tensor based on the perceptual loss between the generated image and the original image. My question is specifically about the inverse network that you can train via train_effnet.py or train_resnet.py. When you train that network, its training targets are exclusively outputs of the StyleGAN mapping network, so it is training to match dlatent tensors where all 18 rows are the same for each data point. In other words, it never receives a training signal that would make a difference between the 18 layers meaningful, so any variation between rows would just be training noise. Am I misunderstanding how the training works?

pbaylies commented 5 years ago

@pender -- take a close look at where I generate the dataset in generate_dataset_main(): I purposely pull extra values from the mapping network and mix them in to generate more diverse faces. That's what all the mod_l / mod_r stuff relates to -- generating more diverse dlatents for training.
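Roughly, the idea is something like this sketch (illustrative only; the actual mod_l / mod_r logic in generate_dataset_main() differs in its details):

```python
import numpy as np

# Illustrative only -- not the actual mod_l / mod_r code. The idea:
# instead of tiling one mapping-network output across all 18 rows,
# blend two outputs per layer so the training dlatents vary row to row.
rng = np.random.default_rng(0)
w1 = np.tile(rng.standard_normal((1, 512)), (18, 1))
w2 = np.tile(rng.standard_normal((1, 512)), (18, 1))

alpha = rng.uniform(0.0, 1.0, size=(18, 1))   # per-layer mixing weights
dlatents = alpha * w1 + (1.0 - alpha) * w2    # shape (18, 512), rows differ
```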

pender commented 5 years ago

Aha! Thank you for clarifying. I suspected I might have this wrong when the composite face looked closer to the original than the faces generated from the individual layers.

pbaylies commented 5 years ago

@pender Cheers; I should probably document that section better / at all... :)

pender commented 5 years ago

Hi @pbaylies, would it be a lot of work to add a flag (or just to indicate to me how) to build an effnet that outputs a [1, 512] dlatent? I've been staring at the effnet code for a while and I'm not sure how to do it. I can handle changing the assembly of training data, but I'd much appreciate a pointer on correctly tweaking the effnet's architecture itself if you have a minute or two.
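For what it's worth, here's what I imagine the change might look like -- purely a guess in Keras (the helper name and the feature-map shape are made up, standing in for whatever the effnet backbone actually produces):

```python
import tensorflow as tf  # assuming the repo's TF/Keras setup

def build_single_dlatent_head(features):
    """Map backbone features to one 512-vector, then tile to (18, 512)."""
    x = tf.keras.layers.GlobalAveragePooling2D()(features)
    w = tf.keras.layers.Dense(512)(x)                    # single dlatent
    # tile the single vector across all 18 synthesis layers
    return tf.keras.layers.Lambda(
        lambda v: tf.tile(v[:, tf.newaxis, :], [1, 18, 1]))(w)

# made-up feature-map shape, standing in for the effnet backbone output
inp = tf.keras.Input(shape=(8, 8, 1280))
model = tf.keras.Model(inp, build_single_dlatent_head(inp))
```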

pbaylies commented 5 years ago

Hi @pender -- I had tested out a simplified version of this code for that purpose; I'll post something for you tomorrow!

pbaylies commented 5 years ago

Here you go @pender -- see if this works for you!

train_eff512.py.zip

pender commented 5 years ago

Terrific -- thank you so much!