pbaylies / stylegan-encoder

StyleGAN Encoder - converts real images to latent space

reasons for having 18 (8) identical dlat vectors #2

Closed oneiroid closed 5 years ago

oneiroid commented 5 years ago

Hey, very cool implementation, respect for the masking. I think I get why Karras et al. map latents to 18 identical dlatent vectors, and I'm wondering if my reasoning is valid. 10 out of the 18 are for noise, so we have 8 dlatents. If you imagine each dlatent as one coordinate, then StyleGAN maps the FFHQ face space onto the line in 8D space that makes the same angle with every axis. By forcing all FFHQ face embeddings onto this line, they make StyleGAN learn the whole 8D space. I'm wondering if this means we should truncate not towards the average dlatent, but towards the dlatent that represents the closest point on that line.
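A toy numpy sketch of what I mean (my own illustration, not code from this repo): tiling one (1, 512) dlatent into identical copies puts the embedding on the "diagonal" line of layer space, which makes the same angle with every layer axis.

```python
import numpy as np

n_layers = 8                                   # the non-noise dlatent slots discussed above
w = np.random.randn(512)                       # one dlatent vector (made up for the example)
w_plus = np.tile(w, (n_layers, 1))             # 8 identical rows -> a point on the diagonal
assert (w_plus == w).all()                     # every layer sees the same vector

diag = np.ones(n_layers) / np.sqrt(n_layers)   # direction of the diagonal in layer space
angles = [np.degrees(np.arccos(diag @ axis)) for axis in np.eye(n_layers)]
print(angles)                                  # the same angle for every axis
```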

pbaylies commented 5 years ago

Hi @oneiroid -- glad you like the implementation! You might be interested in this paper from NVIDIA, where they come up with a realism score based on how close a point is to the manifold. I'm not convinced that what I'm doing is the best solution -- it seems to work well in practice, but I am open to suggestions, and code!

oneiroid commented 5 years ago

[attached grids: lerp_ka_trunk_sing, lerp_ka_trunk_avg]

Thnx for the paper! But they don't seem to be dealing with non-identical dlatents. What I'm saying is that their manifold is a line in 8- (or 18-) dimensional space. By changing some of those 8 dlat 512-element vectors, the face point in latent space moves away from that line - but that's OK, the manifold is forced into a line intentionally. My suggestion is to use the closest point on that line for dlatent truncation instead of the average dlatent. Using the average dlatent is like moving towards the center point of that line, which causes the space distortions I observed... I've attached grids with a "disgust" emotion shift over the factor range [-3, 3] - the first uses the closest point, the second uses the average dlatent. Reference image below. [reference image: kaza_WA0005_sing]
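For the set of identical dlatents, the closest point to an arbitrary W+ latent is just its per-layer mean, so the truncation could anchor on that instead of dlatent_avg - a rough sketch of the idea (my own, untested):

```python
import numpy as np

w_plus = np.random.randn(1, 18, 512)                  # some encoded W+ dlatents (placeholder)
closest = w_plus.mean(axis=1, keepdims=True)          # (1, 1, 512): closest identical-dlatent point
truncation_psi = 0.7
w_plus_trunc = closest + truncation_psi * (w_plus - closest)   # lerp toward it, not toward dlatent_avg
```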

pbaylies commented 5 years ago

Thanks for the image, that does seem to result in better interpolations! Do you think this would improve training speed? If you can provide some code, I can test it out.

oneiroid commented 5 years ago

All the code is yours ))) I just first run encode_images.py with tile_dlatents, then use the result in the truncate function.

```python
import numpy as np
import dnnlib.tflib as tflib

def truncate_fast(dlat, dlat_avg, truncation_psi=0.7, minlayer=0, maxlayer=8, do_clip=False):
    # Interpolate layers [minlayer, maxlayer) toward dlat_avg; leave the rest untouched.
    layer_idx = np.arange(18)[np.newaxis, :, np.newaxis]
    ones = np.ones(layer_idx.shape, dtype=np.float32)
    coefs = np.where(layer_idx < maxlayer, truncation_psi * ones, ones)
    if minlayer > 0:
        coefs[0, :minlayer, :] = ones[0, :minlayer, :]
    if do_clip:
        return tflib.lerp_clip(dlat_avg, dlat, coefs).eval()
    else:
        return tflib.lerp(dlat_avg, dlat, coefs)

dlats_mod = truncate_fast(dlats_mod, dlat_singular, truncation_psi=0.7, maxlayer=8, do_clip=True)
```

pbaylies commented 5 years ago

@oneiroid this looks good to me; I pushed an update and added an option to effnet_train.py -- take a look?

oneiroid commented 5 years ago

Ehh, nah, sorry bro, I failed to explain it properly... I meant to completely replace dlatent_avg with a (1, 512) dlatent vector, which is found by running your encode script with --tile_dlatents=True.

This has to be done for each image that we want to encode (or once for each person).

Then, during the final encoding (and for learning directions in latent space, interpolations, etc.), all truncation should be done using this pre-found dlatent INSTEAD of the dlatent_avg that we fetch from Gs's own vars.
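Roughly the usage I have in mind (a hypothetical sketch with made-up paths, not the final code):

```python
import numpy as np

# Per-person dlatent from an earlier encode_images.py run with --tile_dlatents=True
# (the file path here is made up).
person_dlat = np.load('latents/person_tiled.npy').reshape(1, 18, 512)
dlat_singular = person_dlat[:, :1, :]            # all 18 rows are identical, keep one

# Truncate toward the per-person dlatent instead of dlatent_avg.
dlats_mod = truncate_fast(dlats_mod, dlat_singular, truncation_psi=0.7, maxlayer=8, do_clip=True)
```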

Will try to finish and share proper code when I get home.

pbaylies commented 5 years ago

@oneiroid Please do! That makes a bit more sense. I think the easiest way to do that would be to have an option for a numpy array to load to replace dlatent_avg. I think the only place I'm using that in the encoder is for the L1 penalty.

oneiroid commented 5 years ago

Done. Gonna try to implement a dynamic dlatent_avg for truncation (or clipping). I wonder if it can be found using linalg methods for finding the closest point on the manifold axis. Is it possible for an 8D space with each dimension represented by 512 floats - I mean, finding the closest point on a line from some reference point? Also, you use stochastic clipping - is it better than truncating (like halcy does here)? I've noticed a couple of times that the encoding process resets the dlatent variable mid-way when the L1 penalty is not used...
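I mean something like this kind of projection (a toy numpy sketch of what I have in mind, not working encoder code):

```python
import numpy as np

def closest_point_on_line(p, a, d):
    """Closest point to p on the line that passes through a with direction d."""
    d = d / np.linalg.norm(d)
    return a + np.dot(p - a, d) * d
```

For the identical-dlatents case this boils down to the per-layer mean, like the sketch above, but maybe there is a smarter axis to project onto.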

pbaylies commented 5 years ago

@oneiroid merged! I got the stochastic clipping from @Pender and his branch -- https://github.com/pender/stylegan-encoder -- it really does seem to speed things up and stop the latents from getting too far away from the model's representations.

pender commented 5 years ago

Here's the paper that proposes stochastic clipping.

oneiroid commented 5 years ago

Thnx! This clipping though - if its purpose is to keep dlats from going extreme, then it seems kinda overkill - information gets lost in the randomness... Truncating the learnable dlatents and your L1 penalty seem to be enough, but I need to check that.

pbaylies commented 5 years ago

@oneiroid that's what I thought at first too; it is a bit counterintuitive. But you can clearly see the effect in the training videos. Basically, stochastic clipping keeps the representation simple, so it short-circuits the optimization before it gets too complex and finds a local minimum somewhere else, or wastes time looking for one. We already know that all the good learned representations are closer to the center of the parameter space, so this keeps the search focused there.

P.S. Maybe you could find something that looks better on the surface if you searched further out, but interpolation etc. would be broken because at that point you're well outside the learned representation of the model.

pender commented 5 years ago

I think the intuition (which I'm getting from the paper linked above) is that if you truncate the dlatents (either with the hard bound of a clipping function or the soft bound of L2 loss or similar), you will end up with dlatents that have an abnormal proportion of their components pinned against the bound... so even though each component individually will still be within n standard deviations of the average, the dlatent as a whole will be statistically very different from average. Stochastic clipping prevents that altogether.
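A quick numpy illustration of that intuition (my own toy example, not encoder code): hard clipping piles components up exactly at the bound, while stochastic clipping resamples them instead.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000) * 1.5           # pretend these are drifting dlatent components

hard = np.clip(x, -2.0, 2.0)                 # hard bound: out-of-range values get pinned at +/-2
stoch = np.where(np.abs(x) > 2.0,            # stochastic clipping: resample out-of-range values
                 rng.normal(size=x.shape), x)

print((np.abs(hard) == 2.0).mean())          # a noticeable fraction sits exactly on the bound
print((np.abs(stoch) == 2.0).mean())         # ~0: nothing gets pinned against the bound
```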

But now I'm wondering, should we be stochastically clipping values to keep them close to dlatent_avg, rather than close to zero? I.e. instead of this:

clipping_mask = tf.math.logical_or(self.dlatent_variable > 2.0, self.dlatent_variable < -2.0)
clipped_values = tf.where(clipping_mask, tf.random_normal(shape=self.dlatent_variable.shape), self.dlatent_variable)

maybe it should be this:

dlatent_avg = Gs.get_var('dlatent_avg')
dlatent_dist_from_avg = dlatent_avg - self.dlatent_variable
clipping_mask = tf.math.logical_or(dlatent_dist_from_avg > 2.0, dlatent_dist_from_avg < -2.0)
clipped_values = tf.where(clipping_mask, tf.random_normal(shape=self.dlatent_variable.shape) + dlatent_avg, self.dlatent_variable)

Does dlatent_avg effectively measure the bias of the mapping network -- the mean around which it maps Z-vectors (latents rather than dlatents) that are normally distributed around zero? I guess an even more principled approach would be to backpropagate all the way through the mapping network to optimize the original Z-vector, and then perform stochastic clipping on that Z-vector based on its distance from zero. Doubt it would be worth the extra fragility in the optimization by propagating all the way through the mapping network though.

I also haven't played around with the boundary itself. Is 2 standard deviations the right threshold for clipping? Not sure, and I haven't done much testing... the paper was written with respect to a GAN where the latent variables were uniformly distributed in [-1, 1] rather than normally distributed, so it was easy for them to decide to clip at a distance of 1.

pbaylies commented 5 years ago

@pender I think you might be right that starting distributed around the average would be better; but I don't know if it would make that much of a difference in practice -- a good choice for a value will quickly converge, a bad choice might just get clipped again (monte carlo method!). I have played with the threshold, I think in practice [-2, 2] is fine, it seems big enough to allow for quite a bit without getting too crazy. It should be easy to experiment though, since I have a flag for it, and of course training videos...

oneiroid commented 5 years ago

@pbaylies you are right. My bad, I was wrong - I hadn't read the original paper thoroughly. The learned manifold is not a line; they do style mixing while training StyleGAN. The fact that samples from W+ (18 vectors) are always a set of only 2 unique (1, 512) W vectors must mean something, but I dunno how to use it, if I've finally got that right ))
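For reference, roughly what I mean (a toy sketch of style-mixing regularization, not code from this repo): when mixing kicks in, the W+ block is built from two mapped latents split at a random crossover layer.

```python
import numpy as np

n_layers = 18
w1, w2 = np.random.randn(512), np.random.randn(512)    # two independently mapped latents
crossover = np.random.randint(1, n_layers)              # random style-mixing cut point
w_plus = np.stack([w1 if i < crossover else w2 for i in range(n_layers)])

print(np.unique(w_plus, axis=0).shape[0])               # 2: only two unique rows in W+
```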

@pender this looks similar to what @halcy does:

```python
def create_variable_for_generator(name, batch_size):
    truncation_psi_encode = 0.7
    # Interpolate the first 8 of the 16 layers toward the average dlatent.
    layer_idx = np.arange(16)[np.newaxis, :, np.newaxis]
    ones = np.ones(layer_idx.shape, dtype=np.float32)
    coefs = tf.where(layer_idx < 8, truncation_psi_encode * ones, ones)
    dlatent_variable = tf.get_variable(
        'learnable_dlatents',
        shape=(1, 16, 512),
        dtype='float32',
        initializer=tf.initializers.zeros()
    )
    # dlatent_avg is assumed to be defined elsewhere, e.g. Gs.get_var('dlatent_avg').
    dlatent_variable_trunc = tflib.lerp(dlatent_avg, dlatent_variable, coefs)
    return dlatent_variable_trunc
```

But you'll need to use some truncation when generating faces from them. I used tflib.lerp_clip - and the resulting dlatents were inside the manifold. But isn't forcing the dlatents to group around dlatent_avg counter-productive? We should be forcing them to group around the bias point of our target face image, not the universal average. Pls tell me if I am still not getting something...

pbaylies commented 5 years ago

@oneiroid so the interesting thing about doing style mixing at a random point during training is that they also do progressive growing during training -- hence the full range of coarse to fine dlatents in the end, so it isn't two unique vectors either! It's definitely possible to optimize with less than the full set of dlatents, but by default I just use the full set.

pender commented 5 years ago

@oneiroid

> @pender this looks similar to what @halcy does:...

That does not look like stochastic clipping, it looks like it's taking the first eight layers 30% of the way toward the average.

> But isn't forcing the dlatents to group around dlatent_avg counter-productive? We should be forcing them to group around the bias point of our target face image, not the universal average. Pls tell me if I am still not getting something...

I'm not sure what you mean by "the bias point of our target face image." Encoding is necessary because we don't know which dlatents precisely encode the target face image. The purpose of biasing the result toward the dlatent average is that most randomly generated dlatents (random unit-normal zero-centered Z mapped to dlatent space by the mapping network) cluster near that average, so the synthesis net is more likely to do a good job near that average since that is the domain on which it has primarily been trained.

halcy commented 5 years ago

FYI: the code I use for truncation is doing (or should be doing) exactly the same thing as the original truncation during generation - it interpolates the dlatents for all layers above a certain point towards the "average" dlatent vector. This is not necessarily good or ideal for encoding, but it seemed to make sense to have the generation part behave the same way during encoding as it would during sampling later. I put it in because my encoding attempts tended to end up looking like the source image but with a dlatent representation that's not like the sampled dlatents at all. A loss based on the dlatents' distance (according to some metric) to the mean dlatent vector might be better.
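Something along these lines, maybe (an untested sketch; dlatent_variable, image_loss, and the weight are placeholders, not code from this repo):

```python
# Penalize how far the learnable dlatents drift from the model's mean dlatent,
# added on top of whatever image reconstruction loss is already used.
dlatent_avg = Gs.get_var('dlatent_avg')                       # (512,), broadcast over layers
dist_to_mean = tf.reduce_mean(tf.square(dlatent_variable - dlatent_avg))
loss = image_loss + mean_dist_weight * dist_to_mean           # mean_dist_weight is a made-up knob
```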

oneiroid commented 5 years ago

> The purpose of biasing the result toward the dlatent average is that most randomly generated dlatents (random unit-normal zero-centered Z mapped to dlatent space by the mapping network) cluster near that average, so the synthesis net is more likely to do a good job near that average

Right, I get that. Nevertheless, my point still stands: when we search for some image's dlatents, we assume that for that image (or face) there is some perfect set of 18 dlatents (better if identical) that is the dlatent_avg with respect to that image/face. It should be a "standard" (1, 512) dlatent vector, and it may produce a face quite different from the target image. Anyways, the effect of it is negligible compared to the current results.

pbaylies commented 5 years ago

Ok; good discussion, I think we've hashed some things out. I'm going to close this issue; feel free to open any more specific issues if you have an idea, or code. :)