kmkolasinski / deep-learning-notes

Experiments with Deep Learning

Questions about Glow implementation (Not bugs) #17

Closed geosada closed 4 years ago

geosada commented 4 years ago

Hi Krzysztof,

I'm studying Glow with your code and I'm confused about y and z; please let me ask some questions.

Q.1 My understanding is as follows:

So eventually, when we take the latent representation for a particular x, it should be concate([z,y]), am I correct?

Q.2 I would like to better understand the following, which is from your slide:

For p(y),

  • we want y to keep the information about the image

For p(z),

  • we want p(z) to keep the noise,

It means that y will be input to the next flow, so p(y) is possibly still a flexible form, i.e., not Gaussianized yet. The form of non-Gaussian can be interpreted as "keep the information about the image". Am I correct?

In addition, regarding the below:

For p(z),

  • we could train the model with different penalties for p(y) and p(z)

What does that mean? Specifically, which line in the code corresponds to it? I'm sorry for asking so many questions, but I'm looking forward to hearing from you.

kmkolasinski commented 4 years ago

Hi @geosada , I can try to answer your questions, but please note that I'm not a full-time scientist, so some of my statements might be wrong.

Regarding your questions:

Q1: basically, in the NF approach you cannot lose any information during processing, because the mappings must remain invertible, so indeed the total size of the (y, z) tensors will be exactly the same as that of the input image x. However, from my small experience I got the impression that y contains more structured information about the image than z. I'm not sure if there is a mathematical proof of that observation, and I'm not aware of any research that investigates this problem. In practice you should use concat([z, y]) as your latent code, but y will probably have a bigger effect on the final reconstructions, interpolations, etc. Since we are talking about deep learning, I think the best solution for you is to check this experimentally on your own data.
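For illustration, here is a minimal sketch of building such a latent code. It reuses the flow(x, forward=True) interface from the pseudo-code further below (not the exact library API), and the flatten/concat helper is my own, not part of the repo:

import tensorflow as tf

def latent_code(flow, x):
    # forward pass splits x into the two flow outputs
    y, logdet, z = flow(x, forward=True)
    y_flat = tf.reshape(y, [tf.shape(y)[0], -1])  # [batch, H_y * W_y * C_y]
    z_flat = tf.reshape(z, [tf.shape(z)[0], -1])  # [batch, H_z * W_z * C_z]
    # the concatenated code has exactly as many elements as the flattened input x
    return tf.concat([z_flat, y_flat], axis=1)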

Q2: I think this is a nice way of looking at the problem. We have some high-dimensional image, and we would like to reduce its dimensionality and separate it into two vectors: y, a small vector with useful information inside, and z, a much larger vector which will contain noise and unstructured information (Gaussianized much earlier). So I agree with your statement.

Regarding your last question: a few months ago I got the idea that maybe there is some way to force z to keep as much noise (or as little information) as possible, such that if we remove it from the latent representation we can still do decent reconstructions. You can try to force the network to do so using the following approach (written in pseudo-code):

y, logdet, z = flow(x, forward=True)
# remove z and try to force the network to store all information about the image in y,
# so that z is effectively not used
z *= 0
x_reconstructed, *_ = flow((y, logdet, z), forward=False)
reconstruction_loss = mse(x_reconstructed, x)
log_likelihood = log_prob_y + logdet
mle_loss = -tf.reduce_mean(log_likelihood)
total_loss = mle_loss + reconstruction_loss

However, as far as I remember, I couldn't make it work, probably because of exploding gradients or because I did something wrong. Second approach: you can try to inject random noise into z and force the network to ignore this output during the backward flow, e.g.

y, logdet, z = flow(x, forward=True)
# add Gaussian noise to z to try to force the network to ignore the z content
z += tf.random_normal(tf.shape(z))
x_reconstructed, *_ = flow((y, logdet, z), forward=False)
...

I didn't try this approach. A third approach could try to minimize the mutual information between z and x while maximizing it between y and x. This information-theory-based approach is probably a very interesting direction. Unfortunately, I'm not sure whether there exists an easy-to-compute formula for this idea. You can google some approximations (lower/upper bounds) of mutual information, e.g. the InfoGAN paper or, for discrete variables, the "Invariant Information Clustering for Unsupervised Image Classification and Segmentation" paper; a rough sketch of one such bound is given below.
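For example, here is a rough sketch of an InfoNCE-style lower bound on I(y; x), estimated over a batch. This particular bound is not from the papers above, and the embeddings y_emb and x_emb (shape [batch, dim], produced by small auxiliary projection networks) are my own assumption of how it could be wired in; maximizing it would encourage y to stay informative about x, and the analogous term for z could be penalized:

import tensorflow as tf

def infonce_lower_bound(y_emb, x_emb):
    # scores[i, j] = similarity between y_i and x_j; the diagonal holds the true pairs
    scores = tf.matmul(y_emb, x_emb, transpose_b=True)
    batch_size = tf.shape(scores)[0]
    labels = tf.range(batch_size)
    # cross-entropy of picking the matching x for each y; log(batch_size) minus its
    # mean is the InfoNCE lower bound on the mutual information I(y; x)
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=scores)
    return tf.math.log(tf.cast(batch_size, tf.float32)) - tf.reduce_mean(loss)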

Here you can find how to use my small library for NFs: https://github.com/kmkolasinski/deep-learning-notes/tree/master/seminars/2018-10-Normalizing-Flows-NICE-RealNVP-GLOW/notebooks#a-normalizing-flow-high-level-library

Hope this helps, best regards, Krzysztof

geosada commented 4 years ago

Hi Krzysztof,

Thank you so much for your reply and for sharing your idea with us. Now I understand why you experimented with treating y and z separately.

Actually, your observation was quite intriguing to me, because I'm thinking about the possibility of dimensionality reduction with NFs. Both approaches you suggested regarding Q2 were interesting; I'll try them with your code.

Originally, I was interested in your materials on Neural ODEs (which also look marvelous!), but before moving on to that I'll stay here a little while to study Glow.

Thank you Krzysztof again, best regards, geosada

geosada commented 4 years ago

Sorry Krzysztof, let me ask one more thing. Regarding the below code in nets.py at line 181,

shift = shift_log_scale[:, :, :, 0::2]

should it be as below?

shift = shift_log_scale[:, :, :, 0::1]

Best regards,

kmkolasinski commented 4 years ago

Thank you for the feedback and good luck with your research! You can send me some feedback if you manage to achieve some interesting results.

PS: Tensorflow probability package has quite a lot implementations of different bijectors: https://www.tensorflow.org/probability/api_docs/python/tfp/bijectors

Regarding your last question, my implementation seems to be fine:

        def _shift_and_log_scale_fn(x: tf.Tensor):
            shape = K.int_shape(x)
            num_channels = shape[3]

            with tf.variable_scope("BlockNN"):
                h = x
                h = self.activation_fn(ops.conv2d("l_1", h, self.width))
                h = self.activation_fn(
                    ops.conv2d("l_2", h, self.width, filter_size=[1, 1]))
                # create shift and log_scale with zero initialization
                shift_log_scale = ops.conv2d_zeros(
                    "l_last", h, 2 * num_channels
                )
                shift = shift_log_scale[:, :, :, 0::2]
                log_scale = shift_log_scale[:, :, :, 1::2]
                log_scale = tf.clip_by_value(log_scale, -15.0, 15.0)
                return shift, log_scale

Note that shift_log_scale is a tensor of shape [batch_size, height, width, 2 * num_channels], which I then split into two tensors: shift = shift_log_scale[:, :, :, 0::2] and log_scale = shift_log_scale[:, :, :, 1::2]. In Python, the 0::2 slice takes all even channel indices, e.g. [0, 2, 4, 6, ...], and 1::2 takes all odd channel indices, [1, 3, 5, ...]. This is a standard trick, introduced in the RealNVP paper or earlier (I don't remember now), which allows one to create an invertible non-linear bijector. This operation is related to equation 5 of https://arxiv.org/pdf/1605.08803.pdf, and this part of the code corresponds to the affine coupling layer equations in Table 1 of the GLOW paper.
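A quick way to check the interleaved slicing (illustrative only, using a toy tensor with 2 * num_channels = 6 channels):

import numpy as np

x = np.arange(6).reshape(1, 1, 1, 6)   # shape [batch, height, width, 6]
shift = x[:, :, :, 0::2]               # even channel indices: 0, 2, 4
log_scale = x[:, :, :, 1::2]           # odd channel indices: 1, 3, 5
print(shift.ravel())                   # [0 2 4]
print(log_scale.ravel())               # [1 3 5]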

kmkolasinski commented 4 years ago

You may find this paper interesting: https://arxiv.org/abs/1905.07376

geosada commented 4 years ago

Ah, sorry Krzysztof, it was my misunderstanding; thank you for the clear explanation. I thought the code was simply going to split the channels into two parts, like [:, :, :, 0:1] and [:, :, :, 1:2].

Let me confirm to make sure: are the y and z used in your interpolation experiments with varying temperature and in the t-SNE plot the final outputs of the forward flow? I mean, you used y (24 channels) and z (168 channels) from, for example, Celeba48x48_22steps.ipynb, In [12]: y, logdet, z = output_flow. Am I correct? Sorry for asking so many times.

By the way, the paper you recommended is worth checking and is exactly my interest; I'll read it, thank you.

Best regards, geosada

kmkolasinski commented 4 years ago

Yes, that's correct. The z and y vectors are indeed the outputs of the forward flow.

geosada commented 4 years ago

Okay, thank you for answering quickly. Best regards.