pclucas14 / pixel-cnn-pp

PyTorch Implementation of OpenAI's PixelCNN++

What is the training process? #12

Closed · XiaoHao-Chen closed 5 years ago

XiaoHao-Chen commented 5 years ago

Hello, I want to ask you a question about the training and testing of PixelCNN. During training, a batch of images is fed in, and the network estimates a probability density for the pixels. Then what? How does it generate new images from these probability densities? I haven't been able to understand this part of the training process, and I couldn't find how it is trained in the paper. If you can, please let me know. Thanks very much.

pclucas14 commented 5 years ago

Hi @331801070049,

It seems your question is pretty general (and not specific to pixel-cnn). Every conditional probability is estimated by the network. The network is then trained to adjust its parameters so that the observed pixels are likely under the probability distributions it emits.

If you have more specific questions I will be happy to help.

-Lucas

XiaoHao-Chen commented 5 years ago

Thank you for your reply. I may not have explained my question well, but my main questions now are:

  1. What is the main framework of PixelCNN? What kind of architecture does its network have? Is it like the encoder and decoder drawn in the PixelCNN++ paper? Does the depth of the network vary with the dataset, or does it have a unified structure?
  2. I haven't been able to figure out what its loss function is. Is it the negative log-likelihood?
  3. During training, the encoder network encodes the image into features; the network learns its parameters at this stage, and those parameters are then used in the decoder network to generate new pictures. Is my understanding correct?
  4. Take the same picture, such as a portrait: after training, the parameters are fixed, so why can we generate different images with different expressions? With the same input and the same learned parameters, shouldn't the generated pictures be the same? Thank you very much for your help! Best wishes!

XiaoHao-Chen commented 5 years ago

In addition, if I want to train on my own dataset, how should I set the proportion of training and test data? The images I want to train on are 256*256. Will this be a problem? Thank you very much for your help! Best wishes!

pclucas14 commented 5 years ago

  1. The architecture is described in Figure 2 of the paper.
  2. The loss function is the negative log-likelihood, as is common when training deep neural nets. It's unclear how the architecture should change for other datasets, as the original paper only trains on CIFAR-10.
  3. The encoder and decoder parameters are learned jointly. During training, images are never generated: when the network predicts pixel (i, j), it has access to the real image's pixels that come before (i, j). (A minimal training sketch is below.)
  4. Every conditional distribution is, well, a distribution. You can sample from it, and every time you sample you can get different results.

As for your last comment: 256x256 images are much bigger than 32x32, so you should expect training to be very slow at this resolution. What is your use case?
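
To make points 2 and 3 above concrete, here is a minimal sketch of one training step, assuming a generic PyTorch setup; `model` and `nll_loss` are placeholders (in this repo the loss is a discretized mixture of logistics), not the exact API.

```python
def train_step(model, optimizer, nll_loss, images):
    """One teacher-forced training step: thanks to causal masking, the
    prediction for pixel (i, j) only sees real pixels before (i, j), and the
    network is trained to make the real pixels likely under its predictions."""
    optimizer.zero_grad()
    params = model(images)           # per-pixel distribution parameters
    loss = nll_loss(images, params)  # negative log-likelihood of the real pixels
    loss.backward()
    optimizer.step()
    return loss.item()
```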

XiaoHao-Chen commented 5 years ago

Thank you very much for your reply, which is very helpful to me.

  1. So you are saying that no images are generated during training, and the training criterion is to minimize the negative log-likelihood. Then at test time, the trained parameters are still used: pixels are sampled one by one, and the new image is generated according to the probability distributions computed by the network. Is that right?
  2. I want to train PixelCNN on the UCMerced_LandUse remote-sensing dataset. My preliminary idea is to add a convolutional network in front of PixelCNN that downsamples the images to 28*28, but I haven't tried it yet. Do you think it's feasible? Thank you very much for your help!

XiaoHao-Chen commented 5 years ago

Are you saying that pixels are only predicted during testing? And when predicting (i, j), you say the model can access the pixels of the real image. Shouldn't it access previously predicted pixels instead? I still haven't been able to figure out how an image is generated; all I know is that it is produced pixel by pixel. Can you give me more details about the generation process?

pclucas14 commented 5 years ago

To answer the first part of your question, the model is "teacher forced" during training. During sampling, however, pixels are generated sequentially. The code for sampling can be found here.

pclucas14 commented 5 years ago

Say our image is a 3 x 3 grid, and thus contains 9 pixels. Your goal is to model the joint distribution, i.e. p(x1, x2, ..., x9). Autoregressive models such as PixelCNN break the joint distribution up into a product of conditionals (as is done in NLP models like LSTMs or Transformers): p(x1, x2, ..., x9) = p(x1) p(x2 | x1) p(x3 | x2, x1) ... p(x9 | x8, x7, ..., x1).

It's important to ensure that every conditional distribution, e.g. p(x3 | x2, x1), cannot look at its own target value, here x3. To achieve this, the model uses masked convolutions (you can look here for more info); a sketch of one follows.
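
For illustration, here is a minimal type-A masked convolution in the style of the original PixelCNN. Note that PixelCNN++ (and this repo) instead uses shifted/cropped convolutions to get the same causality, so treat this as a sketch of the idea, not this repo's implementation:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Type-A masked convolution: the output at (i, j) can only see pixels
    strictly before (i, j) in raster-scan order, never (i, j) itself."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2:] = 0   # centre position and everything right of it
        mask[kH // 2 + 1:, :] = 0     # every row below the centre
        self.register_buffer('mask', mask[None, None])

    def forward(self, x):
        self.weight.data *= self.mask  # re-apply the mask before each forward
        return super().forward(x)

# Usage: a causal 7x7 convolution over RGB input with 64 output channels.
conv = MaskedConv2d(3, 64, kernel_size=7, padding=3)
```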

Now, when you are training, you already have x1, ..., x9, so you can compute all the conditionals at the same time. During sampling, however, since x2 depends on x1 and x3 depends on x2, you need to sample one pixel at a time (this cannot be parallelized). This iteration over pixels corresponds to the for loop in the sample method I linked above; a sketch of it follows.
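
As an illustration only (not this repo's exact code), assuming the model returns per-pixel distribution parameters and a `sample_from_params` helper that draws pixel values from them (a stand-in for the mixture-of-logistics sampler):

```python
import torch

@torch.no_grad()
def sample(model, sample_from_params, shape=(1, 3, 32, 32), device='cpu'):
    """Generate images pixel by pixel in raster-scan order. Each step runs a
    full forward pass, then keeps only the newly sampled pixel (i, j), which
    all later predictions can then condition on."""
    x = torch.zeros(shape, device=device)
    _, _, H, W = shape
    for i in range(H):
        for j in range(W):
            params = model(x)
            x[:, :, i, j] = sample_from_params(params)[:, :, i, j]
    return x
```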

Hope this helps, Lucas

rsbandhu commented 4 years ago

Hi Lucas, could you please point to where in the code you are enforcing teacher forcing during training?

Perry-Zhang-pyz commented 4 years ago

When I train the PixelCNN model, it can output an image similar to the input, but when I try to generate a new image, most of the pixel values are the same. What's wrong with my training or generation process?