Separate training for top and bottom prior models

I've been working with this model, and the PixelSNAIL training code didn't converge.

The problem was that the original code trained the top and bottom prior models simultaneously, whereas the paper trained the two models separately.

Changing the order of the top and bottom generation ensures that gradients flow separately through the two models, which helps training converge.
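
For illustration, here is a minimal sketch of what training the two priors separately can look like in PyTorch. It is not the code from this repository: `TinyPrior`, `train_prior`, `N_EMBED`, and the random code maps are hypothetical stand-ins (a 1x1 convolution instead of a real autoregressive PixelSNAIL), used only to show each prior getting its own optimizer and its own training loop.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
N_EMBED = 8  # hypothetical codebook size


class TinyPrior(nn.Module):
    """Hypothetical stand-in for a PixelSNAIL prior (not autoregressive)."""

    def __init__(self, n_embed, conditioned=False):
        super().__init__()
        self.n_embed = n_embed
        in_channels = n_embed * (2 if conditioned else 1)
        self.net = nn.Conv2d(in_channels, n_embed, kernel_size=1)

    def _one_hot(self, codes):
        # (N, H, W) integer codes -> (N, n_embed, H, W) float tensor
        return F.one_hot(codes, self.n_embed).permute(0, 3, 1, 2).float()

    def forward(self, codes, condition=None):
        x = self._one_hot(codes)
        if condition is not None:
            # Upsample the coarser top codes to the bottom resolution.
            c = F.interpolate(self._one_hot(condition), size=codes.shape[-2:])
            x = torch.cat([x, c], dim=1)
        return self.net(x)  # per-position logits over the codebook


def train_prior(model, codes, condition=None, steps=5, lr=1e-2):
    """Train one prior with its own optimizer; no other model is updated."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(steps):
        logits = model(codes, condition)
        loss = F.cross_entropy(logits, codes)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


# Random code maps standing in for codes extracted by a trained VQ-VAE.
top_codes = torch.randint(0, N_EMBED, (4, 4, 4))
bottom_codes = torch.randint(0, N_EMBED, (4, 8, 8))

top_prior = TinyPrior(N_EMBED)
bottom_prior = TinyPrior(N_EMBED, conditioned=True)

# Stage 1: train only the top prior and check the bottom prior is untouched.
bottom_before = [p.detach().clone() for p in bottom_prior.parameters()]
top_losses = train_prior(top_prior, top_codes)
bottom_untouched = all(
    torch.equal(a, b) for a, b in zip(bottom_before, bottom_prior.parameters())
)

# Stage 2: train only the bottom prior, conditioned on the top codes.
bottom_losses = train_prior(bottom_prior, bottom_codes, condition=top_codes)
```

Because each call to `train_prior` builds an optimizer over a single model's parameters, a step for one prior cannot update the other, which is the separation described above.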