rosinality / vq-vae-2-pytorch

Implementation of Generating Diverse High-Fidelity Images with VQ-VAE-2 in PyTorch

Fidelity to the VQ-VAE-2 paper #32

Closed · natoucs-datagen closed this issue 4 years ago

natoucs-datagen commented 4 years ago

Hi, I am trying to build a two-stage VQ-VAE-2 + PixelCNN as described in the paper "Generating Diverse High-Fidelity Images with VQ-VAE-2" (https://arxiv.org/pdf/1906.00446.pdf). I have three implementation questions:

  1. The paper mentions: "We allow each level in the hierarchy to separately depend on pixels". My understanding is that the second latent space in the VQ-VAE-2 should be conditioned on a concatenation of the first latent space and a downsampled version of the image. However, in this implementation the second latent space is conditioned only on the first one. Why? (I sketch what I mean after this list.) See: https://github.com/rosinality/vq-vae-2-pytorch/blob/master/vqvae.py#L199

  2. There is no class-conditional implementation of the PixelCNN here. The paper "Conditional Image Generation with PixelCNN Decoders" (https://papers.nips.cc/paper/6527-conditional-image-generation-with-pixelcnn-decoders.pdf) says: "If h is a one-hot encoding that specifies a class, this is equivalent to adding a class-dependent bias at every layer". As I understand it, the condition is a 1D tensor that is injected into the bias of every layer through a convolution. Now, for a two-stage conditional PixelCNN, one needs to condition not only on the class vector but also on the latent code of the previous stage. One possibility I see is to concatenate them and feed a single 3D tensor. How would you insert both of those conditions into the PixelCNN architecture? (See the second sketch after this list.)

  3. The loss and optimization are unchanged with two stages: one simply sums the loss of each stage into a final loss that is optimized. Is that right?
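
For point 1, here is a minimal sketch of the kind of top-level encoder I have in mind (module and variable names are my own, not the ones used in vqvae.py): the top encoder would take both the bottom encoder's feature map and a downsampled copy of the input image.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TopEncoderWithImage(nn.Module):
    """Hypothetical top encoder that also sees a downsampled copy of the image."""

    def __init__(self, bottom_channels=128, img_channels=3, hidden=128):
        super().__init__()
        # input channels = bottom feature maps + downsampled RGB image
        self.conv = nn.Sequential(
            nn.Conv2d(bottom_channels + img_channels, hidden, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1),
        )

    def forward(self, enc_b, img):
        # resize the image to the spatial size of the bottom feature map
        img_small = F.interpolate(
            img, size=enc_b.shape[-2:], mode='bilinear', align_corners=False
        )
        # concatenate along the channel dimension and encode
        return self.conv(torch.cat([enc_b, img_small], dim=1))
```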
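For point 2, this is a rough sketch of how I imagine injecting both conditions into a gated PixelCNN layer, following the conditional-bias idea from the PixelCNN Decoders paper: the class label enters as a spatially constant bias, and the previous-stage latent code enters as a spatial bias, so the two do not have to be merged into a single tensor beforehand. All names here are hypothetical and not taken from this repository.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ConditionedGate(nn.Module):
    """Gated activation with a class-dependent bias and a spatial (latent-code) bias."""

    def __init__(self, channels, n_class, cond_channels):
        super().__init__()
        self.class_proj = nn.Linear(n_class, 2 * channels)          # class -> global bias
        self.cond_proj = nn.Conv2d(cond_channels, 2 * channels, 1)  # latent map -> spatial bias

    def forward(self, h, class_onehot, cond_map):
        # h: output of the preceding (masked) convolution, shape (B, 2*C, H, W)
        # class_onehot: float one-hot class vector, shape (B, n_class)
        # cond_map: latent code from the previous stage, shape (B, cond_channels, h', w')
        cond_map = F.interpolate(cond_map, size=h.shape[-2:], mode='nearest')
        h = (
            h
            + self.class_proj(class_onehot)[:, :, None, None]  # broadcast over H, W
            + self.cond_proj(cond_map)
        )
        a, b = h.chunk(2, dim=1)
        return torch.tanh(a) * torch.sigmoid(b)
```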

natoucs-datagen commented 4 years ago

I answered my own question on Stack Overflow: https://stackoverflow.com/questions/60884274/implementation-of-vq-vae-2-paper/60974545#60974545