Too many Conv3d layers for HierarchicalPixelSNAIL.condition_bottom?

Hi, thanks for the codebase.

I was wondering: the paper says the bottom latent code has dimension 64x64x8, and as I understand it, this is what gets passed to HierarchicalPixelSNAIL.condition_bottom in the code. However, condition_bottom is an nn.Sequential with four nn.Conv3d layers, each with stride (2,1,1), so it halves the time dimension four times. Starting from a time dimension of 8, this gives me an error at the fourth layer:

Calculated padded input size per channel: (3 x 66 x 66). Kernel size: (4 x 3 x 3). Kernel size can't be greater than actual input size.

So should there be one fewer nn.Conv3d layer in HierarchicalPixelSNAIL.condition_bottom? Or am I missing something?
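For reference, here is a rough sketch of the size arithmetic along the time dimension. It is plain Python, not the repo's code. The kernel size of 4 comes from the error message; the padding of 1 is my assumption, inferred from the padded size of 3 in the error (a time dimension of 1 plus 1 on each side). The helper names conv_out_size and trace_time_dim are made up for illustration.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output length of one conv axis (dilation 1), mirroring PyTorch's size check."""
    padded = size + 2 * padding
    if padded < kernel:
        raise ValueError(
            f"padded input size {padded} is smaller than kernel size {kernel}"
        )
    return (padded - kernel) // stride + 1


def trace_time_dim(size, num_layers, kernel=4, stride=2, padding=1):
    """Return the time-dimension size before and after each stacked conv layer.

    kernel=4 and stride=2 match the issue text; padding=1 is an assumption.
    """
    sizes = [size]
    for _ in range(num_layers):
        size = conv_out_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


print(trace_time_dim(8, 3))  # three layers: 8 -> 4 -> 2 -> 1

try:
    trace_time_dim(8, 4)  # the fourth layer sees a padded size of 3 < kernel 4
except ValueError as err:
    print("fourth layer fails:", err)
```

Under these assumptions, three layers reduce a time dimension of 8 down to 1, and a fourth layer has nothing left to halve, which matches the error above.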

Thanks again!

mattiasxu / Video-VQVAE
