mattiasxu / Video-VQVAE

VQVAE for video prediction
MIT License
26 stars, 7 forks

Too many Conv3d layers for HierarchicalPixelSNAIL.condition_bottom? #5

Closed: nathancornille closed this issue 2 years ago

nathancornille commented 2 years ago

Hi, thanks for the codebase.

I was wondering: the paper says the bottom latent code has dimension 64x64x8, and as I understand it, this is what gets passed to HierarchicalPixelSNAIL.condition_bottom in the code. However, condition_bottom is an nn.Sequential containing 4 nn.Conv3d layers with stride (2,1,1), so it tries to halve the time dimension four times. With a time dimension of 8, only three halvings are possible (8 -> 4 -> 2 -> 1), and the 4th layer fails with this error:

Calculated padded input size per channel: (3 x 66 x 66). Kernel size: (4 x 3 x 3). Kernel size can't be greater than actual input size

So should there be one fewer nn.Conv3d layer in HierarchicalPixelSNAIL.condition_bottom? Or am I missing something?
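For reference, the shape arithmetic behind this error can be sketched in plain Python. This is only a rough check, not code from the repository. The helper names are hypothetical, and the padding of 1 along the time axis is an assumption inferred from the error message (a padded size of 3 implies an input of 1 plus 2 of padding, just as 66 = 64 + 2 spatially). The kernel size 4 and stride 2 along time come from the error message and the stride (2,1,1) mentioned above.

```python
def conv_out_len(n, kernel, stride, padding):
    """Output length of a 1-D convolution along one axis (dilation 1)."""
    padded = n + 2 * padding
    if kernel > padded:
        # Mirrors the condition PyTorch reports as
        # "Kernel size can't be greater than actual input size".
        raise ValueError(
            f"kernel size {kernel} is greater than padded input size {padded}"
        )
    return (padded - kernel) // stride + 1


def trace_time_dim(t, num_layers, kernel=4, stride=2, padding=1):
    """Time-axis size after each of num_layers stacked convolutions.

    padding=1 is an assumption inferred from the error message.
    """
    sizes = [t]
    for _ in range(num_layers):
        t = conv_out_len(t, kernel, stride, padding)
        sizes.append(t)
    return sizes


if __name__ == "__main__":
    # Three layers fit a time dimension of 8.
    print(trace_time_dim(8, 3))
    # A fourth layer does not: the padded input (3) is smaller than the kernel (4).
    try:
        trace_time_dim(8, 4)
    except ValueError as err:
        print("4th layer fails:", err)
```

Under these assumptions, three layers reduce the time dimension 8 -> 4 -> 2 -> 1, and the fourth layer sees a padded input of 3 against a kernel of 4, which matches the reported error.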

Thanks again!

mattiasxu commented 2 years ago

You're correct. I'm still working on the PixelSNAIL and haven't been able to make it converge yet, so sorry about that.