phlippe / uvadlc_notebooks

Repository of Jupyter notebook tutorials for teaching the Deep Learning Course at the University of Amsterdam (MSc AI), Fall 2023
https://uvadlc-notebooks.readthedocs.io/en/latest/
MIT License

[Question] Squeeze and Split flows in Tutorial 11 #78

Closed pi-tau closed 1 year ago

pi-tau commented 1 year ago

Tutorial: 11

Describe the bug
Not a bug, but rather some questions regarding the exact implementation of the Squeeze and Split flows. I somehow managed to dig up the official implementation of the RealNVP model in the tensorflow archives (here) and there are some differences. I am not sure whether these differences are actually relevant, but I would still be happy to discuss them.

To Reproduce (if any steps necessary)

Squeeze
In notebook 11, cell 18, the reshape for Squeeze is implemented as:

z = z.permute(0, 1, 3, 5, 2, 4)

But here they implement it somewhat differently. Note that this is in TensorFlow and the image dims are (H, W, C), i.e. channels last. The PyTorch equivalent would be:

z = z.permute(0, 3, 5, 1, 2, 4)

The difference is that with their ordering the two spatial sub-sampling indices come before the original channel index, so the spatial dimensions get intermixed with the channel dimensions in the squeezed output.
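To make the difference concrete, here is a small self-contained sketch (variable names are mine, not from either codebase) comparing the two orderings on a numbered 2-channel 4x4 image; with a single input channel the two orderings happen to coincide, so two channels are needed to see the difference:

```python
import torch

# Hypothetical minimal demo of the two squeeze orderings (not the tutorial code).
x = torch.arange(32.0).reshape(1, 2, 4, 4)  # 2-channel 4x4 image, pixels numbered
N, C, H, W = x.shape
z = x.reshape(N, C, H // 2, 2, W // 2, 2)   # split H and W into (coarse, sub) pairs

# Tutorial ordering: original channel first, then the two sub-position indices.
z_tut = z.permute(0, 1, 3, 5, 2, 4).reshape(N, 4 * C, H // 2, W // 2)

# RealNVP-style ordering: sub-position indices first, original channel last.
z_off = z.permute(0, 3, 5, 1, 2, 4).reshape(N, 4 * C, H // 2, W // 2)

print(z_tut[0, 1])  # sub-image of channel 0 at (even rows, odd cols)
print(z_off[0, 1])  # sub-image of channel 1 at (even rows, even cols)
```

Both results contain the same 8 sub-images, only permuted along the channel axis, which is why either choice is a valid invertible squeeze; what changes is which sub-images a subsequent channel mask groups together.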

Split
The second question is regarding splitting in multi-scale architectures. You can see here that after they squeeze and do the channelwise coupling, they perform an unsqueeze and then squeeze again, but using a different pattern. As for squeeze_2x2_ordered... I really don't know why it is written this way. Essentially, what it does is:

z = z.reshape(N, C, H//2, 2, W//2, 2)
z = z.permute(0, 3, 5, 1, 2, 4)              # (N, 2, 2, C, H//2, W//2)
on = torch.stack((z[:, 0, 0], z[:, 1, 1]))   # even-even and odd-odd sub-images
off = torch.stack((z[:, 0, 1], z[:, 1, 0]))  # even-odd and odd-even sub-images

So if you take a look at the squeeze_operation.svg image, instead of keeping the first two channels and evaluating the last two channels, you would keep the first and the last channel and evaluate the middle two.
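A tiny runnable sketch of that alternating pattern (my reconstruction, with the stack dimension made explicit; names are mine) on a numbered 4x4 image:

```python
import torch

# Hypothetical reconstruction of the alternating split (not the official code).
N, C, H, W = 1, 1, 4, 4
x = torch.arange(float(N * C * H * W)).reshape(N, C, H, W)

z = x.reshape(N, C, H // 2, 2, W // 2, 2)
z = z.permute(0, 3, 5, 1, 2, 4)  # (N, 2, 2, C, H//2, W//2): row/col sub-indices in front

# "on" keeps the diagonal sub-images (even-row/even-col and odd-row/odd-col pixels),
# "off" keeps the anti-diagonal ones (even/odd and odd/even).
on = torch.stack((z[:, 0, 0], z[:, 1, 1]), dim=1)   # (N, 2, C, H//2, W//2)
off = torch.stack((z[:, 0, 1], z[:, 1, 0]), dim=1)

print(on.flatten().tolist())   # [0.0, 2.0, 8.0, 10.0, 5.0, 7.0, 13.0, 15.0]
```

So the kept half is a diagonal checkerboard over 2x2 blocks rather than a contiguous block of channels.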

So for both the Squeeze and the Split I am wondering: does it really matter whether we do it one way or the other? And what was their motivation for doing it in such a complicated way?

Channelwise
My final question is regarding the channelwise coupling layer. As presented in the paper, this transformation of spatial dimensions into channel dimensions seems somewhat redundant. To me it looks like we could achieve the exact same result with a row-wise coupling, so what is the point? Am I missing something?

# CouplingLayer and SqueezeFlow are the classes from the tutorial 11 notebook.
x = torch.rand((1, 3, 32, 32))
N, C, H, W = x.shape
network = lambda x: torch.hstack((x, x))  # trivial stand-in for the coupling network
channelwise = lambda C: (torch.arange(C) % 2).reshape(C, 1, 1)  # alternating channel mask
rowwise = lambda H: (torch.arange(H) % 2).reshape(1, H, 1)      # alternating row mask

flow1 = [
    SqueezeFlow(),
    CouplingLayer(network, channelwise(4*C), c_in=1),
    CouplingLayer(network, 1 - channelwise(4*C), c_in=1),
    CouplingLayer(network, channelwise(4*C), c_in=1),
]
S = SqueezeFlow()

z1, idj = x, 0
for f in flow1:
    z1, idj = f(z1, idj)
z1, _ = S(z1, idj, reverse=True)

flow2 = [
    CouplingLayer(network, rowwise(H), c_in=1),
    CouplingLayer(network, 1 - rowwise(H), c_in=1),
    CouplingLayer(network, rowwise(H), c_in=1),
]
z2, idj = x, 0
for f in flow2:
    z2, idj = f(z2, idj)

print((z1==z2).all())

Additional context
I tried to keep it small and simple. I hope the questions make sense. Anyway, I love the content! It was really helpful! Thanks a lot for sharing :)

phlippe commented 1 year ago

Hi, thanks for raising this comparison, I didn't expect that there would be these differences! Regarding the split and squeeze: in general, either way (official implementation or notebook) is fine in theory, although I would expect the version in the tutorial to generalize a bit better. Splitting over pixel positions (in the official code) instead of channels (in the tutorial) seems to me much harder to learn, since you force the network to identify already at an early stage which pixels will be mapped to the prior, while the neural networks are convolutional and thus translation invariant (up to the padded borders). So in the end, the model might have to map everything to the prior distribution, which can limit its capacity. In comparison, if you split over channels, it is much easier for the model to simply dedicate some dimensions to be mapped to Gaussians, especially since we use a similar splitting strategy in the coupling layers. Nonetheless, I have not explicitly tested it, and simply thought the tutorial version was the more natural way.

For the coupling layers, the difference between channelwise and rowwise coupling is how the network perceives the input. In the channelwise case the network receives an input of 4x14x14, while in the rowwise case it receives 1x28x28, so different network architectures are learned. Another key difference appears once you start combining these layers with more sophisticated NF tricks. For instance, invertible 1x1 convolutions are often used to intermix the channels between coupling layers, and this has a different effect depending on whether you squeeze or not.
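To illustrate the shape difference concretely (a minimal sketch; the 28x28 size is taken from the reply above, the mask names are mine):

```python
import torch

x = torch.rand(1, 1, 28, 28)
N, C, H, W = x.shape

# Squeeze: trade spatial resolution for channels; a channelwise mask then hides
# whole channels, and the coupling network sees a 4x14x14 input.
z = x.reshape(N, C, H // 2, 2, W // 2, 2).permute(0, 1, 3, 5, 2, 4)
z = z.reshape(N, 4 * C, H // 2, W // 2)
print(z.shape)               # torch.Size([1, 4, 14, 14])

# No squeeze: a rowwise mask hides alternating rows, and the coupling network
# still sees the full-resolution 1x28x28 feature map.
row_mask = (torch.arange(H) % 2).reshape(1, 1, H, 1)
print((x * row_mask).shape)  # torch.Size([1, 1, 28, 28])
```

The same pixels are masked either way; what differs is the spatial extent and channel count of the tensor the coupling network convolves over.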

Hope that helps :)