janericlenssen / group_equivariant_capsules_pytorch

PyTorch implementation of Group Equivariant Capsule Networks
29 stars 8 forks

space_to_depth function meaning? #3

Open MiZhangWhuer opened 5 years ago

MiZhangWhuer commented 5 years ago

Hi, thanks for sharing this wonderful repo. I found that there is a function space_to_depth defined as follows:

    def space_to_depth(input, block_size):
        block_size = int(block_size)
        block_size_sq = block_size * block_size
        output = input
        (batch_size, s_height, s_width, s_depth, s_posev) = output.size()
        d_depth = s_depth * block_size_sq
        d_height = int((s_height + (block_size - 1)) / block_size)
        t_1 = output.split(block_size, 2)
        stack = [
            t_t.contiguous().view(batch_size, d_height, 1, d_depth, s_posev)
            for t_t in t_1
        ]
        output = torch.cat(stack, 2)
        return output

What is the exact meaning of this function when processing pose and agreement in the forward pass?

def forward(self, x, a, pose, size):
    pooled_size = (size[0], int((size[1] + 1) / (self.pool_length)),
                   int((size[2] + 1) / (self.pool_length)))

    a = a.view(*size, self.in_channels)
    pose = pose.view(*size, self.in_channels, 2)

    a = space_to_depth(a.unsqueeze(-1), self.pool_length).squeeze(-1)
    pose = space_to_depth(pose, self.pool_length)

    pose = pose.view(*pooled_size, self.pool_size, self.in_channels, 2)
    a = a.view(*pooled_size, self.pool_size, self.in_channels)
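As a concrete illustration of the reshapes in this forward pass, the sketch below traces the shapes with made-up sizes (N=2, H=W=28, in_channels=8, pool_length=2). It uses a simple view/permute re-implementation of space_to_depth that reproduces the shapes, though not necessarily the repo function's exact depth ordering:

```python
import torch

def space_to_depth_sketch(x, block):
    # [N, H, W, D, P] -> [N, H//block, W//block, D*block*block, P]
    n, h, w, d, p = x.shape
    x = x.view(n, h // block, block, w // block, block, d, p)
    x = x.permute(0, 1, 3, 2, 4, 5, 6).contiguous()
    return x.view(n, h // block, w // block, d * block * block, p)

N, H, W, C = 2, 28, 28, 8          # illustrative sizes
pool_length, pool_size = 2, 4      # 2x2 receptive fields -> 4 capsules each

a = torch.randn(N, H, W, C)        # agreements
pose = torch.randn(N, H, W, C, 2)  # 2D poses

a = space_to_depth_sketch(a.unsqueeze(-1), pool_length).squeeze(-1)
pose = space_to_depth_sketch(pose, pool_length)
# a: [2, 14, 14, 32], pose: [2, 14, 14, 32, 2]

a = a.view(N, H // 2, W // 2, pool_size, C)
pose = pose.view(N, H // 2, W // 2, pool_size, C, 2)
# a: [2, 14, 14, 4, 8] -- the 4 capsules of each 2x2 field now sit in
# their own dimension, ready for a reduce operation over dim 3
```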
janericlenssen commented 5 years ago

Hi, the function moves all input capsules (poses + agreements) of one receptive field into an additional depth dimension of the array, so that receptive-field aggregation can afterwards be done by performing a reduce operation over that depth dimension.

An example application for max pooling (without capsules) would be:

  1. [N, Y, X, C] ---(space_to_depth)--> [N, Y/2, X/2, 4, C]
  2. result = tensor.max(3)

(note that this is not only reshaping/viewing because memory layout changes)

It should be noted that the function provided in the repo currently only works if the X/Y sizes are divisible by the pool_length. There is currently no padding for cases where the dimensions don't match.
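The two-step pattern above can be sketched as follows. This is a minimal view/permute version of the space-to-depth idea (not the repo's exact function), cross-checked against PyTorch's built-in channels-first max pooling:

```python
import torch
import torch.nn.functional as F

def space_to_depth_2d(x, block):
    # [N, Y, X, C] -> [N, Y//block, X//block, block*block, C]
    n, y, xdim, c = x.shape
    x = x.view(n, y // block, block, xdim // block, block, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(n, y // block, xdim // block, block * block, c)

x = torch.randn(2, 8, 8, 3)
depth = space_to_depth_2d(x, 2)      # [2, 4, 4, 4, 3]
pooled = depth.max(dim=3).values     # reduce over the receptive field

# cross-check against the built-in channels-first max pooling
ref = F.max_pool2d(x.permute(0, 3, 1, 2), 2).permute(0, 2, 3, 1)
assert torch.equal(pooled, ref)
```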

MiZhangWhuer commented 5 years ago

Thanks for your rapid reply. By the way, is there any example of validation on a larger dataset, such as CIFAR? I have tried to apply the approach to larger inputs, i.e. 256 x 256, but it failed due to a mismatch in the number of input and output elements. Here is what I have changed in the source code:

  1. Inputs remain MNIST with a padding size of 114, so the input size is 256 x 256 (28 + 2 * 114 = 256):

         for i, (img_batch, target) in enumerate(train_loader):
             img_batch_28, target = img_batch.to(device), target.to(device)
             img_batch = torch.nn.ZeroPad2d(114)(img_batch_28)
             img_batch = img_batch.squeeze(1).unsqueeze(-1)
             optimizer.zero_grad()

  2. Grid sizes begin at 256 x 256 and decrease by a factor of 2:

         device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
         (e1, p1), c1 = grid(256, 256, device=device), grid_cluster(256, 256, 2, device)
         (e2, p2), c2 = grid(128, 128, device=device), grid_cluster(128, 128, 2, device)
         (e3, p3), c3 = grid(64, 64, device=device), grid_cluster(64, 64, 2, device)
         (e4, p4), c4 = grid(32, 32, device=device), grid_cluster(32, 32, 2, device)
         (e5, p5), c5 = grid(16, 16, device=device), grid_cluster(16, 16, 2, device)

My entire code is pasted in the attachment mnist.py.txt

(I haven't changed anything else except mnist.py.) Do you have any suggestions on modifying the code for larger inputs?

janericlenssen commented 5 years ago

Hi, we have not run experiments on such large input images yet. The architecture you propose will run into problems with the number of output capsules, since it does not reduce the spatial dimensions to 1 x 1 at the end. You need to either increase the pooling receptive field, increase the number of layers, or come up with a global aggregation layer at the end that maintains equivariance.
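As a back-of-the-envelope check (assuming one pool_length = 2 reduction per stage, as in the five grids listed above):

```python
# Halvings needed to reduce a 256x256 grid to 1x1 with pool_length = 2
size, stages = 256, 0
while size > 1:
    size //= 2
    stages += 1
print(stages)  # 8
```

Five pooling stages only reach 256 / 2**5 = 8, i.e. an 8 x 8 grid of output capsules rather than 1 x 1, which matches the element-count mismatch described above.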