Hi, I got a question regarding the input_features data for PoseDecoder network.
From the line below, the PoseDecoder accepts an input feature with a number of channels equal to self.num_ch_enc[-1], which, according to the ResnetMultiImageInput encoder, should be 512.
self.convs[("squeeze")] = nn.Conv2d(self.num_ch_enc[-1], 256, 1)
However, the output features of the ResnetEncoder have the following shapes, which would mean that only the last element of the features array is accepted by the PoseDecoder:
torch.Size([1, 64, 320, 96])
torch.Size([1, 64, 160, 48])
torch.Size([1, 128, 80, 24])
torch.Size([1, 256, 40, 12])
torch.Size([1, 512, 20, 6])
Perhaps I am reading the code wrongly, so I would appreciate it if anyone could explain this to me. Thank you so much!
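To make the mismatch I'm asking about concrete, here is a minimal sketch with dummy tensors of the shapes listed above (the num_ch_enc values are my assumption, taken from a ResNet-18-style encoder). It shows that only the last feature map has the channel count the squeeze conv expects:

```python
import torch
import torch.nn as nn

# Assumed encoder channel counts for a ResNet-18-style ResnetEncoder
num_ch_enc = [64, 64, 128, 256, 512]

# The "squeeze" conv from the question: in_channels = num_ch_enc[-1] = 512
squeeze = nn.Conv2d(num_ch_enc[-1], 256, 1)

# Dummy tensors with the encoder output shapes from the question
features = [
    torch.zeros(1, 64, 320, 96),
    torch.zeros(1, 64, 160, 48),
    torch.zeros(1, 128, 80, 24),
    torch.zeros(1, 256, 40, 12),
    torch.zeros(1, 512, 20, 6),
]

# Only the last feature map matches the conv's expected input channels;
# passing any earlier one raises a channel-mismatch RuntimeError
out = squeeze(features[-1])
print(out.shape)  # torch.Size([1, 256, 20, 6])
```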