Closed JohnYeung-dojjy closed 7 months ago
Hi, I was trying to predict body measurements from silhouette images, and after some iterations of testing I found your paper on the same topic. My model structure is mostly similar to what is proposed in your paper, except that I didn't train an autoencoder to encode my images, which is what I am going to try next. While I was copying your code, I found a mismatch between the code and the original paper.
According to the paper part 2.2:
In the proposed architecture, the encoder consists of five 3×3 convolutional layers with 32 filters each, followed by batch normalization and Leaky ReLU activation.
But in utils/model.py, ReLU activation is used
    class Deep2DEncoder(nn.Module):
        def __init__(self, image_size=512, kernel_size=3, n_filters=32,
                     dropout=False, drop_prob=0.2):
            super(Deep2DEncoder, self).__init__()
            self.image_size = image_size
            self.n_filters = n_filters
            self.conv1 = nn.Sequential(
                nn.Conv2d(1, n_filters, kernel_size=(kernel_size, kernel_size),
                          padding=(kernel_size//2, kernel_size//2)),
                nn.BatchNorm2d(n_filters),
                nn.ReLU(inplace=True),
            )
            self.pool = nn.MaxPool2d((2, 2), (2, 2))
            self.conv2 = nn.Sequential(
                nn.Conv2d(n_filters, n_filters, kernel_size=(kernel_size, kernel_size),
                          padding=(kernel_size//2, kernel_size//2)),
                nn.BatchNorm2d(n_filters),
                nn.ReLU(inplace=True),
            )
            self.fc = nn.Sequential(
                nn.Linear(math.ceil(image_size/32 * image_size/32 * n_filters), 256),
            )
Which one should I follow? Is there any difference in performance? Thanks
Hi, yes, the architecture proposed in the paper differs from the code provided. The code is just for reference, and the performance was almost the same. You can try the one proposed in the paper.
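For anyone who wants to try the paper's variant, one encoder block with the activation swapped might look like the sketch below. It keeps the `Conv2d` and `BatchNorm2d` settings from the posted `Deep2DEncoder`. The helper name `conv_block` and the `negative_slope=0.01` value (PyTorch's default) are assumptions, since neither the quoted paper sentence nor this thread states the slope.

```python
import torch
import torch.nn as nn


def conv_block(in_channels, n_filters=32, kernel_size=3, negative_slope=0.01):
    """One conv block: Conv2d -> BatchNorm2d -> LeakyReLU.

    Hypothetical helper; negative_slope is an assumed value (PyTorch default).
    """
    return nn.Sequential(
        nn.Conv2d(in_channels, n_filters, kernel_size=kernel_size,
                  padding=kernel_size // 2),
        nn.BatchNorm2d(n_filters),
        nn.LeakyReLU(negative_slope=negative_slope, inplace=True),
    )


block = conv_block(in_channels=1)
out = block(torch.randn(2, 1, 64, 64))  # padding keeps the spatial size
print(out.shape)  # torch.Size([2, 32, 64, 64])
```

Replacing each `nn.ReLU(inplace=True)` in the posted class with `nn.LeakyReLU(...)` would be the smallest change that matches the paper's description.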
Also, what is the effect of combining the front- and side-view image reconstruction losses? It seems more intuitive to train the front and side encoders separately, or at least to back-propagate the losses separately.
Combining losses might make the network more sensitive to differences between front and side views, potentially leading to better disentanglement of view-specific factors in the learned representations.
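One way to probe this question empirically is to compare gradients directly. The sketch below uses two tiny stand-in autoencoders (hypothetical sizes and names, not the repository's models) that share no parameters. Under that assumption, one backward pass on the summed loss yields the same gradients as two separate backward passes, so any difference between the two training schemes would have to come from components that both views share. Whether the repository's front and side networks share anything is not stated in this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-ins for the front and side autoencoders (hypothetical sizes).
front_ae = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 8))
side_ae = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 8))
front_img, side_img = torch.randn(5, 8), torch.randn(5, 8)
mse = nn.MSELoss()


def grads(model):
    """Copy the current gradients of every parameter in the model."""
    return [p.grad.clone() for p in model.parameters()]


def zero(*models):
    """Clear gradients on all given models."""
    for m in models:
        m.zero_grad()


# (a) One backward pass on the summed reconstruction loss.
zero(front_ae, side_ae)
loss = mse(front_ae(front_img), front_img) + mse(side_ae(side_img), side_img)
loss.backward()
combined = grads(front_ae) + grads(side_ae)

# (b) Two separate backward passes, one per view.
zero(front_ae, side_ae)
mse(front_ae(front_img), front_img).backward()
mse(side_ae(side_img), side_img).backward()
separate = grads(front_ae) + grads(side_ae)

same = all(torch.allclose(a, b) for a, b in zip(combined, separate))
print(same)
```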
Understood, thank you very much.
Oh, one more thing. I noticed that you used the same conv layer multiple times in the forward function
    def forward(self, x):  # 512 x 512
        x1 = self.conv1(x)
        x1 = self.pool(x1)  # 256 x 256
        x2 = self.conv2(x1)
        x2 = self.pool(x2)  # 128 x 128
        x3 = self.conv2(x2)
        x3 = self.pool(x3)  # 64 x 64
        x4 = self.conv2(x3)
        x4 = self.pool(x4)  # 32 x 32
        x5 = self.conv2(x4)
        x5 = self.pool(x5)  # 16 x 16
        flatten = x5.view(-1, math.ceil(self.image_size/32 * self.image_size/32 * self.n_filters))
        x6 = self.fc(flatten)
        return x6
From my understanding, this will reuse the same set of 32 filters, and their weights will be updated multiple times during backprop. Is this intended behaviour? This kind of operation seems uncommon in the machine learning community.
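On the mechanics: when one PyTorch module is called several times in `forward`, the gradient contributions from every call are summed into a single `.grad` tensor, and the optimizer then applies one update per step, not one per call. A toy sketch with a single shared weight (the layer and values are illustrative only, not taken from the repository):

```python
import torch
import torch.nn as nn

# A single shared layer with one weight, set to w = 3 (toy values).
shared = nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(3.0)

x = torch.tensor([[2.0]])
y = shared(shared(x))  # y = w * (w * x) = w^2 * x
y.backward()

# dy/dw = 2 * w * x = 12: each of the two calls contributes 6.
print(shared.weight.grad.item())  # 12.0

# Only one weight exists, no matter how many times the layer is called.
print(sum(p.numel() for p in shared.parameters()))  # 1
```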
The decision to reuse the same convolutional layer multiple times was made to simplify the network architecture and reduce the number of parameters. While this approach may seem unconventional, it is not uncommon in certain scenarios, particularly in cases where the network needs to progressively downsample the input.
By reusing the same convolutional layer with shared weights, we effectively enforce parameter sharing across different parts of the input space. This can encourage the network to learn more robust and generalizable features, as the same set of filters is applied to multiple regions of the input.
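The size of the parameter saving can be estimated from the shapes posted in this thread. The sketch below assumes the defaults `image_size=512` and `n_filters=32`, counts learnable parameters only (BatchNorm running statistics are excluded), and compares the posted design with a hypothetical unshared variant that uses four separate copies of `conv2`.

```python
def conv_bn_params(in_ch, out_ch, k=3):
    """Learnable parameters in one Conv2d(k x k) + BatchNorm2d block."""
    conv = in_ch * out_ch * k * k + out_ch  # weights + biases
    bn = 2 * out_ch                         # BatchNorm scale + shift
    return conv + bn


n_filters, image_size = 32, 512
conv1 = conv_bn_params(1, n_filters)          # 384
conv2 = conv_bn_params(n_filters, n_filters)  # 9312
fc_in = (image_size // 32) ** 2 * n_filters   # 8192
fc = fc_in * 256 + 256                        # 2097408

shared_total = conv1 + conv2 + fc        # conv2 reused four times
unshared_total = conv1 + 4 * conv2 + fc  # hypothetical: four separate copies
print(shared_total, unshared_total)  # 2107104 2135040
```

Under these assumptions, sharing saves 27,936 parameters, roughly 1.3% of the total, because the fully connected layer accounts for most of the encoder's parameters.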
Wow, this is very interesting. Thank you for explaining this to me. I had always considered reusing layers unpredictable and discouraged. I found the paper "Convolutional Neural Networks with Layer Reuse", which discusses this. Are there any other references that I can look into?