kundanthota / humanet


Encoder layer does not match paper description #1

Closed JohnYeung-dojjy closed 7 months ago

JohnYeung-dojjy commented 7 months ago

Hi, I was trying to predict body measurements from silhouette images, and after some iterations of testing I found your paper on the same topic. My model structure is mostly similar to what is proposed in your paper, except that I didn't train an autoencoder to encode my images, which is what I am going to try next. While copying your code, I found a mismatch between the code and the original paper.

According to the paper part 2.2:

In the proposed architecture, the encoder consists of five 3×3 convolutional layers with 32 filters each, followed by batch normalization and Leaky ReLU activation.

But in utils/model.py, ReLU activation is used:

import math

from torch import nn

class Deep2DEncoder(nn.Module):
    def __init__(self, image_size=512, kernel_size=3, n_filters=32, dropout=False, drop_prob=0.2):
        super(Deep2DEncoder, self).__init__()
        self.image_size = image_size
        self.n_filters = n_filters

        self.conv1 = nn.Sequential(
                nn.Conv2d(1, n_filters, kernel_size=(kernel_size,kernel_size), padding=(kernel_size//2,kernel_size//2)),
                nn.BatchNorm2d(n_filters),
                nn.ReLU(inplace = True),  # paper specifies Leaky ReLU
        )
        self.pool = nn.MaxPool2d((2,2),(2,2))
        self.conv2 = nn.Sequential(
                nn.Conv2d(n_filters, n_filters, kernel_size=(kernel_size,kernel_size), padding=(kernel_size//2,kernel_size//2)),
                nn.BatchNorm2d(n_filters),
                nn.ReLU(inplace = True),  # paper specifies Leaky ReLU
        )
        self.fc = nn.Sequential(
                nn.Linear(math.ceil(image_size/32 * image_size/32 * n_filters), 256),
        )

Which one should I follow? Is there any difference in performance? Thanks!

JohnYeung-dojjy commented 7 months ago

Also, what would be the effect of combining the front/side view image reconstruction losses? It seems more intuitive to train the front/side encoders separately, or at least back-propagate the losses separately.

kundanthota commented 7 months ago

> Which one should I follow? Is there any difference in performance?

Hi, yes, the architecture proposed in the paper is different from the code provided. The code is just for reference, and the performance was almost the same. You can try the one proposed in the paper.
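For reference, a minimal sketch of the conv block with Leaky ReLU as the paper describes it (the 0.01 negative slope is PyTorch's default and an assumption here, since the paper excerpt doesn't specify one):

import torch
from torch import nn

# Hypothetical variant of the conv block matching the paper's description:
# 3x3 conv, 32 filters, batch norm, Leaky ReLU.
def paper_conv_block(in_channels, n_filters=32, kernel_size=3):
    return nn.Sequential(
        nn.Conv2d(in_channels, n_filters, kernel_size=kernel_size, padding=kernel_size // 2),
        nn.BatchNorm2d(n_filters),
        nn.LeakyReLU(negative_slope=0.01, inplace=True),
    )

# Quick shape check on a dummy single-channel silhouette.
block = paper_conv_block(1)
print(block(torch.zeros(1, 1, 512, 512)).shape)  # torch.Size([1, 32, 512, 512])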

kundanthota commented 7 months ago

> Also, what would be the effect of combining the front/side view image reconstruction losses? It seems more intuitive to train the front/side encoders separately, or at least back-propagate the losses separately.

Combining the losses might make the network more sensitive to differences between the front and side views, potentially leading to better disentanglement of view-specific factors in the learned representations.
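Roughly, the difference between the two schemes looks like this (a standalone sketch with hypothetical tensor names, assuming an MSE reconstruction loss; this is not code from the repo):

import torch
import torch.nn.functional as F

# Dummy tensors standing in for reconstructions and target silhouettes.
front_recon = torch.rand(1, 1, 512, 512, requires_grad=True)
side_recon = torch.rand(1, 1, 512, 512, requires_grad=True)
front_img = torch.rand(1, 1, 512, 512)
side_img = torch.rand(1, 1, 512, 512)

# Combined: one backward pass over the summed losses, so gradients from
# both views reach any shared parameters in the same step.
loss = F.mse_loss(front_recon, front_img) + F.mse_loss(side_recon, side_img)
loss.backward()

# Separate: each view's loss is back-propagated on its own, e.g. with
# independent optimizers for the front and side encoders.
front_recon.grad = None  # reset before demonstrating the second variant
side_recon.grad = None
F.mse_loss(front_recon, front_img).backward()
F.mse_loss(side_recon, side_img).backward()

Note that summing then back-propagating once accumulates the same gradients as two separate backward calls; the two schemes only diverge if an optimizer step (or gradient clipping, etc.) happens between the per-view passes.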

JohnYeung-dojjy commented 7 months ago

Understood, thank you very much.

JohnYeung-dojjy commented 7 months ago

Oh, one more thing: I noticed that you use the same conv layer multiple times in the forward function:

def forward(self, x):       # input: 512 x 512
    x1 = self.conv1(x)
    x1 = self.pool(x1)      # 256 x 256
    x2 = self.conv2(x1)
    x2 = self.pool(x2)      # 128 x 128
    x3 = self.conv2(x2)     # conv2 reused
    x3 = self.pool(x3)      # 64 x 64
    x4 = self.conv2(x3)     # conv2 reused
    x4 = self.pool(x4)      # 32 x 32
    x5 = self.conv2(x4)     # conv2 reused
    x5 = self.pool(x5)      # 16 x 16

    flatten = x5.view(-1, math.ceil(self.image_size/32 * self.image_size/32 * self.n_filters))
    x6 = self.fc(flatten)

    return x6

From my understanding, this reuses the same set of 32 filters, so during backprop the gradients from all four applications accumulate into the same weights. Is this intended behaviour? This kind of operation seems uncommon in the machine learning community.

kundanthota commented 7 months ago

> From my understanding, this reuses the same set of 32 filters, so during backprop the gradients from all four applications accumulate into the same weights. Is this intended behaviour?

The decision to reuse the same convolutional layer multiple times was made to simplify the network architecture and reduce the number of parameters. While this approach may seem unconventional, it is not uncommon in certain scenarios, particularly in cases where the network needs to progressively downsample the input.

By reusing the same convolutional layer with shared weights, we effectively enforce parameter sharing across the different scales of the input. This can encourage the network to learn more robust and generalizable features, as the same set of filters is applied at multiple resolutions.
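As a rough illustration of the parameter saving (a standalone sketch, not the repo's code):

import torch
from torch import nn

# One conv block, reused at four successive scales.
conv = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
)
pool = nn.MaxPool2d(2, 2)

x = torch.rand(1, 32, 256, 256)
for _ in range(4):
    x = pool(conv(x))  # same weights applied at every resolution
print(x.shape)         # torch.Size([1, 32, 16, 16])

shared = sum(p.numel() for p in conv.parameters())
print(shared)      # 9312 parameters, shared across all four scales
print(4 * shared)  # 37248: what four independent copies would cost

During backprop, the gradients from all four applications accumulate into this single set of weights, so there is still exactly one update per optimizer step.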

JohnYeung-dojjy commented 7 months ago

Wow, this is very interesting, thank you for explaining it to me. I had always considered reusing layers unpredictable and discouraged. I found the paper Convolutional Neural Networks with Layer Reuse, which discusses this; are there any other references I can look into?