rishizek / tensorflow-deeplab-v3

DeepLabv3 built in TensorFlow

Network initialization #5

Closed: qilinli closed this issue 6 years ago

qilinli commented 6 years ago

Hi, @rishizek. Thanks for sharing your code. I am a bit confused about the weight initialization of DeepLabv3. As described in the original paper, the backbone of DeepLabv3 is ResNet, which is pre-trained on ImageNet. What about the ASPP module? It cannot be pre-trained on ImageNet, right? How do you initialize the ASPP module?

rishizek commented 6 years ago

Hi, @qilinli , You are right. Because ASPP doesn't exist in ResNet, its weights are initialized by the default initializer of each layer. For example, the weights of layers_lib.conv2d are, I believe, initialized by xavier_initializer(), etc.
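
For reference, here is a minimal sketch (TF 1.x) of that behaviour; it is not this repo's exact code, and aspp_branch is just an illustrative name:

```python
from tensorflow.contrib import layers as layers_lib

def aspp_branch(inputs, depth=256, rate=6):
    # No weights_initializer is passed, so conv2d falls back to its default,
    # xavier_initializer(); these weights start from random values rather
    # than from the ImageNet-pretrained ResNet checkpoint.
    return layers_lib.conv2d(inputs, depth, [3, 3], rate=rate,
                             scope='aspp_rate{}'.format(rate))
```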

qilinli commented 6 years ago

@rishizek I see. It seems that nowadays no one explains random initialisation anymore :). BTW, I am also curious about the use of the Batch Normalisation (BN) layer.

  1. The paper claims that BN is trained with OS=16 on COCO and the trainaug set of VOC, and then BN is frozen while fine-tuning on the trainval set with OS=8. So which parameters are frozen? The mean and variance? Alpha and beta? Or all of them?
  2. At inference time, the input size can differ from the one used in training, so how can BN work? The shapes in x - mean should not match, right?
  3. Even with OS=16, I couldn't train with crop_size=[513,513] and batch_size=16 on a GTX 1080 Ti. The authors are using some enterprise-level GPU, right?

rishizek commented 6 years ago

Hi @qilinli , let me answer your questions as far as I know:

  1. Good question! I thought we only needed to freeze beta and gamma and implemented it that way. But reconsidering it after your question and doing some searching, it seems better to fix the moving mean and variance as well.
  2. Right, during inference the input size can be different. However, here BN is applied to ConvNets, so unlike the fully connected case, its mean and variance are computed over each feature map, i.e., over batch x height x width for each channel. Thus, the size of the input image is irrelevant for this model (see the sketch after this list).
  3. The paper says they use a K80, which has more memory than a GTX 1080 Ti, so with a K80 one can train the model with batch_size=16. Another potential reason is that they implement the model better :)
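
A minimal sketch of point 2 (TF 1.x, not the repo's code): BN on conv features reduces over batch, height and width, producing one statistic per channel, so the spatial size never enters the parameter shapes.

```python
import tensorflow as tf

# Feature maps of two different spatial sizes, same channel count.
x_small = tf.random_normal([4, 65, 65, 256])
x_large = tf.random_normal([4, 129, 129, 256])

# One mean/variance per channel, reduced over batch, height and width.
mean_s, var_s = tf.nn.moments(x_small, axes=[0, 1, 2])  # shape [256]
mean_l, var_l = tf.nn.moments(x_large, axes=[0, 1, 2])  # shape [256]
# gamma, beta and the moving mean/variance are likewise per-channel vectors of
# shape [256], so the input image size does not matter at inference time.
```
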
qilinli commented 6 years ago

@rishizek Thanks for sharing your thoughts. I have some different opinions.

  1. It makes more sense to fix the moving mean and variance. As the paper says, the BN normalization statistics (mean and std) are computed with batchSize=16: we need a large batch size to estimate the statistics, but alpha and beta can be learned by back-propagation even with batchSize=1. Intuitively, I see no reason to fix alpha and beta (see the sketch below).
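
A minimal sketch of that setup, assuming tf.layers.batch_normalization (not necessarily the layer this repo uses): with training=False the layer normalizes with the frozen moving mean/variance and adds no update ops, while gamma and beta remain trainable and can still be learned even with batch size 1.

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 65, 65, 256])
y = tf.layers.batch_normalization(
    x,
    training=False,   # use the frozen moving mean/variance, create no update ops
    trainable=True)   # gamma and beta remain trainable and still receive gradients
```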

Thanks for telling me this. The BN parameters follow the same parameter-sharing rule as a conv layer. I definitely missed this. :(

FYI, the author of deeplab released an official version of deeplabv3+ at https://github.com/tensorflow/models/tree/master/research/deeplab

rishizek commented 6 years ago

Hi @qilinli , No problem. Thank you for sharing your thoughts. Let me comment on Q1:

  1. You are right, BN can be learned even with batchSize=1. However, I think the reason they freeze BN with OS=8 is not theoretical: they simply cannot fit the model in GPU memory with OS=8, since OS=8 requires even more GPU memory than OS=16. Also, they apparently freeze BN to save GPU memory, as you can see in their code (a sketch of the pattern is below).
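
Not their actual code, but a minimal sketch of the slim-style pattern for freezing BN during fine-tuning, assuming tf.contrib.slim; the fine_tune_batch_norm and build_model names here are hypothetical:

```python
from tensorflow.contrib import slim

def frozen_bn_scope(fine_tune_batch_norm=False):
    # With is_training=False, every slim.batch_norm inside this scope normalizes
    # with its stored moving mean/variance and creates no moving-average update
    # ops, which is the memory/compute saving mentioned above.
    return slim.arg_scope([slim.batch_norm], is_training=fine_tune_batch_norm)

# Usage sketch (build_model is a hypothetical model-building function):
# with frozen_bn_scope(fine_tune_batch_norm=False):
#     logits = build_model(images)
```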

Yes, I'm aware of that, even though I haven't had enough time to investigate their code in detail yet ;)