Hi, thanks for your interest in the method and the paper. I will try to clarify your doubts:
First of all, you are right: if the BatchNorm layers are not switched to eval mode, their running statistics continue to be updated, which degrades performance on the old task. We have taken care of this, however. Whenever we say that an encoder or a decoder is not trained, that includes setting its BatchNorm layers to eval mode, so the statistics are not updated.
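To make this concrete, freezing a part of the network looks roughly like this (a minimal PyTorch sketch; `freeze` is my own helper for illustration, not code from this repo):

```python
import torch.nn as nn

def freeze(module: nn.Module):
    """'Not trained' means both: no gradient updates AND no BN statistic updates."""
    for p in module.parameters():
        p.requires_grad = False      # freeze weights and biases
    for m in module.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()                 # freeze running mean/var as well
```

One gotcha: calling `model.train()` at the start of an epoch flips every submodule, including the frozen BatchNorm layers, back to train mode, so the `eval()` calls have to be re-applied afterwards.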
The feature extraction model uses BatchNorm statistics (the layers are part of the ERFNet architecture) computed on the old task data; they are not updated during the extension step. If you take another look at Table 2, you will see that the 86.2% comes from the stage 2 model, and the FE stage 3 model retains exactly this value. The 88.1% comes from the single-stage training baseline, where all classes are trained at once. That baseline is not the model used as the basis for extension, so there is actually no drop from 88.1% to 86.2%; rather, the 86.2% from the second stage is retained.
In my opinion, the formally correct way would be to also fix the statistics of the BatchNorm layers of the old decoders and the encoder, and to update only the new decoder. I could, however, imagine that updating the statistics of the encoder improves performance on the new task. Especially if there is a domain gap between the old task's data and the new task's data, it has been shown that in most cases it is beneficial to use statistics from the new task/domain; I recently did some experiments on this: https://arxiv.org/abs/2011.08502 . You would, however, be at risk of losing performance on the old task/domain.
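Concretely, the two options could look like this (again just a sketch; `encoder`, `old_decoders`, and `adapt_encoder_bn` are placeholder names, not the ones used in this repo):

```python
import torch.nn as nn

def set_bn_eval(module: nn.Module):
    # Put only the BatchNorm layers of `module` into eval mode.
    for m in module.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()

def prepare_for_new_task(model: nn.Module, adapt_encoder_bn: bool):
    model.train()                      # resets ALL BN layers to train mode
    set_bn_eval(model.old_decoders)    # old decoders always keep the old stats
    if not adapt_encoder_bn:
        set_bn_eval(model.encoder)     # "formally correct": encoder keeps old stats
    # With adapt_encoder_bn=True the encoder's stats follow the new domain,
    # which may help the new task but risks old-task performance.
    # The new decoder's BN layers stay in train mode in either case.
```

Since `model.train()` undoes the `eval()` calls, something like `prepare_for_new_task` would have to be called at the start of every epoch instead of a plain `model.train()`.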
Regarding training details: we basically add a new decoder, freeze certain parts of the network, and continue training. However, I guess this is not what you are aiming at, so could you be a little more specific?
Thanks for your reply! That answers most of my doubts.
I understand that the 88.1% model is not the one used to initialise the old decoder head and the encoder in the FT and FE experiments. To reiterate: in your FE experiments, you fixed the old BN layers, i.e. the BN statistics of the old decoder and the shared encoder were not updated, while the BN layers of the new task decoder were updated based on the new data statistics. Is that what you meant?
Actually, I'm running experiments on ERFNet to study transferability between two segmentation datasets from different domains, attaching a new decoder head and performing FT and FE for the new domain. Suppose the correct way of using BN for the old and new decoders is to use domain-specific statistics (old data stats for the old decoder, new data stats for the new one). But as you said, the choice of which statistics the encoder's BN layers use affects the stability-plasticity trade-off, favouring either the new task or retention of old task performance. Since your work is one of the first on ICL for semantic segmentation, I was trying to understand the conventional and correct way of performing the FT and FE experiments.
Do you think that training with the encoder's old BatchNorm layers fixed, while updating the ones in the new decoder, could lead to any issues during training?
During the FE/FT stage 3 training, the old decoders and the shared encoder are initialised from the stage 2 checkpoints, right?
Lastly, is there any particular reason for choosing ERFNet as the baseline architecture?
That is good to hear! With regard to your questions, I will try to give an answer based on the insights I have so far:
Thanks for the clarifications!
Hi, thanks for the paper and code. My doubt is with respect to the feature extraction setting, in which a new decoder head is trained on the new task while the shared encoder and the old task decoder are frozen. In such cases, even with the network weights frozen, the BatchNorm running estimates are updated according to the new task data, so the performance on the old task degrades sharply due to batch normalization alone.
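To illustrate what I mean, here is a minimal standalone example (my own snippet, not code from this repo):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)                 # BN layer is in train mode by default
for p in bn.parameters():
    p.requires_grad = False            # "freeze" the affine weight and bias

before = bn.running_mean.clone()
with torch.no_grad():
    bn(torch.randn(8, 3, 16, 16) + 5.0)   # forward pass on shifted "new task" data

print(torch.allclose(before, bn.running_mean))   # False: the stats moved anyway
bn.eval()                              # only eval mode stops these updates
```

So freezing the weights alone is not enough; the running estimates still drift towards the new task's statistics.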
Can you please tell me whether the feature extraction model used batch normalization statistics from the old task data or the new task data? I'm referring to the 86.2% mIoU value on Cityscapes in Table 2 of the paper. If the weights were frozen, why is there a drop from 88.1% to 86.2% on the old task in the feature extraction setting?
In such cases, when we train the feature extraction and fine-tuning baselines, what is the correct way to handle the batch normalization layers? Which parts (encoder, old decoder, new decoder) should use the running statistics of which task (old/new)?
Please share some training details of the fine-tuning and feature extraction experimental settings.