According to the code, only the teacher model is wrapped for distributed (multi-GPU) computation: https://github.com/huawei-noah/Pretrained-Language-Model/blob/a8a705e9c8c952e078b45d1091d3f0ed161483d8/TinyBERT/general_distill.py#L348-L358
However, this contradicts my understanding. In fact, the teacher model does not need to be synchronized, because it is static and never updated, so there are no gradients to average. The student model is the one that needs to be synchronized; otherwise its gradients are not all-reduced across GPUs, and training effectively degrades to independent single-GPU training.
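To make the point concrete, here is a minimal sketch of how I would expect the wrapping to look, based on my understanding rather than the repo's actual code. The `teacher_model`/`student_model` names and the `nn.Linear` placeholders are purely illustrative stand-ins for the BERT models built in general_distill.py:

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder modules standing in for the teacher/student BERT models;
# any nn.Module behaves the same way with respect to DDP.
teacher_model = nn.Linear(768, 768)
student_model = nn.Linear(768, 768)

# Assumes the script is launched with torchrun / torch.distributed.launch.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
device = torch.device("cuda", local_rank)

# Teacher is frozen: move it to the GPU, disable gradients, skip DDP.
teacher_model = teacher_model.to(device).eval()
for p in teacher_model.parameters():
    p.requires_grad_(False)

# Student is trainable: wrap it in DDP so its gradients are all-reduced
# across ranks on every backward(); without this, each GPU trains its
# own independent copy of the student.
student_model = DDP(student_model.to(device), device_ids=[local_rank])
```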
Is my understanding wrong? Thanks in advance for any clarification!