Also, I would like to know why there is no gradient backpropagation to the teacher network.
The teacher is only updated by an exponential moving average of the student's weights, which implements a temporal ensemble. Therefore, no gradients are backpropagated into the teacher. Please have a look at https://arxiv.org/pdf/1703.01780.pdf for further details.
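For concreteness, here is a minimal sketch of such an EMA update; the `Linear` modules, the `ema_update` helper, and the `alpha` value are hypothetical stand-ins, not code from this repository:

```python
import torch

teacher = torch.nn.Linear(8, 3)  # hypothetical stand-ins for the
student = torch.nn.Linear(8, 3)  # actual networks in this repo

@torch.no_grad()  # the update itself is excluded from the autograd graph
def ema_update(teacher, student, alpha=0.999):
    # Per parameter: teacher <- alpha * teacher + (1 - alpha) * student
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(alpha).add_(s, alpha=1.0 - alpha)

ema_update(teacher, student)  # called once per training step
```

Because the update runs under `torch.no_grad()` and only copies statistics of the student's weights, the teacher never needs gradients of its own.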
Thank you for your answer, but since the teacher network does not seem to have gradients flowing back into it during training, why is "detach()" needed?
The mix_loss.backward() call uses pseudo-labels predicted by the teacher network. Without the detach(), gradients could be backpropagated into the teacher.
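A minimal sketch of why the detach() matters, with hypothetical stand-in models and a soft pseudo-label loss rather than the exact loss used in this repo:

```python
import torch
import torch.nn.functional as F

teacher = torch.nn.Linear(8, 3)  # hypothetical stand-ins for the
student = torch.nn.Linear(8, 3)  # actual networks in this repo
batch = torch.randn(4, 8)

# Without .detach(), the teacher's forward pass would remain in the
# autograd graph, and backward() would compute gradients for the teacher.
teacher_probs = torch.softmax(teacher(batch), dim=1).detach()

student_logits = student(batch)
# Cross-entropy of the student against the detached teacher predictions.
mix_loss = -(teacher_probs * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()
mix_loss.backward()  # gradients flow into the student only

assert all(p.grad is None for p in teacher.parameters())  # teacher untouched
```

The detach() cuts the graph at the teacher's output, so backward() stops at the pseudo-labels and only the student receives gradients.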
Dear author, is the ema_model trainable in any way other than being updated through the EMA?