richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)

Question about the use of with no grad #12

Closed · xiayandi closed this issue 3 years ago

xiayandi commented 3 years ago

Hi Richard,

Thank you so much for the great work on this replication. It has helped me a lot in understanding ELECTRA in detail.

I have one question regarding your use of `with torch.no_grad` in: https://github.com/richarddwang/electra_pytorch/blob/master/pretrain.py#L294

In my understanding, none of the tensors inside that scope require gradients anyway. Is there a specific reason you use torch.no_grad here?

Thanks!

richarddwang commented 3 years ago

Hi @xiayandi

  1. As explained in the paper, gradient flow is stopped at the sampling step between the generator and the discriminator. More precisely, even without torch.no_grad, gradient computation would already be interrupted at the argmax in self.sample.

  2. I use torch.no_grad for two reasons: it makes explicit to readers that no gradient flows through this part of the forward pass, and it skips building the autograd graph for these operations, which saves memory and computation (see the sketch below).
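
For anyone reading along, here is a minimal sketch of the two points above. It is not the repository's actual code; the argmax/multinomial calls stand in for what self.sample does in pretrain.py.

```python
import torch

# Pretend generator output over a vocabulary of 10 tokens for 4 positions.
gen_logits = torch.randn(4, 10, requires_grad=True)

# 1. argmax is a discrete, non-differentiable operation, so the autograd chain
#    from the discriminator's inputs back to the generator is already broken here.
sampled_ids = gen_logits.argmax(dim=-1)       # integer tensor, requires_grad == False
print(sampled_ids.requires_grad)              # False

# 2. Wrapping the same kind of ops in torch.no_grad() additionally skips recording
#    the autograd graph for everything inside, saving memory and computation and
#    making the "no gradient here" intent explicit to readers.
with torch.no_grad():
    probs = torch.softmax(gen_logits, dim=-1)   # no graph is recorded for this op
    sampled_ids2 = torch.multinomial(probs, 1)  # sampling is likewise non-differentiable
print(probs.requires_grad)                      # False, even though gen_logits requires grad
```
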

Please tag me if there is anything else I can help with.