Hi Ratish, I have a question here. After _compute_loss() and loss.div(normalization).backward(), why does the shards() function also call torch.autograd.backward(inputs, grads, retain_graph=retain_graph)? Isn't the usual approach to run the backward pass from the loss alone? Why call backward again on the inputs?
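To make the pattern I'm asking about concrete, here is a minimal sketch (not the OpenNMT-py code itself; the tensors and shapes are hypothetical stand-ins) of a two-stage backward over detached shards, where a second torch.autograd.backward(inputs, grads) is needed to push the shard gradients back into the full graph:

```python
import torch

# "Full" computation graph up to the decoder output.
x = torch.randn(4, 3, requires_grad=True)
output = x * 2                     # stands in for the decoder output

# Shard the output; each shard is detached so its loss.backward() stays local.
shard_states = []
for chunk in output.split(2, dim=0):
    detached = chunk.detach()
    detached.requires_grad_(True)
    shard_states.append((chunk, detached))

# Stage 1: per-shard loss + backward, which only reaches the detached copies.
for _, detached in shard_states:
    loss = detached.sum()          # stands in for _compute_loss()
    loss.div(2.0).backward()       # gradients accumulate in detached.grad

# Stage 2: propagate the collected shard gradients back into the full graph;
# this is the role I understand the extra backward call in shards() to play.
inputs = [chunk for chunk, _ in shard_states]
grads = [detached.grad for _, detached in shard_states]
torch.autograd.backward(inputs, grads)

print(x.grad)                      # gradient has now reached the model input
```

Is this roughly what happens, i.e. the per-shard loss.backward() only reaches the detached shard tensors, so the second call is required to continue backpropagation into the rest of the model?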