Open Sniper970119 opened 11 months ago
I use this code to log the grad norm; it works, but it noticeably affects the speed.
```python
from deepspeed.utils import safe_get_full_grad

def plot_gradients(model, step):
    ave_grads = []
    max_grads = []
    layers = []
    for n, p in model.named_parameters():
        # assemble the full gradient across ZeRO partitions
        grad = safe_get_full_grad(p)
        layers.append(n)
        ave_grads.append(grad.abs().mean().cpu())
        max_grads.append(grad.abs().max().cpu())
    return layers, ave_grads, max_grads

# inserted into the Trainer training loop:
# https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1813
if step % args.gradient_accumulation_steps == 0 and self.control.should_log:
    grad = plot_gradients(model, step)
    self.state.grad = grad
# https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1815
```
Throughput drops from about 180 TFLOPS to 120 TFLOPS.
Is there a lower-cost way to do this?
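If the slowdown comes mainly from safe_get_full_grad assembling the full (ZeRO-partitioned) gradient of every parameter on every logging step, one low-cost option is to gather only a few named layers and/or only every few logging steps. A minimal sketch of that idea; `layer_filter` and `every_n` are hypothetical arguments, not part of any existing API:

```python
from deepspeed.utils import safe_get_full_grad

def plot_gradients_cheap(model, step, layer_filter=("embed", "lm_head"), every_n=10):
    """Hypothetical lower-cost variant: only gather gradients for parameters whose
    name contains one of `layer_filter`, and only every `every_n` logging steps."""
    if step % every_n != 0:
        return None
    layers, ave_grads, max_grads = [], [], []
    for n, p in model.named_parameters():
        if not any(key in n for key in layer_filter):
            continue  # skip most parameters so their full grads are never assembled
        grad = safe_get_full_grad(p)
        if grad is None:
            continue
        layers.append(n)
        ave_grads.append(grad.abs().mean().item())
        max_grads.append(grad.abs().max().item())
    return layers, ave_grads, max_grads
```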
fun question
I want to use gradients to monitor whether the model is training properly, like this.
I changed transformers.Trainer (https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1857) by adding a function that reads param.grad and logs the grad norm just before zero_grad. But it does not work when the model runs with DeepSpeed: param.grad is None for all layers. I read the code and found a comment saying there is no way to return the gradients when using DeepSpeed: https://github.com/huggingface/accelerate/blob/69e4c3c54da3201eda288b500d138761e7a5221c/src/accelerate/accelerator.py#L1499
So how can I get the grad norm to track the training state while using DeepSpeed? I found a similar problem in Megatron-DeepSpeed, but there seems to be no follow-up: https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/149
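Two routes that avoid touching param.grad directly: per-parameter gradients can be read through deepspeed.utils.safe_get_full_grad (as in the snippet above), and for a single scalar, ZeRO already computes a global gradient norm for clipping, which recent DeepSpeed versions expose on the engine. A minimal sketch, assuming `engine` is the deepspeed.initialize()-wrapped model, `loss` comes from the forward pass, `step` is the step counter, and your DeepSpeed version provides get_global_grad_norm():

```python
from deepspeed.utils import safe_get_full_grad

# Inside the training loop; `engine`, `loss`, and `step` are assumed to exist.
engine.backward(loss)

# Per-parameter route: query gradients while they still exist, i.e. after
# backward() and before step(). The ZeRO-safe accessor assembles the
# partitioned gradient, so restrict it to a few layers to keep it cheap.
for name, param in engine.module.named_parameters():
    if "lm_head" in name:  # hypothetical filter
        full_grad = safe_get_full_grad(param)
        if full_grad is not None:
            print(f"step {step}: {name} grad norm = {full_grad.norm().item()}")

engine.step()

# Scalar route: the global grad norm ZeRO already computed for clipping;
# essentially free, if your DeepSpeed version exposes get_global_grad_norm().
grad_norm = engine.get_global_grad_norm()
if grad_norm is not None:
    print(f"step {step}: global grad norm = {grad_norm}")
```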