microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

How can I extract and log grad norm for individual layers #4555

Open Sniper970119 opened 11 months ago

Sniper970119 commented 11 months ago

I want to use gradients to monitor whether the model is training properly, like this:

[screenshot: example per-layer gradient plot]

I changed transformers.Trainer (https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1857) by adding a function that reads param.grad to log the gradient norms before zero_grad() is called.
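
Roughly, the hook looks like the minimal sketch below (simplified, the helper name is mine; plain PyTorch, no DeepSpeed involved):

def log_grad_norms(model):
    # Collect the L2 norm of param.grad for every layer.
    # Call this after loss.backward() and before optimizer.zero_grad().
    norms = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            norms[name] = param.grad.detach().norm(2).item()
    return norms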

But this does not work when the model runs under DeepSpeed: param.grad is None for every layer. I read the code and found a comment saying the gradient norm cannot be returned when DeepSpeed is used:

elif self.distributed_type == DistributedType.DEEPSPEED:
    # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
    # We cannot return the gradient norm because DeepSpeed does it.
    return None

https://github.com/huggingface/accelerate/blob/69e4c3c54da3201eda288b500d138761e7a5221c/src/accelerate/accelerator.py#L1499
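
As far as I can tell, the aggregated norm that DeepSpeed computes can be read back from the engine after each step, but that is only the global value, not per-layer. A sketch (unverified across ZeRO stages; `engine` is the DeepSpeedEngine returned by deepspeed.initialize(), i.e. the DeepSpeed-wrapped model under the HF Trainer):

def log_global_grad_norm(engine, step):
    # Norm computed by DeepSpeed during the most recent optimizer step
    # (may be None before the first step).
    norm = engine.get_global_grad_norm()
    if norm is not None:
        print(f"step {step}: global grad norm = {norm:.4f}")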

So how can I get per-layer gradient norms to track the training state when using DeepSpeed?

I found a similar question for Megatron-DeepSpeed, but there seems to have been no follow-up:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/149

Sniper970119 commented 11 months ago

I use the code below to log the gradient norms. It works, but it seems to hurt throughput.


from deepspeed.utils import safe_get_full_grad

def plot_gradients(model, step):
    """Collect per-layer mean/max absolute gradients (step is currently unused)."""
    ave_grads = []
    max_grads = []
    layers = []
    for n, p in model.named_parameters():
        # safe_get_full_grad gathers the full gradient even when it is partitioned by ZeRO.
        grad = safe_get_full_grad(p)
        layers.append(n)
        ave_grads.append(grad.abs().mean().cpu())
        max_grads.append(grad.abs().max().cpu())

    return layers, ave_grads, max_grads

# Patched into https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1813

if step % args.gradient_accumulation_steps == 0 and self.control.should_log:
    # Only collect gradients on logging steps, right before the optimizer step.
    grad = plot_gradients(model, step)
    self.state.grad = grad

# https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1815

Throughput drops from 180 TFLOPS to 120 TFLOPS.
Is there a lower-cost method?
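
One thing I am considering (unverified whether it restores the full throughput): only gather gradients for a small, hand-picked subset of layers, and do a single device-to-host copy instead of one .cpu() call per parameter. Something like this, where the watched layer names are placeholders for whatever layers are worth monitoring:

from deepspeed.utils import safe_get_full_grad
import torch

# Hypothetical subset of layers to watch; adjust to the actual model.
WATCHED_SUBSTRINGS = ("embed_tokens", "layers.0.", "lm_head")

def cheap_grad_stats(model):
    names, means, maxes = [], [], []
    for name, param in model.named_parameters():
        if not any(s in name for s in WATCHED_SUBSTRINGS):
            continue
        grad = safe_get_full_grad(param)  # gathers the full gradient under ZeRO
        if grad is None:
            continue
        names.append(name)
        means.append(grad.abs().mean())
        maxes.append(grad.abs().max())
    if not names:
        return [], [], []
    # Single device-to-host transfer instead of one .cpu() per parameter.
    return names, torch.stack(means).cpu(), torch.stack(maxes).cpu()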

benywon commented 8 months ago

fun question