microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Activation Checkpointing conflicts with Weight Sharing #2103

Open iyupan opened 2 years ago

iyupan commented 2 years ago

Describe the bug I implement multiple transformer layers that share a single layer's parameters (e.g., the same layer is applied recursively six times to construct a 6-layer transformer). When I use activation checkpointing, an AssertionError is raised at line 631 of stage2.py.

To Reproduce This is the code that I use to call checkpointing.

# custom(l, l + n) builds the forward function for the next n (weight-shared) layers
hidden_states = torch.utils.checkpoint.checkpoint(
    custom(l, l + self.checkpoint_num_layers),
    hidden_states, attention_mask, padding_mask, bias_encoder)
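For reference, here is a minimal, self-contained sketch of the setup described above: a single transformer layer whose weights are reused for all six "virtual" layers, with each application wrapped in torch.utils.checkpoint.checkpoint. The SharedLayerTransformer class, the custom helper shown here, and all sizes are simplified stand-ins for my actual model, not the exact code that produced the error.

import torch
import torch.nn as nn
import torch.utils.checkpoint

class SharedLayerTransformer(nn.Module):
    # One parameterized layer, applied num_virtual_layers times (weight sharing).
    def __init__(self, hidden_size=64, num_virtual_layers=6, checkpoint_num_layers=1):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4, batch_first=True)
        self.num_virtual_layers = num_virtual_layers
        self.checkpoint_num_layers = checkpoint_num_layers

    def custom(self, start, end):
        # Returns a forward function that applies the shared layer for steps start..end-1.
        def custom_forward(hidden_states):
            for _ in range(start, end):
                hidden_states = self.layer(hidden_states)
            return hidden_states
        return custom_forward

    def forward(self, hidden_states):
        l = 0
        while l < self.num_virtual_layers:
            hidden_states = torch.utils.checkpoint.checkpoint(
                self.custom(l, l + self.checkpoint_num_layers),
                hidden_states)
            l += self.checkpoint_num_layers
        return hidden_states

model = SharedLayerTransformer()
# requires_grad=True on the input so the (reentrant) checkpoint keeps the graph alive
x = torch.randn(2, 8, 64, requires_grad=True)
model(x).sum().backward()  # the shared layer's gradients accumulate across all six applications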

Expected behavior I expect training to run normally with activation checkpointing enabled.

Unexpected behavior

AssertionError: The parameter 97 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported

Additional context deepspeed version: 0.3.16

tjruwase commented 2 years ago

@iyupan, thanks for reporting this issue.

To help investigate this, can you please provide repro steps?

Also, please clarify the expected behavior in this case. Should each parameter's gradient be accumulated six times during each backward pass?
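For concreteness, here is a small plain-PyTorch illustration (no DeepSpeed involved; the nn.Linear layer and sizes are just placeholders) of the accumulation behavior I am asking about:

import torch
import torch.nn as nn

# When the same layer is applied six times in one forward pass, autograd
# accumulates six gradient contributions into the shared parameters during
# a single backward pass.
layer = nn.Linear(4, 4)
h = torch.randn(2, 4)
for _ in range(6):
    h = layer(h)  # same parameters reused at every step
h.sum().backward()

# layer.weight.grad now holds the sum of the six per-application gradients.
print(layer.weight.grad.shape)  # torch.Size([4, 4])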