microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Activation Checkpointing conflicts with Weight Sharing #2103

Open iyupan opened 2 years ago

iyupan commented 2 years ago

Describe the bug I implement multiple transformer layers that share a single layer's parameters (e.g., the same layer is applied recursively six times to construct a 6-layer transformer). When I use activation checkpointing, an AssertionError is raised at line 631 of stage2.py.

To Reproduce This is the code that I use to call checkpointing.

# custom(l, l + n) builds the forward function for the next n (weight-shared) layers
hidden_states = torch.utils.checkpoint.checkpoint(
    custom(l, l + self.checkpoint_num_layers),
    hidden_states, attention_mask, padding_mask, bias_encoder)
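For reference, here is a minimal, self-contained sketch of the setup described above: a single transformer layer whose weights are reused for all six "virtual" layers, with each application wrapped in torch.utils.checkpoint.checkpoint. The SharedLayerTransformer class, the custom helper shown here, and all sizes are simplified stand-ins for my actual model, not the exact code that produced the error.

import torch
import torch.nn as nn
import torch.utils.checkpoint

class SharedLayerTransformer(nn.Module):
    # One parameterized layer, applied num_virtual_layers times (weight sharing).
    def __init__(self, hidden_size=64, num_virtual_layers=6, checkpoint_num_layers=1):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4, batch_first=True)
        self.num_virtual_layers = num_virtual_layers
        self.checkpoint_num_layers = checkpoint_num_layers

    def custom(self, start, end):
        # Returns a forward function that applies the shared layer for steps start..end-1.
        def custom_forward(hidden_states):
            for _ in range(start, end):
                hidden_states = self.layer(hidden_states)
            return hidden_states
        return custom_forward

    def forward(self, hidden_states):
        l = 0
        while l < self.num_virtual_layers:
            hidden_states = torch.utils.checkpoint.checkpoint(
                self.custom(l, l + self.checkpoint_num_layers),
                hidden_states)
            l += self.checkpoint_num_layers
        return hidden_states

model = SharedLayerTransformer()
# requires_grad=True on the input so the (reentrant) checkpoint keeps the graph alive
x = torch.randn(2, 8, 64, requires_grad=True)
model(x).sum().backward()  # the shared layer's gradients accumulate across all six applications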

Expected behavior I expect training to run normally with activation checkpointing enabled.

Unexpected behavior

AssertionError: The parameter 97 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported

Additional context deepspeed version: 0.3.16

tjruwase commented 2 years ago

@iyupan, thanks for reporting this issue.

To help investigate this, can you please provide repro steps?

Also, please clarify the expected behavior in this case. Should each parameter's gradient be accumulated six times during each backward pass?
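For concreteness, here is a small plain-PyTorch illustration (no DeepSpeed involved; the nn.Linear layer and sizes are just placeholders) of the accumulation behavior I am asking about:

import torch
import torch.nn as nn

# When the same layer is applied six times in one forward pass, autograd
# accumulates six gradient contributions into the shared parameters during
# a single backward pass.
layer = nn.Linear(4, 4)
h = torch.randn(2, 4)
for _ in range(6):
    h = layer(h)  # same parameters reused at every step
h.sum().backward()

# layer.weight.grad now holds the sum of the six per-application gradients.
print(layer.weight.grad.shape)  # torch.Size([4, 4])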