Open · iyupan opened this issue 2 years ago
**Describe the bug**
I implement a multi-layer transformer that shares the parameters of a single layer (e.g., one layer is applied recursively six times to build a 6-layer transformer). When I enable activation checkpointing, an AssertionError is raised at line 631 of stage2.py.
**To Reproduce**
This is the code that I used to call checkpointing.
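A minimal sketch of the pattern described above (the class name, model dimensions, and the call to `deepspeed.checkpointing.checkpoint` are assumptions, not the reporter's original code):

```python
import torch.nn as nn
import deepspeed

class RecursiveTransformer(nn.Module):
    """Hypothetical reconstruction: one transformer layer applied six
    times, so all six 'layers' share a single set of parameters."""

    def __init__(self, d_model=512, nhead=8, num_passes=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead)  # one parameter set
        self.num_passes = num_passes

    def forward(self, x):
        for _ in range(self.num_passes):
            # Checkpoint each pass: activations are discarded in the
            # forward pass and recomputed during backward.
            x = deepspeed.checkpointing.checkpoint(self.layer, x)
        return x
```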
**Expected behavior**
I expect training to run normally.
**Unexpected behavior**
An AssertionError is raised at line 631 of stage2.py.
**Additional context**
DeepSpeed version: 0.3.16
@iyupan, thanks for reporting this issue.

To help investigate this, can you please provide repro steps?

Also, please clarify the expected behavior in this case: should each parameter's gradient be updated by accumulation six times for each backward pass?
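For reference, a tiny sketch (an illustration, not code from this issue) of why a parameter reused six times in one forward pass receives six accumulated gradient contributions from a single backward pass:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(1.0)
for _ in range(6):
    x = w * x  # the same shared parameter participates in every pass
x.backward()
print(w.grad)  # tensor(6.): d/dw (w**6) = 6 * w**5 = 6 at w = 1
```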