microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

Potential Memory Leak in GatingEncoder/Decoder of Fast_Dispatch #237

Closed. KimmiShi closed this issue 2 months ago.

KimmiShi commented 2 months ago

Hi, while using tutel's fast_dispatch, I noticed that when gradient accumulation is enabled, GPU memory consumption is significantly higher than when it is disabled, and the memory used grows with the number of gradient accumulation steps.

Upon reviewing the relevant code, I found that in the Encode/Decode implementation the input tensor is stored directly on ctx instead of being passed to save_for_backward. According to the PyTorch documentation, this practice may lead to a memory leak.

I attempted to fix this issue, and the memory consumption no longer grows with the number of gradient accumulation steps.
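For reference, the pattern being described looks roughly like the sketch below. This is a minimal, hypothetical illustration (the class names and the squaring operation are made up; the actual GatingEncoder/GatingDecoder code in tutel's fast_dispatch is more involved): the first Function stashes the input directly on ctx, while the second uses ctx.save_for_backward, which the PyTorch docs recommend so autograd can track and release the saved tensor together with the graph.

```python
import torch

class EncodeStoringOnCtx(torch.autograd.Function):
    # Problematic pattern: the input tensor is attached to ctx directly,
    # so it is kept alive outside autograd's saved-tensor bookkeeping.
    @staticmethod
    def forward(ctx, x):
        ctx.x = x
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        return 2 * ctx.x * grad_out


class EncodeUsingSaveForBackward(torch.autograd.Function):
    # Recommended pattern: save_for_backward lets autograd manage and free
    # the saved tensor when the backward graph is released.
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out
```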

ghostplant commented 2 months ago

Thanks for this finding!