microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

Potential Memory Leak in GatingEncoder/Decoder of Fast_Dispatch #237

Closed. KimmiShi closed this issue 2 months ago.

KimmiShi commented 2 months ago

Hi, while using tutel's fast_dispatch, I noticed that when gradient accumulation is enabled, GPU memory consumption is significantly higher than when it is disabled, and the memory used grows with the number of gradient accumulation steps.

Upon reviewing the relevant code, I found that in the Encode/Decode implementation the input tensor is stored directly on ctx instead of being passed to save_for_backward. According to the PyTorch documentation, this practice may lead to a memory leak.

I attempted to fix this issue, and the memory consumption no longer grows with the number of gradient accumulation steps.
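For reference, the pattern being described looks roughly like the sketch below. This is a minimal, hypothetical illustration (the class names and the squaring operation are made up; the actual GatingEncoder/GatingDecoder code in tutel's fast_dispatch is more involved): the first Function stashes the input directly on ctx, while the second uses ctx.save_for_backward, which the PyTorch docs recommend so autograd can track and release the saved tensor together with the graph.

```python
import torch

class EncodeStoringOnCtx(torch.autograd.Function):
    # Problematic pattern: the input tensor is attached to ctx directly,
    # so it is kept alive outside autograd's saved-tensor bookkeeping.
    @staticmethod
    def forward(ctx, x):
        ctx.x = x
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        return 2 * ctx.x * grad_out


class EncodeUsingSaveForBackward(torch.autograd.Function):
    # Recommended pattern: save_for_backward lets autograd manage and free
    # the saved tensor when the backward graph is released.
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out
```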

ghostplant commented 2 months ago

Thanks for this finding!