Open shijie-wu opened 1 year ago
CC @pacman100
The diff should be small and I am happy to work on this.
Hello @shijie-wu, thank you for bringing this to our notice. Do you have any measurements of how much GPU memory it saves, how much CPU memory usage goes up, and the hit on throughput due to CPU <-> GPU data movement? Also, since you mentioned you're interested in adding this, looking forward to your PR 🤗.
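For reference, something along these lines would be enough to compare the two setups. This is only a rough sketch; `NUM_STEPS`, `train_step`, `model`, and `batch` are placeholders for the actual training loop.

```python
import time

import psutil  # only used to report CPU RSS
import torch

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()

for _ in range(NUM_STEPS):        # placeholder training loop
    train_step(model, batch)

torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"peak GPU mem: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
print(f"CPU RSS:      {psutil.Process().memory_info().rss / 2**30:.2f} GiB")
print(f"throughput:   {NUM_STEPS / elapsed:.2f} steps/s")
```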
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@muellerzr any chance this could be revived? It probably needs to go into transformers, though?
Support whole-model activation offloading with FSDP, working in conjunction with activation checkpointing, via
https://github.com/pytorch/pytorch/blob/e9ebda29d87ce0916ab08c06ab26fd3766a870e5/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py#L171-L191

Since `apply_activation_checkpointing` does not wrap the overall root module, wrapping the root module with this could offload the activations saved between layers, thus releasing more GPU memory. The diff should be small and I am happy to work on this.
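To make the idea concrete, here is a rough sketch, not a final patch. It assumes the linked lines correspond to the `offload_wrapper` helper in `checkpoint_wrapper.py`, that the process group / FSDP setup is already done, and the tiny model below is just a stand-in for a real transformer:

```python
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
    offload_wrapper,
)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Placeholder model: a stack of transformer blocks.
model = nn.Sequential(
    *[nn.TransformerEncoderLayer(d_model=1024, nhead=16) for _ in range(8)]
)
model = FSDP(model.cuda())  # requires an initialized process group

# 1) Checkpoint every block: intra-block activations are dropped in
#    forward and recomputed in backward.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=checkpoint_wrapper,
    check_fn=lambda m: isinstance(m, nn.TransformerEncoderLayer),
)

# 2) apply_activation_checkpointing never wraps the root module itself, so the
#    activations saved *between* blocks (each block's input) still sit on GPU.
#    Wrapping the root with the offload wrapper moves those saved tensors to
#    pinned CPU memory during forward and copies them back for backward.
model = offload_wrapper(model)
```

With this, only the activations at the boundaries between checkpointed blocks are kept, and they live in pinned CPU memory until backward needs them, trading extra host <-> device transfers for lower peak GPU memory.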