huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

[FSDP] support activation offloading with FSDP #2038

Open shijie-wu opened 12 months ago

shijie-wu commented 12 months ago

Support whole-model activation offloading with FSDP - working in conjunction with activation checkpointing - via

https://github.com/pytorch/pytorch/blob/e9ebda29d87ce0916ab08c06ab26fd3766a870e5/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py#L171-L191

Since apply_activation_checkpointing does not wrap the overall root module, wrapping the root module with this could offload the activations kept between layers and thus free more GPU memory. The diff should be small and I am happy to work on this.
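For context, a minimal sketch of the idea (not an accelerate API; `prepare_model` and `transformer_layer_cls` are placeholders for illustration) might look like this, using the `offload_wrapper` from the linked file on top of the usual FSDP + activation-checkpointing setup:

```python
# Sketch only: wrap the FSDP root with torch's offload_wrapper so the
# activations retained between checkpointed blocks are saved to CPU.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
    offload_wrapper,
)

def prepare_model(model: torch.nn.Module, transformer_layer_cls):
    # Shard the model with FSDP as usual (assumes torch.distributed is initialized).
    model = FSDP(model, use_orig_params=True)

    # Per-layer activation checkpointing: recompute activations inside each
    # transformer block during backward instead of storing them.
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda m: isinstance(m, transformer_layer_cls),
    )

    # The proposal: additionally wrap the *root* module with offload_wrapper,
    # which uses torch.autograd.graph.save_on_cpu under the hood, so the
    # activations that survive between checkpointed blocks live on CPU and are
    # copied back to GPU only for the backward pass.
    model = offload_wrapper(model)
    return model
```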

muellerzr commented 12 months ago

CC @pacman100

pacman100 commented 10 months ago

The diff should be small and I am happy to work on this.

Hello @shijie-wu, thank you for bringing this to our notice. Do you have any measurements of how much GPU memory it saves, how much CPU memory usage goes up, and the hit on throughput due to CPU <-> GPU data movement? Also, since you mentioned you're interested in adding this, looking forward to your PR 🤗.
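For gathering those numbers, a rough measurement helper (a sketch only; `train_step` is a placeholder for one forward/backward/optimizer step, not an accelerate or torch API) could be:

```python
# Hypothetical helper: compare peak GPU memory, host RSS, and step throughput
# with and without offload_wrapper around the FSDP root.
import time
import psutil
import torch

def measure(train_step, n_steps: int = 10) -> dict:
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    for _ in range(n_steps):
        train_step()  # placeholder: one training step
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return {
        "peak_gpu_mem_gib": torch.cuda.max_memory_allocated() / 2**30,
        "host_rss_gib": psutil.Process().memory_info().rss / 2**30,
        "steps_per_sec": n_steps / elapsed,
    }
```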

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

winglian commented 1 week ago

@muellerzr any chance this could be revived? It probably needs to go into transformers though?