Open shijie-wu opened 1 year ago
CC @pacman100
The diff should be small and I am happy to work on this.
Hello @shijie-wu, thank you for bringing this to our notice. Do you have any measurements of how much GPU memory it saves, how much CPU memory usage goes up, and the hit on throughput due to CPU <-> GPU data movement? Also, since you mentioned you're interested in adding this, looking forward to your PR 🤗.
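For reference, something along these lines would be enough to compare the two setups. This is only a rough sketch; `NUM_STEPS`, `train_step`, `model`, and `batch` are placeholders for the actual training loop.

```python
import time

import psutil  # only used to report CPU RSS
import torch

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()

for _ in range(NUM_STEPS):        # placeholder training loop
    train_step(model, batch)

torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"peak GPU mem: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
print(f"CPU RSS:      {psutil.Process().memory_info().rss / 2**30:.2f} GiB")
print(f"throughput:   {NUM_STEPS / elapsed:.2f} steps/s")
```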
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@muellerzr any chance this could be revived? It probably needs to go into transformers, though?
Support whole-model activation offloading with FSDP, working in conjunction with activation checkpointing, via
https://github.com/pytorch/pytorch/blob/e9ebda29d87ce0916ab08c06ab26fd3766a870e5/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py#L171-L191

Since `apply_activation_checkpointing` does not wrap the overall root module, wrapping the root module with this could offload the activations saved between layers, thus releasing more GPU memory. The diff should be small and I am happy to work on this.
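To make the idea concrete, here is a rough sketch, not a final patch. It assumes the linked lines correspond to the `offload_wrapper` helper in `checkpoint_wrapper.py`, that the process group / FSDP setup is already done, and the tiny model below is just a stand-in for a real transformer:

```python
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
    offload_wrapper,
)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Placeholder model: a stack of transformer blocks.
model = nn.Sequential(
    *[nn.TransformerEncoderLayer(d_model=1024, nhead=16) for _ in range(8)]
)
model = FSDP(model.cuda())  # requires an initialized process group

# 1) Checkpoint every block: intra-block activations are dropped in
#    forward and recomputed in backward.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=checkpoint_wrapper,
    check_fn=lambda m: isinstance(m, nn.TransformerEncoderLayer),
)

# 2) apply_activation_checkpointing never wraps the root module itself, so the
#    activations saved *between* blocks (each block's input) still sit on GPU.
#    Wrapping the root with the offload wrapper moves those saved tensors to
#    pinned CPU memory during forward and copies them back for backward.
model = offload_wrapper(model)
```

With this, only the activations at the boundaries between checkpointed blocks are kept, and they live in pinned CPU memory until backward needs them, trading extra host <-> device transfers for lower peak GPU memory.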