peterjc123 opened this issue 7 months ago
cc @younesbelkada
Just putting myself out here--I'd be happy to hear requests/requirements for design/support from the PyTorch side!
Hi!
Thanks so much @janeyx99 ! 🙏
Currently our approach is to use post-accumulate gradient hooks (Tensor.register_post_accumulate_grad_hook): https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1300-L1309, paired with dummy optimizers that are no-ops during step() (idea borrowed from @hiyouga). Per my understanding, this does not play well with some training schemes such as distributed training, so we would probably need some help on the PyTorch side to support that (cc also @jiaweizzhao as we've been discussing this as well).
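For reference, here is a minimal sketch of that hook-based pattern (my own illustration, not the exact Trainer code; it assumes plain AdamW per parameter, a single GPU, and no gradient accumulation):

```python
import torch

model = torch.nn.Linear(128, 128)

# One small optimizer per parameter; each one steps as soon as that
# parameter's gradient has finished accumulating during backward.
optimizer_dict = {
    p: torch.optim.AdamW([p], lr=1e-4)
    for p in model.parameters()
    if p.requires_grad
}

def optimizer_hook(param):
    # Step and free the gradient immediately, so gradients for the whole
    # model never need to be resident at the same time.
    optimizer_dict[param].step()
    optimizer_dict[param].zero_grad()

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)

# Training step: no explicit optimizer.step() here -- the hooks do the
# stepping. (In the Trainer, a dummy optimizer whose step() is a no-op
# keeps the rest of the training loop unchanged.)
x = torch.randn(4, 128)
loss = model(x).sum()
loss.backward()
```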
@younesbelkada Yes, it'll be interesting to discuss the currently known pain points/requirements. Some questions that have already come up are around DDP (which keeps buckets of gradients around so they can be reduced once all the data has been processed) and gradient accumulation (which keeps the gradients alive across fwd-bwd iterations until an optimizer step). In both of these cases, layer-wise optimizers will not save memory (a rough sketch of the gradient-accumulation case is below). A side note for GaLore, though: since its gradients should be smaller, the buffers that previously held full-sized grads should be able to shrink accordingly.
Beyond those two instances, we should theoretically be able to compose this technique with FSDP and all other use cases where the gradients can be freed right after the optimizer update. I'm curious where people have run into problems.
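To make the gradient-accumulation point concrete, here is a rough sketch (my own illustration, reusing the hook pattern from the first example): the hook fires after every backward(), so under accumulation it either steps once per micro-batch (changing the training semantics) or has to be guarded, in which case the gradients stay resident across micro-batches and the memory saving disappears.

```python
import torch

model = torch.nn.Linear(128, 128)
optimizer_dict = {p: torch.optim.AdamW([p], lr=1e-4) for p in model.parameters()}
accumulation_steps = 4
is_last_micro_batch = False  # toggled by the training loop below

def guarded_optimizer_hook(param):
    # Under gradient accumulation the hook must be guarded; the accumulated
    # gradients then stay resident across micro-batches, so the layer-wise
    # scheme no longer reduces peak gradient memory.
    if is_last_micro_batch:
        optimizer_dict[param].step()
        optimizer_dict[param].zero_grad()

for p in model.parameters():
    p.register_post_accumulate_grad_hook(guarded_optimizer_hook)

x = torch.randn(4, 128)
for step in range(accumulation_steps):
    is_last_micro_batch = step == accumulation_steps - 1
    (model(x).sum() / accumulation_steps).backward()
```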
@janeyx99 @peterjc123 I guess this is similar to what BAdam is doing: #30308
Feature request
Context: https://github.com/huggingface/transformers/pull/29588#discussion_r1523510004
Motivation
Layer-wise optimizers are not GaLore-specific. We could apply the same idea to generic optimizers to save memory. For example, the 8-bit Adam optimizer paired with layer-wise optimization sounds like a pretty good option to me.
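As an illustration of what that pairing could look like (a sketch only, assuming bitsandbytes is installed and reusing the same hook pattern as the earlier example; the hyperparameters are placeholders):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096)

# Same layer-wise hook pattern as before, but with 8-bit Adam as the
# per-parameter optimizer, so optimizer states are quantized as well.
optimizer_dict = {
    p: bnb.optim.Adam8bit([p], lr=1e-4)
    for p in model.parameters()
    if p.requires_grad
}

def optimizer_hook(param):
    optimizer_dict[param].step()
    optimizer_dict[param].zero_grad()

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)
```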
Your contribution
I'll give it a try once it is supported.