huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Feature Request] add layer-wise optimizers #29732

Open peterjc123 opened 7 months ago

peterjc123 commented 7 months ago

Feature request

Context: https://github.com/huggingface/transformers/pull/29588#discussion_r1523510004

Motivation

Layer-wise optimizers are not GaLore-specific. We could apply them to generic optimizers to save memory. For example, the 8-bit Adam optimizer paired with layer-wise optimization sounds like a pretty good option to me.
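For concreteness, here is a minimal sketch of what that could look like using PyTorch's `register_post_accumulate_grad_hook` (available since 2.1) together with bitsandbytes' `AdamW8bit`. The model and hyperparameters are placeholders, and this is not a proposed Trainer API:

```python
# Sketch of layer-wise 8-bit Adam: each parameter gets its own tiny optimizer,
# and a post-accumulate-grad hook steps it as soon as that parameter's gradient
# is ready, so the full set of gradients never has to stay alive at once.
import torch
import bitsandbytes as bnb  # assumed available; 8-bit optimizers need CUDA params

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model

# one optimizer per parameter
optimizers = {
    p: bnb.optim.AdamW8bit([p], lr=1e-4)
    for p in model.parameters() if p.requires_grad
}

def optimizer_hook(param):
    optimizers[param].step()
    param.grad = None  # free the gradient right away

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)
```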

Your contribution

I'll give it a try once it is supported.

amyeroberts commented 7 months ago

cc @younesbelkada

janeyx99 commented 7 months ago

Just putting myself out here--I'd be happy to hear requests/requirements for design/support from the PyTorch side!

younesbelkada commented 7 months ago

Hi! Thanks so much @janeyx99 ! 🙏 Our current approach uses post-accumulate-gradient hooks (`register_post_accumulate_grad_hook`): https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L1300-L1309 together with dummy optimizers whose `step()` is a no-op (idea borrowed from @hiyouga). As far as I understand, this does not play well with some training schemes such as distributed training, so we would probably need some help on the PyTorch side to support that (cc also @jiaweizzhao, as we've been discussing this too).
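For reference, a simplified sketch of that pattern (not the actual Trainer code; the names `DummyOptimizer` and `attach_layer_wise_optimizers` are illustrative): the real per-parameter optimizers run inside the hooks, while the optimizer handed to the training loop is a dummy whose `step()` does nothing.

```python
import torch

class DummyOptimizer(torch.optim.Optimizer):
    """Keeps the training loop happy; the real updates happen in the hooks."""
    def __init__(self, params):
        super().__init__(params, defaults={"lr": 0.0})

    def step(self, closure=None):
        pass  # no-op: per-parameter optimizers already stepped in the hooks

    def zero_grad(self, set_to_none=True):
        pass  # no-op: gradients are freed inside the hooks


def attach_layer_wise_optimizers(model, optimizer_cls=torch.optim.AdamW, **kwargs):
    # one real optimizer per trainable parameter
    per_param = {
        p: optimizer_cls([p], **kwargs)
        for p in model.parameters() if p.requires_grad
    }

    def optimizer_hook(param):
        per_param[param].step()
        param.grad = None  # release the gradient immediately after the update

    for p in per_param:
        p.register_post_accumulate_grad_hook(optimizer_hook)

    return DummyOptimizer(list(model.parameters()))
```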

janeyx99 commented 7 months ago

@younesbelkada Yes, it'll be interesting to discuss the currently known pain points/requirements. Some questions that have already come up are around DDP (which keeps buckets of gradients around to accumulate once all the data has been processed) and gradient accumulation (which keeps the gradients across fwd-bwd iterations until an optimizer step). In both of these cases, layer-wise optimizers will not save memory. A side note for GaLore, though: since the gradients should be smaller, the buffers that previously held full-sized gradients should be able to shrink as well.
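To make the gradient-accumulation point concrete, here is a small illustrative sketch (the helper name and counter are made up): the hook can only step and free the gradient once the whole accumulation window has finished, so the `.grad` buffer has to stay alive across all the micro-batches and no memory is saved.

```python
import torch

def attach_accumulating_hook(param, optimizer, accum_steps):
    # register_post_accumulate_grad_hook fires once per backward pass,
    # after the new gradient has been accumulated into param.grad
    state = {"backward_calls": 0}

    def hook(p):
        state["backward_calls"] += 1
        if state["backward_calls"] % accum_steps == 0:
            optimizer.step()   # update from the accumulated gradient
            p.grad = None      # only now can the gradient buffer be freed
        # on all other calls p.grad must be kept around, so no memory is saved

    param.register_post_accumulate_grad_hook(hook)
```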

Beyond those two cases, we should theoretically be able to compose this technique with FSDP and all the other use cases where gradients can be freed right after the optimizer update. I'm curious where else people have run into problems.

acwme111 commented 6 months ago

@janeyx99 @peterjc123 I guess this is similar to what BAdam is doing: #30308