That optimizer also misses some torch-specific expectations, like learning rate scheduler support, and some of the more off-the-wall optimizers just don't work without modification. The distributed ZeRO Shampoo optimizer, for example, is a WIP technical prototype and not meant to be used in production.
The SOAP one worked as expected for me; it just needs a closure passed to its `step()`. See here: https://github.com/bghira/SimpleTuner/blob/main/helpers/training/optimizers/soap/__init__.py
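Roughly like this (a minimal sketch; the class name and constructor arguments below are assumptions, so check the linked file for the exact API):

```python
# Minimal sketch of the closure-based step. Assumes the SimpleTuner helper
# exports a class named `SOAP`; class name and constructor args are assumptions.
import torch
from helpers.training.optimizers.soap import SOAP  # path inside the SimpleTuner repo

model = torch.nn.Linear(16, 4)
optimizer = SOAP(model.parameters(), lr=3e-3)

inputs = torch.randn(8, 16)
targets = torch.randn(8, 4)

def closure():
    # Standard PyTorch closure convention: clear grads, recompute the loss,
    # backprop, and return the loss so step() can use it.
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    return loss

optimizer.step(closure)
```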
Other problems with the original optimizer implementations linked are that they don't work with torch.compile and have very slow performance (worst in SOAP) and high memory overhead (also worst in SOAP), even with ZeRO offload.
Hey,
I have been using Meta's implementation of Distributed Shampoo and am seeing ~20% faster convergence of transformer-based models compared to AdamW. Simo Ryu has done some nice investigations into the advantages of Shampoo.
I am looking to use Shampoo and SOAP as optimizers in Accelerate, but their current implementations introduce some breaking changes.
Focusing on Shampoo for now:
Distributed Shampoo disabled `state_dict` and `load_state_dict` in favor of a custom `distributed_state_dict` and `load_distributed_state_dict`, both of which require the model's `named_parameters()` to be passed in as an argument. More info as to why here.

I have a hacky commit here to patch `accelerate/optimizers`. However, I am still forced to bypass `accelerate.save()` and use `dist_checkpoint.save_state_dict()` directly, since the optimizer entry in the state_dict needs access to the model's `named_parameters()`. You can see this here in my e2-tts training code. I am able to save the model weights, but am not yet able to load them again when using Accelerate. This is where I am currently lost.
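Roughly, the save/load flow looks like this (a minimal sketch; the `key_to_param` kwarg name follows Meta's examples as far as I can tell, so treat the exact names as assumptions):

```python
# Sketch of checkpointing with Distributed Shampoo's custom state-dict API.
# `model` and `optimizer` are the prepared model and the DistributedShampoo
# instance from the training script; kwarg names are assumptions from Meta's repo.
import torch.distributed.checkpoint as dist_checkpoint

CHECKPOINT_DIR = "checkpoints/step_1000"

# Saving: the optimizer state requires the model's named_parameters().
state_dict = {
    "model": model.state_dict(),
    "optim": optimizer.distributed_state_dict(key_to_param=model.named_parameters()),
}
dist_checkpoint.save_state_dict(
    state_dict=state_dict,
    storage_writer=dist_checkpoint.FileSystemWriter(CHECKPOINT_DIR),
)

# Loading: load_state_dict fills the dict in place, then the optimizer restores
# its state via load_distributed_state_dict, again needing named_parameters().
dist_checkpoint.load_state_dict(
    state_dict=state_dict,
    storage_reader=dist_checkpoint.FileSystemReader(CHECKPOINT_DIR),
)
model.load_state_dict(state_dict["model"])
optimizer.load_distributed_state_dict(
    state_dict["optim"], key_to_param=model.named_parameters()
)
```

This is the part I'd like Accelerate's save/load machinery to handle for me.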
Also, since I don't have access to the `named_parameters` until `accelerate.prepare_model()` is called, the Shampoo optimizer needs to be defined inside the model definition, which makes it awkward to switch between optimizers (see here). Ideally I'd be able to do something like the sketch below, where I pass in the optimizer just as I can with AdamW.
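Something along these lines (entirely illustrative; the DistributedShampoo import path and constructor kwargs are my best guess at Meta's API, not verified here):

```python
# The workflow I'd like: construct the optimizer up front, exactly like AdamW,
# and let accelerator.prepare() do the rest. Import path and kwargs are assumptions.
import torch
from accelerate import Accelerator
from distributed_shampoo import DistributedShampoo  # assumed import path

accelerator = Accelerator()
model = torch.nn.Linear(16, 4)  # stand-in for the real transformer model

optimizer = DistributedShampoo(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    epsilon=1e-8,
    weight_decay=0.01,
)

# Ideally Accelerate would hide the distributed_state_dict /
# load_distributed_state_dict plumbing behind save_state() / load_state().
model, optimizer = accelerator.prepare(model, optimizer)
```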
Of course, when I set everything up with plain torch DDP instead of Accelerate, everything works as intended :/
What would be the best approach for Accelerate to support these custom optimizers (ones not part of torch)? My current plan is to write a ShampooPlugin along the lines of the DeepSpeedPlugin (rough sketch below), but it would be nice if the Shampoo optimizer could be detected automatically without having to change the accelerate config. I am willing to put in the work to solve this so more projects can benefit from using these new optimizers with Accelerate.
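Very roughly, what I have in mind for the plugin (purely hypothetical; the class, fields, and method names are mine, loosely modelled on how DeepSpeedPlugin defers setup until `prepare()` time):

```python
# Hypothetical ShampooPlugin sketch; nothing here exists in accelerate today.
# The plugin holds the Shampoo hyperparameters and only builds the optimizer
# once the prepared model (and its named_parameters()) is available.
from dataclasses import dataclass

@dataclass
class ShampooPlugin:
    lr: float = 3e-4
    max_preconditioner_dim: int = 8192   # kwarg names assumed from Meta's repo
    precondition_frequency: int = 100

    def build_optimizer(self, model):
        from distributed_shampoo import DistributedShampoo  # assumed import path
        return DistributedShampoo(
            model.parameters(),
            lr=self.lr,
            max_preconditioner_dim=self.max_preconditioner_dim,
            precondition_frequency=self.precondition_frequency,
        )
```

Accelerate could then build the optimizer via the plugin inside `prepare()` and route `save_state()`/`load_state()` through `distributed_state_dict`/`load_distributed_state_dict`.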
Any guidance would be much appreciated. :)