Open fzyzcjy opened 1 week ago
Hi, thanks for the algorithm! I wonder whether SOAP or Shampoo is too memory-hungry to run on a single 24GB 4090 for finetuning 0.5B ~ 1.5B LLMs (say, llama3.2 1B). The paper mentions a 2M-token batch size, so I guess it is mainly aimed at setups with a ton of GPUs. So I wonder whether it works on, or is designed for, smaller-scale hardware.
I tried it on my task and SOAP indeed makes the model converge faster on my 4090 GPU. The memory consumption is a little higher than Adam's but still lower than Distributed Shampoo's. The SOAP hyperparameters I used are:
lr: 1.0e-3
weight_decay: 1.0e-6
betas: [0.9, 0.999]
The others are left at their defaults.
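For context, a minimal sketch of how these settings could be passed to the optimizer, assuming the `SOAP` class from the authors' reference implementation (the `soap` import path and the toy model are placeholders):

```python
# Minimal sketch, assuming the `SOAP` class from the authors' reference
# implementation; the import path and the toy model are placeholders.
import torch
from soap import SOAP  # assumed module name

model = torch.nn.Linear(256, 256)  # placeholder model

optimizer = SOAP(
    model.parameters(),
    lr=1.0e-3,
    betas=(0.9, 0.999),
    weight_decay=1.0e-6,
    # all other arguments (precondition_frequency, max_precond_dim, ...) left at their defaults
)
```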
@294coder Thank you! May I know how large your model is?
Oh, sorry, I am working on a vision task, so the model is pretty small. My model has 22M parameters and the batch size is set to 8. But SOAP still works for me.
If you're working on LLMs with large models, 24GB of memory may still be too little. Let's wait for the authors to answer.
Cheers.
@294coder Thanks for the information!
It is definitely more memory-hungry. If it does not fit, I would recommend lowering max_precond_dim (setting it to 0 recovers standard Adam). I would also recommend trying the Muon optimizer.
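A sketch of what lowering it could look like, reusing the assumed constructor from the earlier snippet (the specific value here is illustrative, not a recommendation):

```python
# Sketch only, assuming the same SOAP constructor as in the earlier snippet;
# the max_precond_dim value is illustrative, not a recommendation.
import torch
from soap import SOAP  # assumed import path

model = torch.nn.Linear(256, 256)  # placeholder model

optimizer = SOAP(
    model.parameters(),
    lr=1.0e-3,
    max_precond_dim=2048,  # lower than the default to shrink the preconditioners and save memory
    # max_precond_dim=0 would recover standard Adam, as noted above
)
```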
@nikhilvyas Thank you! Do you have any suggestions for the small-batch-size scenario?