nikhilvyas / SOAP

MIT License

Can it be used on 24GB cards? #10

Open fzyzcjy opened 1 week ago

fzyzcjy commented 1 week ago

Hi, thanks for the algorithm! I wonder whether SOAP or Shampoo is too memory-hungry for this: can it be run on a single 24GB 4090 to finetune a 0.5B ~ 1.5B LLM (say, llama3.2 1B)? The paper mentions a 2M-token batch size, so I guess it is mainly aimed at scenarios with a ton of GPUs. So I wonder whether it works on / is designed for smaller-scale hardware.
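
For a rough sense of scale, here is a back-of-envelope sketch of optimizer-state memory for a ~1B-parameter model. The per-parameter state counts are generic assumptions about Adam, not measurements from this repo, and SOAP's additional per-layer preconditioners are not included since their size depends on layer shapes:

```python
# Back-of-envelope: optimizer-state memory for ~1B parameters in fp32.
# Assumes 4 bytes per value and two Adam moment buffers (exp_avg, exp_avg_sq).
# Actual usage also includes weights, gradients, activations, and, for SOAP,
# per-layer preconditioner matrices whose size depends on layer dimensions.
n_params = 1.0e9
bytes_per_value = 4
adam_moment_buffers = 2

adam_state_gb = n_params * adam_moment_buffers * bytes_per_value / 1e9
print(f"Adam moment buffers alone: ~{adam_state_gb:.0f} GB")  # ~8 GB
```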

294coder commented 3 days ago

I tried it on my task, and SOAP indeed makes the model converge faster on my 4090 GPU. The memory consumption is a little higher than Adam's but still lower than Distributed Shampoo's. The SOAP parameters I tried are:

  lr: 1.0e-3
  weight_decay: 1.0e-6
  betas: [0.9, 0.999]

Everything else is left at the defaults.
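
A minimal sketch of passing these settings to the optimizer, assuming the SOAP class from soap.py in this repo accepts lr, betas, and weight_decay keyword arguments (the toy model and data below are placeholders):

```python
import torch
from soap import SOAP  # assumes soap.py from this repo is on the Python path

model = torch.nn.Linear(512, 512)  # placeholder model

# Hyperparameters reported above; all other SOAP arguments left at their defaults.
optimizer = SOAP(
    model.parameters(),
    lr=1.0e-3,
    betas=(0.9, 0.999),
    weight_decay=1.0e-6,
)

# One standard training step with dummy data.
loss = model(torch.randn(8, 512)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```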

fzyzcjy commented 3 days ago

@294coder Thank you! May I know how large your model is?

294coder commented 3 days ago

Oh, sorry, I am working on a vision task, so the model is pretty small: 22M parameters, with the batch size set to 8. But SOAP still works for me.

294coder commented 3 days ago

If you're working on LLMs with large models, 24 GB of memory may still be too small. Let's wait for the authors to answer.

Cheers.

fzyzcjy commented 3 days ago

@294coder Thanks for the information!

nikhilvyas commented 3 days ago

It is definitely more memory hungry. If it does not fit, I would recommend lowering max_precond_dim (setting it to 0 recovers standard Adam). I would also recommend trying the Muon optimizer.
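
A sketch of what that might look like, assuming the SOAP constructor exposes a max_precond_dim argument as described (the cap value below is illustrative, not a recommendation):

```python
import torch
from soap import SOAP  # assumes soap.py from this repo is importable

model = torch.nn.Linear(4096, 4096)  # placeholder layer

# Dimensions larger than max_precond_dim are not preconditioned, which shrinks
# the optimizer state; per the comment above, max_precond_dim=0 recovers Adam.
optimizer = SOAP(
    model.parameters(),
    lr=3e-3,
    max_precond_dim=2048,  # illustrative cap; lower it until it fits in memory
)
```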

fzyzcjy commented 3 days ago

@nikhilvyas Thank you! Do you have any suggestions for the small-batch-size setting?