ChaoChungWu-Johnson opened 1 year ago
Hi @mrwyattii, any idea about the index error when using LoRA? Or what would be the best practice for training 12B models? I also failed to run the example run_13b.sh successfully.
Maybe I'm wrong, but I think the real resource requirements are much bigger than stated in the docs.
I used a node with 4× 2080 Ti (22 GB) to train the 1.3B model; stage 1 took 32 hours and the GPUs' VRAM was almost full, and that was with a batch size of 4, while the example sets it to 8. I also tried Colab with an A100 40 GB, and even with per_device_train_batch_size=1 and gradient checkpointing it threw an OOM.
But per the docs, the 1.3B training can be done on a single 48 GB A6000 within 2 hours. Is that actually possible, or did I just mess up some important setting?
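For reference, gradient checkpointing on a Hugging Face model is typically enabled like this (a minimal sketch assuming an OPT-1.3b-style model as in the examples; the actual flag handling in main.py may differ):

```python
from transformers import AutoModelForCausalLM

# Illustrative only: trade compute for memory by recomputing
# activations during the backward pass instead of storing them.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
model.gradient_checkpointing_enable()
# Key/value caching is incompatible with checkpointing during training.
model.config.use_cache = False
```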
Describe the bug
Hi, I was trying to fine-tune the pythia-12b model via the following code in DeepSpeed-Chat's step 1. main.py is from
DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py
and according to ZeRO-3's memory estimation, the fine-tuning should only take resources like the estimate below.
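For context, an estimate like this can be produced with DeepSpeed's built-in ZeRO-3 estimator (a minimal sketch; it only needs the model loaded on CPU):

```python
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-12b")
# Prints the per-GPU / per-node memory needed for ZeRO-3 model states
# (params, gradients, optimizer states) under different offload options.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
```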
Since I have 720 GB of CPU RAM and 8× 32 GB V100 GPUs in total, this spec looks sufficient to run even with such a small batch size (only 1 now), but I still got an OOM error, with memory usage close to 100% (30–31 GB of 32 GB) on each GPU. Any idea why it consumes so much memory?
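Worth noting: the ZeRO-3 estimator only accounts for model states (parameters, gradients, optimizer states), not activations or temporary buffers, so real usage will always be higher than the estimate. One common mitigation is offloading parameters and optimizer states to CPU RAM in the ZeRO config, roughly like this (a sketch using standard DeepSpeed config keys; if I recall correctly, the DeepSpeed-Chat scripts expose something similar via an --offload flag):

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Push parameters and optimizer states into CPU RAM to free
        # GPU memory, at the cost of PCIe transfer overhead.
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```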
Another alternative I tried to deal with the OOM: replacing --gradient_checkpointing with --only_optimize_lora. But this resulted in an index error, another bug I guess. The error message is quite long; I'll paste it in the additional context.
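Conceptually, --only_optimize_lora freezes everything except the injected LoRA weights, roughly like the illustrative sketch below (the actual helper in DeepSpeed-Chat's utils and its parameter naming may differ):

```python
import torch.nn as nn

def freeze_all_but_lora(model: nn.Module) -> nn.Module:
    # Hypothetical illustration: keep gradients only for parameters whose
    # name marks them as LoRA weights, so the optimizer holds states for
    # the small LoRA matrices alone.
    for name, param in model.named_parameters():
        param.requires_grad = "lora_" in name
    return model
```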
To Reproduce
Steps to reproduce the behavior: just run the code with the environment and settings above.
Expected behavior
Training should complete successfully.
ds_report output
System info:
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else? Yes, with the deepspeed launcher.
Docker context
Are you using a specific Docker image that you can share? No.
Additional context
Index error: I tried using --only_optimize_lora instead of --gradient_checkpointing. The message is long, so I paste the latter part of it and cut the repeated messages from the other subprocesses; if you need the whole message, please tell me!