microsoft / DeepSpeedExamples

Example models using DeepSpeed

Unable to load four 7B-size models in step 3 #480

Open Mr-lonely0 opened 1 year ago

Mr-lonely0 commented 1 year ago

My actor model and critic model are both 7B. When I run step 3, it runs out of memory, consuming about 225 GB of memory. Is there any way to reduce memory consumption without changing the model size?

puyuanOT commented 1 year ago

I guess you could use LoRA, gradient checkpointing, a lower batch size with gradient accumulation, offloading the reference model to CPU RAM, or a smaller max_length.
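For reference, here is a minimal, hypothetical sketch of how those suggestions map onto standard DeepSpeed / transformers / peft knobs. This is not the DeepSpeed-Chat step-3 script itself; the model name and hyperparameter values are placeholders, and exact argument names can vary across library versions:

```python
import deepspeed
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# ZeRO-3 with parameter/optimizer offload trades GPU memory for host RAM.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # lower per-device batch size ...
    "gradient_accumulation_steps": 8,      # ... compensated by accumulation
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # placeholder checkpoint
model.gradient_checkpointing_enable()  # recompute activations instead of storing them

# LoRA: freeze the 7B base weights and train only small adapter matrices.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config,
)
# Shorter max prompt/answer lengths also cut activation memory.
```

Note that CPU offload reduces GPU memory pressure at the cost of more host RAM, so it may not help if host RAM is the bottleneck.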

Mr-lonely0 commented 1 year ago

> I guess you could use LoRA, gradient checkpointing, a lower batch size with gradient accumulation, offloading the reference model to CPU RAM, or a smaller max_length.

In fact, I ran out of CPU memory, not GPU memory. Are there any other solutions to this problem?

woqiang0515 commented 1 year ago

@Mr-lonely0 I tried training the actor and critic models with LLaMA. Training in step 3 crashed at torch.load(critic_model), which I guess is an out-of-memory problem. Have you solved the problem you mentioned?
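For what it's worth, one common reason torch.load exhausts host RAM in multi-GPU runs is that every data-parallel rank deserializes the full checkpoint into CPU memory at the same time. A purely illustrative mitigation sketch (not taken from this repo's code; load_checkpoint_rank0 is a hypothetical helper, and mmap=True requires a recent PyTorch):

```python
import torch
import torch.distributed as dist

def load_checkpoint_rank0(path):
    """Illustrative helper: deserialize the checkpoint on rank 0 only,
    so N ranks do not each hold a full copy of the weights in host RAM."""
    if not dist.is_initialized() or dist.get_rank() == 0:
        # map_location="cpu" avoids an immediate GPU copy; mmap=True
        # (PyTorch >= 2.1) memory-maps the file instead of reading it
        # fully into RAM before handing tensors out.
        return torch.load(path, map_location="cpu", mmap=True)
    return None  # non-zero ranks receive the weights later via broadcast/ZeRO
```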