microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] DeepSpeed always loads the whole model onto each GPU, then OOM #4807

Open aohan237 opened 11 months ago

aohan237 commented 11 months ago

I use the Hugging Face TRL SFTTrainer with PEFT and DeepSpeed to train a 6B model. I have 4 x 12 GB GPUs.

When I use AutoModel.from_pretrained(device_map="auto"), it works, but training is very slow and the GPUs only reach about 25% utilization.

So I tried DeepSpeed, but device_map does not work together with DeepSpeed, so I removed it.

I copied a stage 2 config from the Hugging Face tutorial and then tried to run with DeepSpeed:

{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "weight_decay": "auto",
      "torch_adam": true,
      "adam_w_mode": true
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto",
      "total_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": "auto",
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

I assumed that DeepSpeed would shard a model that cannot fit on a single GPU.

But DeepSpeed always tries to load the whole model onto each GPU; it does not shard the model at load time, and it always ends in OOM.

I suspect that even when training a bigger model such as 130B, where no single GPU's VRAM can hold the whole model, you would still have to shard it.

So it should work, but it does not.

Can you tell me why? Or is there anything I need to know?

ShukantPal commented 9 months ago

"zero_optimization": { "stage": 2,

ZeRO stage 2 does not partition the model parameters (only the optimizer states and gradients, which are training-specific). DeepSpeed will still load the full set of model parameters on each GPU for the forward pass.

You need to enable ZeRO stage 3 so that DeepSpeed partitions the model itself across your 4 GPUs.
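
For reference, a minimal sketch of what the zero_optimization block could look like at stage 3, adapted from the stage 2 config above (the stage3_* values below are illustrative defaults, not settings taken from this thread):

"zero_optimization": {
  "stage": 3,
  "overlap_comm": true,
  "contiguous_gradients": true,
  "reduce_bucket_size": "auto",
  "stage3_prefetch_bucket_size": "auto",
  "stage3_param_persistence_threshold": "auto",
  "stage3_gather_16bit_weights_on_model_save": true
}

With 4 x 12 GB GPUs and a 6B model you may additionally need CPU offload of parameters and/or optimizer states ("offload_param" / "offload_optimizer" with "device": "cpu"), depending on batch size and sequence length.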

aohan237 commented 9 months ago

"zero_optimization": { "stage": 2,

ZeRO stage 2 does not partition the model parameters (only the optimizer states & gradients, which are training specific). DeepSpeed will still load the entire model parameters for forward passes on each GPU.

You need to enable ZeRO stage 3 so that DeepSpeed partitions the model itself across your 4 GPUs.

Thanks, I read the DeepSpeed docs; it seemed to me that stage 2 did the partitioning.
I also tried stage 3 and PyTorch FSDP via the accelerate library, and both ended in OOM. Is there anything I should know?

I will try stage 3 again
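
One thing worth checking for the retry: with the Transformers + ZeRO-3 integration, the model has to be loaded after the DeepSpeed config is known, otherwise every rank materializes the full model before partitioning and OOMs on small GPUs. A minimal sketch of that load order (model name, config path, and hyperparameters are placeholders, not taken from this thread):

# Sketch only: create TrainingArguments with the ZeRO-3 config *before*
# calling from_pretrained, so Transformers wraps loading in
# deepspeed.zero.Init() and each rank only materializes its own shard.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed="ds_zero3.json",  # hypothetical path to a stage 3 config
)

tokenizer = AutoTokenizer.from_pretrained("my-6b-model")     # placeholder model id
model = AutoModelForCausalLM.from_pretrained("my-6b-model")  # note: no device_map here
# ...then hand model, tokenizer and training_args to SFTTrainer as usual.

Also, device_map="auto" should not be combined with DeepSpeed; the launcher (deepspeed or accelerate launch) already starts one process per GPU.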

mano3-1 commented 3 months ago

Hi @aohan237, I am facing a similar issue; did you figure out why this is happening? Note: I was using zero3 and it still didn't work. Here is my zero3 config:

compute_environment: LOCAL_MACHINE
debug: True
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  zero_quantized_weights: true,
  zero_hpz_partition_size: 8,
  zero_quantized_gradients: true,

  contiguous_gradients: true,
  overlap_comm: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

I have 8 GPUs on my machine. With device_map="auto", the training worked, but when using DeepSpeed I get OOM errors with the above config.
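
As a sanity check, DeepSpeed ships a helper that estimates how much GPU and CPU memory the ZeRO-3 model states need for a given model and GPU count, which can help separate a genuine capacity problem from a configuration problem. A minimal sketch (the model id is a placeholder; the estimate itself only needs CPU RAM):

# Sketch: estimate ZeRO-3 memory needs before launching a full run.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForCausalLM.from_pretrained("my-6b-model")  # placeholder model id
# Prints per-GPU and CPU memory estimates for params, gradients and
# optimizer states under ZeRO-3, with and without offloading.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)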

fclearner commented 3 months ago

I have met the same problem. Have you solved it? In my case, it works well on a single GPU.

sasaadi commented 2 days ago

I am having the same problem. How did you solve it?