[Open] aohan237 opened this issue 11 months ago
"zero_optimization": { "stage": 2,
ZeRO stage 2 does not partition the model parameters; it only partitions the optimizer states and gradients, which are training-specific. DeepSpeed will still load the full model parameters on each GPU for the forward pass.
You need to enable ZeRO stage 3 so that DeepSpeed partitions the model itself across your 4 GPUs.
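For reference, here is a minimal ZeRO-3 `zero_optimization` block in the style of the examples in the Hugging Face DeepSpeed integration docs. This is a sketch, not a drop-in guarantee: the "auto" values are filled in by the HF Trainer, and the CPU offload entries are optional but usually necessary on small (e.g. 12 GB) cards:

```json
"zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param":     { "device": "cpu", "pin_memory": true }
}
```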
"zero_optimization": { "stage": 2,
ZeRO stage 2 does not partition the model parameters (only the optimizer states & gradients, which are training specific). DeepSpeed will still load the entire model parameters for forward passes on each GPU.
You need to enable ZeRO stage 3 so that DeepSpeed partitions the model itself across your 4 GPUs.
Thanks. I read the DeepSpeed docs, and it seemed to me that stage 2 does handle partitioning.
I tried stage 3 and PyTorch FSDP via the Accelerate library; both ended in OOM. Is there anything else I should know?
I will try stage 3 again.
Hi @aohan237, I am facing a similar issue. Did you figure out why this is happening? Note: I was using ZeRO-3 and it still didn't work. Here is my ZeRO-3 config:
compute_environment: LOCAL_MACHINE
debug: True
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  zero_quantized_weights: true
  zero_hpz_partition_size: 8
  zero_quantized_gradients: true
  contiguous_gradients: true
  overlap_comm: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
I have 8 GPUs on my machine. With device_map="auto" the training worked, but with DeepSpeed I am getting OOM errors with the above config.
I have run into the same problem. Have you solved it? In my case it works well on a single GPU.
I am having the same problem. How did you solve it?
I am using the Hugging Face TRL SFTTrainer with PEFT and DeepSpeed to train a 6B model. I have 4 GPUs with 12 GB each.
When I use AutoModel.from_pretrained(device_map="auto"), it works, but training is very slow and GPU utilization is only around 25%.
So I tried DeepSpeed, but device_map does not work together with DeepSpeed, so I removed it.
I copied a stage 2 config from the Hugging Face tutorial:
{ "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "weight_decay": "auto", "torch_adam": true, "adam_w_mode": true } }, "scheduler": { "type": "WarmupDecayLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto", "total_num_steps": "auto" } }, "zero_optimization": { "stage": 2, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": "auto", "contiguous_gradients": true }, "gradient_accumulation_steps": 1, "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
Then I tried to run with DeepSpeed. I assumed DeepSpeed would shard a model that cannot fit on a single GPU.
But DeepSpeed always tries to load the whole model on each GPU; it does not shard the model at load time, and it always ends in OOM.
I suspect that even when you train a bigger model such as 130B, no single GPU has enough VRAM to hold the whole model, so you still have to shard it.
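To make the memory argument concrete, here is a rough back-of-envelope estimate for a 6B model (pure illustration, assuming full fine-tuning with Adam in fp16; with PEFT/LoRA the gradient and optimizer terms shrink, but the frozen fp16 base weights alone are already about 11 GiB):

```python
# Rough memory estimate for a 6B-parameter model (illustrative numbers only).
# Assumes full fine-tuning with Adam in fp16 mixed precision; activations,
# CUDA context and fragmentation come on top of this.
params = 6e9
GiB = 2**30

weights_fp16   = params * 2    # ~11 GiB of fp16 weights
gradients_fp16 = params * 2    # ~11 GiB of fp16 gradients
adam_fp32      = params * 12   # fp32 master weights + momentum + variance, ~67 GiB

total = weights_fp16 + gradients_fp16 + adam_fp32
print(f"unsharded model states: {total / GiB:.0f} GiB")    # ~89 GiB

# ZeRO-2 over 4 GPUs shards only gradients and optimizer states,
# so every GPU still holds the full fp16 weights:
zero2_per_gpu = weights_fp16 + (gradients_fp16 + adam_fp32) / 4
print(f"ZeRO-2, per GPU: {zero2_per_gpu / GiB:.0f} GiB")    # ~31 GiB -> OOM on 12 GB

# ZeRO-3 shards the weights as well:
zero3_per_gpu = total / 4
print(f"ZeRO-3, per GPU: {zero3_per_gpu / GiB:.0f} GiB")    # ~22 GiB -> still tight on 12 GB
```

With LoRA the gradient and optimizer terms shrink dramatically, but the roughly 11 GiB of frozen fp16 base weights still does not fit a 12 GB card unless it is partitioned (ZeRO-3) or offloaded.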
So it should work, but it does not.
Can you tell me why? Or is there anything I need to know?