philschmid / deep-learning-pytorch-huggingface

MIT License

Does deepspeed partition the model to multi GPUs? #15

Open vikki7777 opened 1 year ago

vikki7777 commented 1 year ago

I am trying to run your code with the flan-t5-base model. I have one machine with multiple V100s, each GPU with 16 GB of memory, and I ran into the following question: with a single GPU, the GPU memory usage is 11.5 GB; with 4 GPUs, each GPU's memory usage is 11.7 GB. DeepSpeed says it can partition the model, so I don't understand why the GPU memory usage is almost the same?

single gpu (screenshot: 1gpu):

```bash
deepspeed --num_gpus=1 run_seq2seq_deepspeed.py \
    --model_id model \
    --dataset_path dataset \
    --epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --generation_max_length 111 \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_offload.json
```

multi gpus (screenshot: 4gpus):

```bash
deepspeed --num_gpus=4 run_seq2seq_deepspeed.py \
    --model_id model \
    --dataset_path dataset \
    --epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --generation_max_length 111 \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_offload.json
```

ds_config:

```json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
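For context, my understanding is that the "auto" entries above get filled in by the Hugging Face Trainer from the training arguments, roughly like this (a minimal sketch, not the exact run_seq2seq_deepspeed.py):

```python
from transformers import Seq2SeqTrainingArguments

# Minimal sketch: the "auto" values in the DeepSpeed config (lr, batch size,
# fp16, scheduler) are resolved by the Trainer from these arguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="output",                      # hypothetical output dir
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    num_train_epochs=3,
    generation_max_length=111,
    fp16=True,                                # maps to "fp16.enabled": "auto"
    deepspeed="configs/ds_flan_t5_z3_offload.json",
)
```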

philschmid commented 1 year ago

Your model is small enough to fit on a single GPU. DeepSpeed then does data parallelism and runs a full copy of the model on each GPU, which is why the per-GPU memory stays roughly the same. You should see a faster time to train.
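As a rough sanity check, DeepSpeed ships a ZeRO-3 memory estimator that shows how the per-GPU requirement for model states (parameters, gradients, optimizer states) changes with GPU count; a minimal sketch, assuming the Hub id google/flan-t5-base:

```python
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load on CPU just to count parameters; no GPU is needed for the estimate.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Prints estimated per-GPU / CPU memory for ZeRO-3 with and without offload.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
```

Note the estimate covers only model states; activations and temporary buffers still scale with the per-device batch size, which is one reason observed usage does not drop linearly with the number of GPUs.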

vikki7777 commented 1 year ago

I tried two other models: flan-t5-xl and flan-t5-xxl.

flan-t5-xl with single gpu:

(screenshot: flan-t5-xl-1gpu)

```bash
deepspeed --include localhost:1 run_seq2seq_deepspeed.py \
    --model_id model_xl \
    --dataset_path dataset \
    --epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --generation_max_length 111 \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_offload.json
```

flan-t5-xl with six gpus:

(screenshot: flan-t5-xl-4gpus)

```bash
deepspeed --include localhost:1,2,3,4,5,6 run_seq2seq_deepspeed.py \
    --model_id model_xl \
    --dataset_path dataset \
    --epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --generation_max_length 111 \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_offload.json
```

flan-t5-xxl with single gpu:

(screenshot: flan-t5-xxl-1gpu)

flan-t5-xxl with six gpus: also CUDA out of memory.

For flan-t5-xxl, both runs go out of memory, and I cannot see any model partitioning happening. For flan-t5-xl, I thought that if the model occupies N of GPU memory on a single GPU, then with multiple GPUs each GPU should only need roughly N divided by the number of GPUs. Is this correct? However, the results show that with multiple GPUs the memory usage of each GPU actually increases (from 8 GB to 12 GB). Why? Looking forward to your reply :)
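In case it is useful for comparing the runs, here is a small sketch of how per-GPU peak tensor memory could be logged from inside the training script (not the exact script I ran):

```python
import torch
import torch.distributed as dist

def log_peak_memory(tag: str) -> None:
    # Peak memory allocated by tensors on this rank's GPU, in GiB.
    # nvidia-smi reads higher because it also counts the CUDA context
    # and the caching allocator's reserved-but-unused memory.
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] {tag}: peak allocated {peak_gib:.2f} GiB")
```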

vikki7777 commented 1 year ago

@philschmid Could you help me solve the above problem? Thanks a lot.

yulinliu101 commented 7 months ago

Hi Vikki, did you figure out this issue? I have 4 V100s and have observed a similar situation with GPU memory management. Even with CPU offload, I wasn't able to fine-tune Flan-T5-XL (fp32) on my hardware. What is also mysterious is that I could tune it when loading the model with torch_dtype=torch.bfloat16, even though, as the OP said, the V100 does not support the bf16 dtype. Any insights or recommendations will be super helpful!

Thanks!

Attaching some dependencies info:

```
CUDA driver==12.1
transformers==4.35.2
torch==2.1.1+cu121
accelerate==0.25.0
peft==0.7.0
deepspeed==0.12.4
```
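For reference, this is roughly how I would check what the device reports about bf16 (a quick sketch, assuming the PyTorch version pinned above):

```python
import torch

# V100 is compute capability 7.0; native bf16 arrived with Ampere (8.0).
# PyTorch may still allow bf16 tensors on older GPUs via casting/emulation,
# which could be why loading in bfloat16 works at all here.
print(torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("is_bf16_supported:", torch.cuda.is_bf16_supported())
```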