vikki7777 opened 1 year ago
Your model is small enough to fit on a single GPU. DeepSpeed then does data parallelism and runs a replica of the model on each GPU. You should see a faster time to train, not lower per-GPU memory.
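If you want to sanity-check what ZeRO-3 actually needs per GPU, DeepSpeed ships a memory estimator. A minimal sketch (the model id and GPU counts below are just example values; the estimate covers model states only, not activations or communication buffers):

# Hedged sketch: compare ZeRO-3 per-GPU model-state needs for 1 vs. 4 GPUs.
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Prints estimated CPU/GPU memory for params, grads, and optimizer states
# under ZeRO-3 with and without offload.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)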
I also tried the other two models: flan-t5-xl and flan-t5-xxl.
flan-t5-xl with single gpu:
deepspeed --include localhost:1 run_seq2seq_deepspeed.py \
--model_id model_xl \
--dataset_path dataset \
--epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--generation_max_length 111 \
--lr 1e-4 \
--deepspeed configs/ds_flan_t5_z3_offload.json
flan-t5-xl with six gpus:
deepspeed --include localhost:1,2,3,4,5,6 run_seq2seq_deepspeed.py \
--model_id model_xl \
--dataset_path dataset \
--epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--generation_max_length 111 \
--lr 1e-4 \
--deepspeed configs/ds_flan_t5_z3_offload.json
flan-t5-xxl with a single gpu: CUDA out of memory
flan-t5-xxl with six gpus: also CUDA out of memory
For flan-t5-xxl, both runs go out of memory, and I never get far enough to see the model being partitioned. For flan-t5-xl, I assumed that if the model occupies N GB on a single GPU, then with k GPUs each GPU should only need about N/k GB. Is that correct? However, the results show that with multiple GPUs the memory usage of each GPU actually increases (from 8 GB to 12 GB). Why? Looking forward to your reply :)
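For reference, here is the back-of-envelope I had in mind (a rough sketch; the parameter count is approximate, and only the partitioned model states are expected to scale with the GPU count):

# Rough back-of-envelope (assumed numbers): under ZeRO-3 only the
# partitioned model states (params + grads + Adam moments, ~16 bytes/param
# in fp32) scale as 1/k with the GPU count k; activations and DeepSpeed's
# communication/prefetch buffers are allocated per GPU and do not shrink.
params = 2.85e9                          # flan-t5-xl, approximate
fp32_state_gb = params * 16 / 2**30      # fp32 params + grads + Adam states

for k in (1, 6):
    print(f"{k} GPU(s): partitioned states ~ {fp32_state_gb / k:.1f} GB "
          "(on CPU here, due to offload) + per-GPU activations/buffers")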
@philschmid Could you help me solve the above problem? Thanks a lot.
Hi Vikki, did you figure out this issue? I have 4 V100s and have observed similar GPU memory behavior. Even with CPU offload, I wasn't able to fine-tune Flan-T5-XL (fp32) on my hardware. It is also mysterious that I could tune it when loading the model with torch_dtype=torch.bfloat16, even though, as noted above, the V100 does not support the bf16 dtype. Any insights or recommendations would be super helpful!
Thanks!
Attaching some dependency info:
CUDA driver==12.1
transformers==4.35.2
torch==2.1.1+cu121
accelerate==0.25.0
peft==0.7.0
deepspeed==0.12.4
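For completeness, this is roughly how I loaded the model in bf16 (a minimal sketch; the model id is an example, and V100s lack native bf16 compute, so behavior may vary):

# Hedged sketch of the bf16 loading path mentioned above.
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xl",          # example model id
    torch_dtype=torch.bfloat16,   # halves weight memory vs. fp32
)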
I am trying to run your code with the flan-t5-base model. I have one machine with multiple V100s, each with 16 GB of memory, and I ran into the following question: when I use a single GPU, GPU memory usage is 11.5 GB; when I use 4 GPUs, each GPU's memory usage is 11.7 GB. DeepSpeed says it can partition the model, so I don't understand why the per-GPU memory usage is almost the same?
single gpu:
deepspeed --num_gpus=1 run_seq2seq_deepspeed.py \
--model_id model \
--dataset_path dataset \
--epochs 3 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--generation_max_length 111 \
--lr 1e-4 \
--deepspeed configs/ds_flan_t5_z3_offload.json
multi gpus:
deepspeed --num_gpus=4 run_seq2seq_deepspeed.py \
--model_id model \
--dataset_path dataset \
--epochs 3 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--generation_max_length 111 \
--lr 1e-4 \
--deepspeed configs/ds_flan_t5_z3_offload.json
ds_config:
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
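As I understand it, the "auto" fields in this config are resolved by the Hugging Face Trainer from its TrainingArguments at runtime. A minimal sketch of how the config file is wired in (argument values mirror my command above; output_dir is a placeholder):

# Hedged sketch: the HF Trainer fills the "auto" fields in the DeepSpeed
# config (batch size, lr, scheduler warmup, etc.) from these arguments.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="out",                  # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=16,    # -> train_micro_batch_size_per_gpu
    per_device_eval_batch_size=16,
    learning_rate=1e-4,                # -> optimizer/scheduler "auto" lr fields
    deepspeed="configs/ds_flan_t5_z3_offload.json",
)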