microsoft / DeepSpeedExamples

Example models using DeepSpeed

SFT training, single GPU (V100 32G): how do I adjust my parameters to avoid OOM? Thanks. #389

Open Modas-Li opened 1 year ago

Modas-Li commented 1 year ago

OutOfMemoryError: CUDA out of memory. Tried to allocate 3.82 GiB (GPU 0; 31.75 GiB total capacity; 23.21 GiB already allocated; 2.43 GiB free; 25.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

[2023-04-21 19:09:43,054] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15706
[2023-04-21 19:09:43,055] [ERROR] [launch.py:434:sigkill_handler] ['/data/anaconda3/bin/python', '-u', '/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py', '--local_rank=0', '--model_name_or_path', '/data/bloom-1b1', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--deepspeed', '--output_dir', '/data/deepspeed_output/step1/output_sft_0421_bloom1b1', '--per_device_train_batch_size', '1', '--num_train_epochs', '1', '--data_path', 'xc_data', '--gradient_checkpointing', '--zero_stage', '3'] exits with return code = 1
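As a quick first step, the allocator hint printed in the error message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before relaunching; the 128 MB split size below is only an assumed starting point, not a tuned value:

# Assumed starting value; tune per the PyTorch memory-management docs
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# then re-run the same deepspeed launch of step1_supervised_finetuning/main.py as above

This only mitigates fragmentation; if the model and optimizer states simply do not fit, the batch-size and ZeRO suggestions below are the more effective levers.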

Modas-Li commented 1 year ago

Bloom-1.1b

mrwyattii commented 1 year ago

Hi @Modas-Li, there are several parameters you can adjust, such as the batch size and zero_stage. Please see this error message in the training script for more suggestions: https://github.com/microsoft/DeepSpeedExamples/blob/dcf67c001702811bfea7aec715844882bb44ee77/applications/DeepSpeed-Chat/train.py#L181
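A minimal single-GPU launch along those lines might look like the sketch below, reusing the paths from the command in the original report; the --offload and --only_optimize_lora flags are assumptions about what this revision of step1_supervised_finetuning/main.py accepts, so check main.py --help before relying on them:

# ZeRO stage 3 with (assumed) CPU offload, gradient checkpointing, and LoRA-only
# optimization to reduce GPU memory on a single V100 32G
deepspeed --num_gpus 1 main.py \
    --model_name_or_path /data/bloom-1b1 \
    --data_path xc_data \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --zero_stage 3 \
    --offload \
    --lora_dim 128 \
    --only_optimize_lora \
    --deepspeed \
    --output_dir /data/deepspeed_output/step1/output_sft_0421_bloom1b1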

variationalkk commented 1 year ago

@Modas-Li you can change the batch size in the .sh file for each step. For example, add "--per_device_train_batch_size 4 --per_device_eval_batch_size 4" to the step 1 script (a sketch is shown below): https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh
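Concretely, the edit to run_1.3b.sh might look like the sketch below; the surrounding script contents are paraphrased rather than quoted from the repository, so append the two flags to whatever launch line the file actually contains:

# Paraphrased sketch of training_scripts/single_gpu/run_1.3b.sh with the extra batch-size flags appended
OUTPUT=./output
mkdir -p $OUTPUT
deepspeed --num_gpus 1 main.py \
    --model_name_or_path facebook/opt-1.3b \
    --zero_stage 0 \
    --deepspeed \
    --output_dir $OUTPUT \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4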