microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Finetuning crashes with return code=-7 #3603

Closed: chaitanyamalaviya closed this issue 1 year ago

chaitanyamalaviya commented 1 year ago

Describe the bug
I am finetuning t5-xl models with the HuggingFace trainer and DeepSpeed. However, during training (often early on), the process crashes with the message "exits with return code = -7" and no other error traceback.
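
For what it's worth, my reading of the return code (an assumption on my part, not something the log states explicitly): a negative return code from a launched process conventionally means the process was killed by that signal number, so -7 would correspond to signal 7, which is SIGBUS on Linux. For example:

# A negative return code conventionally means "killed by signal <n>";
# signal 7 on Linux is SIGBUS (assuming the launcher surfaces the raw return code).
kill -l 7    # prints: BUS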

To Reproduce
I use the following command to run finetuning. Note that I am following a workflow similar to this tutorial. This issue is also relevant.

deepspeed --num_gpus=4 --master_port=$RANDOM src/experiments/run_gen_qa.py \
--model_name_or_path google/t5-v1_1-xl \
--train_file data/quoref/proc_train.json \
--validation_file data/quoref/proc_val.json \
--context_column context \
--question_column question \
--answer_column answer \
--do_train \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 48 \
--learning_rate 1e-4 \
--num_train_epochs 3 \
--max_seq_length 512 \
--eval_accumulation_steps 100 \
--predict_with_generate \
--save_strategy "no" \
--save_total_limit 1 \
--gradient_checkpointing True \
--overwrite_cache True \
--report_to wandb \
--logging_steps 5 \
--output_dir models/t5_xl_finetuned_quoref_bs2 \
--overwrite_output_dir True \
--deepspeed src/configs/ds_new_config.json

ds_config file (src/configs/ds_new_config.json):

{
    "bf16": {
      "enabled": "auto"
    },
    "optimizer": {
      "type": "AdamW",
      "params": {
        "lr": "auto",
        "betas": "auto",
        "eps": "auto",
        "weight_decay": "auto"
      }
    },
    "scheduler": {
      "type": "WarmupLR",
      "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto"
      }
    },
    "zero_optimization": {
      "stage": 3,
      "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
      },
      "offload_param": {
        "device": "cpu",
        "pin_memory": true
      },
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": "auto",
      "stage3_prefetch_bucket_size": "auto",
      "stage3_param_persistence_threshold": "auto",
      "stage3_max_live_parameters": 1e9,
      "stage3_max_reuse_distance": 1e9,
      "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

ds_report output

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/mnt/nlpgridio3/data/cmalaviya/anaconda3/lib/python3.8/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/mnt/nlpgridio3/data/cmalaviya/anaconda3/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

Screenshots

1) Screenshot showing the error trace (image attachment, not reproduced here).

System info: see the ds_report output above.

Launcher context: deepspeed launcher

Docker context: not using Docker.

Additional context
This does not appear to be an OOM problem, as increasing the memory allocated to my job (up to 500G) has not helped. The crash also appears to happen at random points during finetuning (e.g., right at the beginning or 100 steps in).
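
One generic way to double-check the "not OOM" observation (not DeepSpeed-specific, and it assumes access to the node's kernel log) is to look at dmesg right after a crash:

# The kernel log usually records OOM kills and bus errors;
# an empty result is consistent with the crash not being an OOM kill.
dmesg -T | grep -iE 'out of memory|killed process|bus error' | tail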

Tagging @jomayeri for help. Thanks a lot!

chaitanyamalaviya commented 1 year ago

@jomayeri just wanted to follow up about this. The issue still persists, so I would appreciate any help. Thanks a lot!

jomayeri commented 1 year ago

@chaitanyamalaviya I am unable to repro this issue on a box with 8xV100 32GB and 500GB of CPU memory. For further debugging I would advise:

  1. Launching DeepSpeed in one pane and watching htop and nvidia-smi in another to observe process and memory consumption at the point of failure, and seeing whether there is a pattern.
  2. Switching one of the offloads from CPU to NVMe (see the config sketch below).
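
For reference, a minimal sketch of what option 2 could look like inside the zero_optimization block of the config above, assuming a local NVMe mount at /local_nvme (the path is a placeholder and should point at a fast local SSD). Note that NVMe offload relies on DeepSpeed's async_io op, and the ds_report above shows async_io as unavailable, so libaio would need to be installed first:

      "offload_optimizer": {
        "device": "nvme",
        "nvme_path": "/local_nvme",
        "pin_memory": true
      },
      "offload_param": {
        "device": "cpu",
        "pin_memory": true
      }

This switches only the optimizer offload to NVMe, as suggested; the rest of the zero_optimization section stays unchanged.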