lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Error when trying to finetune `lmsys/vicuna-7b-v1.5` with 6 A100 40G GPUs #2318

Open kunqian-58 opened 1 year ago

kunqian-58 commented 1 year ago

Script used to fine-tune lmsys/vicuna-7b-v1.5:

CUDA_VISIBLE_DEVICES="7,6,5,4,3,2" torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path lmsys/vicuna-7b-v1.5   \
    --data_path data/vicuna_dummy_train.json \
    --bf16 True \
    --output_dir output_vicuna \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 258 \
    --gradient_checkpointing True \
    --lazy_preprocess True

I got the following error, which is identical to a previous issue, #1428:

/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/torch/distributed/fsdp/_state_dict_utils.py:312: UserWarning: Failed to clone() tensor with name lm_head.weight on rank 2. This may mean that this state_dict entry could point to invalid memory regions after returning from state_dict() call if this parameter is managed by FSDP. Please check clone implementation of lm_head.weight. Error: CUDA out of memory. Tried to allocate 500.00 MiB (GPU 2; 39.39 GiB total capacity; 37.52 GiB already allocated; 58.00 MiB free; 38.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  warnings.warn(
Traceback (most recent call last):
  File "/home/kun/llm_fine_tune/vicuna-finetune/fast-chat-training/FastChat/fastchat/train/train_mem.py", line 13, in <module>
    train()
  File "/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/fastchat/train/train.py", line 284, in train
    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
  File "/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/fastchat/train/train.py", line 76, in safe_save_model_for_hf_trainer
    cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
  File "/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/fastchat/train/train.py", line 76, in <dictcomp>
    cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
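
For context on where this fails: safe_save_model_for_hf_trainer calls state_dict() on the FSDP-wrapped model, which gathers the full unsharded weights on the GPU before the per-tensor copy to CPU, so the gather itself can run out of memory. Below is a minimal sketch of the CPU-offload pattern from the public PyTorch FSDP API that is commonly suggested for this (the helper name is illustrative, not FastChat's actual code):

from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)

def gather_full_state_dict_on_cpu(model):
    # Stream each shard to CPU as it is gathered and keep the result on
    # rank 0 only, so no single GPU has to hold a full extra copy of the
    # weights on top of the training state.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        return model.state_dict()

A state dict gathered this way can then be handed to the existing save path instead of copying tensor by tensor from GPU on every rank.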

While checking the CUDA usage, I found that the memory usage distribution is highly skewed: GPU 7 (in my case) reached full capacity while the other GPUs were not fully used. Any ideas?
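
A minimal sketch for checking the per-rank memory programmatically, assuming torch.distributed has already been initialized by torchrun (the helper name is illustrative):

import torch
import torch.distributed as dist

def log_gpu_memory(tag):
    # Report allocated/reserved memory for this rank's current CUDA device.
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[rank {rank}] {tag}: allocated {allocated:.2f} GiB, "
          f"reserved {reserved:.2f} GiB")

Calling this just before and after the save step would show whether a single rank spikes; with CUDA_VISIBLE_DEVICES="7,6,5,4,3,2", local rank 0 maps to physical GPU 7, which would explain that GPU filling up if rank 0 is the one gathering the weights.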

jp-sft commented 10 months ago

+1