lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Error when trying to finetune `lmsys/vicuna-7b-v1.5` with 6 A100 40G GPUs #2318

Open kunqian-58 opened 1 year ago

kunqian-58 commented 1 year ago

Script used to fine-tune lmsys/vicuna-7b-v1.5:

CUDA_VISIBLE_DEVICES="7,6,5,4,3,2" torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path lmsys/vicuna-7b-v1.5   \
    --data_path data/vicuna_dummy_train.json \
    --bf16 True \
    --output_dir output_vicuna \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 258 \
    --gradient_checkpointing True \
    --lazy_preprocess True

I got the following error, which is identical to a previous issue, #1428:

/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/torch/distributed/fsdp/_state_dict_utils.py:312: UserWarning: Failed to clone() tensor with name lm_head.weight on rank 2. This may mean that this state_dict entry could point to invalid memory regions after returning from state_dict() call if this parameter is managed by FSDP. Please check clone implementation of lm_head.weight. Error: CUDA out of memory. Tried to allocate 500.00 MiB (GPU 2; 39.39 GiB total capacity; 37.52 GiB already allocated; 58.00 MiB free; 38.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  warnings.warn(
Traceback (most recent call last):
  File "/home/kun/llm_fine_tune/vicuna-finetune/fast-chat-training/FastChat/fastchat/train/train_mem.py", line 13, in <module>
    train()
  File "/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/fastchat/train/train.py", line 284, in train
    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
  File "/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/fastchat/train/train.py", line 76, in safe_save_model_for_hf_trainer
    cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
  File "/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/fastchat/train/train.py", line 76, in <dictcomp>
    cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
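
For context on where this fails: safe_save_model_for_hf_trainer calls state_dict() on the FSDP-wrapped model, which gathers the full unsharded weights on the GPU before the per-tensor copy to CPU, so the gather itself can run out of memory. Below is a minimal sketch of the CPU-offload pattern from the public PyTorch FSDP API that is commonly suggested for this (the helper name is illustrative, not FastChat's actual code):

from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)

def gather_full_state_dict_on_cpu(model):
    # Stream each shard to CPU as it is gathered and keep the result on
    # rank 0 only, so no single GPU has to hold a full extra copy of the
    # weights on top of the training state.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        return model.state_dict()

A state dict gathered this way can then be handed to the existing save path instead of copying tensor by tensor from GPU on every rank.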

While checking the CUDA usage, I found that the memory usage distribution is highly skewed: GPU 7 (in my case) reached full capacity while the other GPUs were not fully used. Any ideas?
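
A minimal sketch for checking the per-rank memory programmatically, assuming torch.distributed has already been initialized by torchrun (the helper name is illustrative):

import torch
import torch.distributed as dist

def log_gpu_memory(tag):
    # Report allocated/reserved memory for this rank's current CUDA device.
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[rank {rank}] {tag}: allocated {allocated:.2f} GiB, "
          f"reserved {reserved:.2f} GiB")

Calling this just before and after the save step would show whether a single rank spikes; with CUDA_VISIBLE_DEVICES="7,6,5,4,3,2", local rank 0 maps to physical GPU 7, which would explain that GPU filling up if rank 0 is the one gathering the weights.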

jp-sft commented 10 months ago

+1