lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Fine-tuning Vicuna-7B with Local GPUs #1428

Open lovelucymuch opened 1 year ago

lovelucymuch commented 1 year ago

torchrun --nproc_per_node=4 --master_port=20001 /raid/users/lifei/FastChat/fastchat/train/train_mem.py \
    --model_name_or_path /raid/users/mrh/weights/vicuna-7b \
    --data_path /raid/users/lifei/FastChat/playground/data/dummy.json \
    --bf16 True \
    --output_dir output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
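
The error message suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so that the stack trace points at the call that actually failed. A minimal sketch of preparing such an environment from Python follows. The helper name `blocking_env` is hypothetical and not part of FastChat; it only assumes that the training command is relaunched as a child process.

```python
import os


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    `blocking_env` is a hypothetical helper for illustration only.
    """
    # Copy so the current process environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 is the setting named in the error message.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` together with the same torchrun command. Setting the variable directly in the shell before the command has the same effect.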

PolarPeak commented 1 year ago

me too

RomankovSergey commented 1 year ago

me too :(

Sudz24 commented 1 year ago

Yes, it's around this line:

cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}

What could be the issue?
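
One way to narrow this down might be to move the entries to CPU one at a time and record which key raises, instead of using a single dict comprehension. The sketch below assumes each value exposes a `.cpu()` method, as torch tensors do; the helper name `offload_state_dict` is hypothetical and not part of FastChat.

```python
def offload_state_dict(state_dict):
    """Move each entry to CPU individually and record the keys that fail.

    Hypothetical debugging helper; assumes values have a .cpu() method.
    """
    cpu_state_dict = {}
    failed = {}
    for key, value in state_dict.items():
        try:
            cpu_state_dict[key] = value.cpu()
        except RuntimeError as exc:
            # Keep the message so the failing parameter can be reported.
            failed[key] = str(exc)
    return cpu_state_dict, failed
```

Because CUDA errors can be reported asynchronously, the key recorded here is only reliable when the run uses CUDA_LAUNCH_BLOCKING=1, and after a first CUDA error later calls may fail as well, so the first failing key is the most informative one.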

richagadgil commented 1 year ago

Same here

kunqian-58 commented 1 year ago

same here