Closed sherryhongxy closed 1 year ago
accelerate launch src/train_bash.py
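For reference, a minimal sketch of a multi-GPU launch, assuming a single node with 2 GPUs; --multi_gpu and --num_processes are standard accelerate launch options, and the training arguments mirror the ones in the question below:

accelerate launch --multi_gpu --num_processes 2 src/train_bash.py \
    --stage sft \
    --model_name_or_path ../../llama_models/llama-2-7b-chat-hf \
    --prompt_template llama2 \
    --do_train \
    --dataset alpaca_gpt4_zh_demo \
    --finetuning_type lora \
    --output_dir sft_model \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --fp16

Note that the script path is passed directly to accelerate launch, without an extra python in between.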
Solved, it runs through now. But when I run the 70B model, I hit a similar error:
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export TORCH_CPP_LOG_LEVEL=INFO

accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path ../../llama_models/llama-2-70b-chat-hf \
    --prompt_template llama2 \
    --do_train \
    --dataset qac_data_demo \
    --finetuning_type lora \
    --output_dir sft_model \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
It runs out of GPU memory (or host memory).
Adding --quantization_bit 4 makes it run. Thanks!
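For anyone hitting the same OOM on 70B, a minimal sketch of where the flag goes, with the remaining arguments as in the full command above:

accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path ../../llama_models/llama-2-70b-chat-hf \
    --prompt_template llama2 \
    --finetuning_type lora \
    --quantization_bit 4 \
    --output_dir sft_model \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --fp16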
Single-GPU training works fine, but after switching to distributed training I get an error.

Training arguments:
#!/bin/bash
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export TORCH_CPP_LOG_LEVEL=INFO

accelerate launch python src/train_bash.py \
    --stage sft \
    --model_name_or_path ../../llama_models/llama-2-7b-chat-hf \
    --prompt_template llama2 \
    --do_train \
    --dataset alpaca_gpt4_zh_demo \
    --finetuning_type lora \
    --output_dir sft_model \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16