hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Multi-GPU training fails with ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 9805) of binary: /root/miniconda3/envs/llm/bin/python #297

Closed sherryhongxy closed 1 year ago

sherryhongxy commented 1 year ago

Single-GPU training works fine, but after switching to distributed training it fails with this error: [error screenshot]

Training arguments:

#!/bin/bash

export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export TORCH_CPP_LOG_LEVEL=INFO

accelerate launch python src/train_bash.py \
    --stage sft \
    --model_name_or_path ../../llama_models/llama-2-7b-chat-hf \
    --prompt_template llama2 \
    --do_train \
    --dataset alpaca_gpt4_zh_demo \
    --finetuning_type lora \
    --output_dir sft_model \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

hiyouga commented 1 year ago

accelerate launch src/train_bash.py
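The exitcode 2 most likely comes from the stray python in the launch line: accelerate launch already starts a Python interpreter for each process and treats its first positional argument as the training script, so each worker presumably tries to run a file literally named python and exits immediately. A minimal sketch of the difference:

# Wrong: "python" is taken as the training script; each spawned worker
# then fails to open a file named "python" and exits with code 2.
accelerate launch python src/train_bash.py

# Right: pass the script directly; the launcher starts a Python process per GPU.
accelerate launch src/train_bash.py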

sherryhongxy commented 1 year ago

That fixed it, training runs now. But when I try the 70B model, I hit a similar error: [error screenshot]

#!/bin/bash

export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export TORCH_CPP_LOG_LEVEL=INFO

accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path ../../llama_models/llama-2-70b-chat-hf \
    --prompt_template llama2 \
    --do_train \
    --dataset qac_data_demo \
    --finetuning_type lora \
    --output_dir sft_model \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

hiyouga commented 1 year ago

You ran out of GPU memory or system RAM.

sherryhongxy commented 1 year ago

Adding --quantization_bit 4 makes it run. Thanks!
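For reference, a sketch of where the flag fits in the 70B launch above (same arguments as that script; --quantization_bit 4 loads the base weights in 4-bit, QLoRA-style, so the LoRA fine-tune fits in memory):

accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path ../../llama_models/llama-2-70b-chat-hf \
    --prompt_template llama2 \
    --do_train \
    --dataset qac_data_demo \
    --finetuning_type lora \
    --quantization_bit 4 \
    --output_dir sft_model \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16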