hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Help wanted: mixed precision error on a first training run #1132

whitedogedev closed this issue 1 year ago

whitedogedev commented 1 year ago

```
(llama_etuning) jovyan@torch-llm-0:/lpai/zhangxuewen/LLaMA-Efficient-Tuning$ CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage pt \
    --model_name_or_path path_to_llama_model \
    --do_train \
    --dataset wiki_demo \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_pt_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16

/home/jovyan/anaconda3/envs/llama_etuning/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
FlashAttention-2 is not installed, ignore this if you are not using FlashAttention.
Traceback (most recent call last):
  File "/lpai/zhangxuewen/LLaMA-Efficient-Tuning/src/train_bash.py", line 14, in <module>
    main()
  File "/lpai/zhangxuewen/LLaMA-Efficient-Tuning/src/train_bash.py", line 5, in main
    run_exp()
  File "/lpai/zhangxuewen/LLaMA-Efficient-Tuning/src/llmtuner/tuner/tune.py", line 20, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args, general_args = get_train_args(args)
  File "/lpai/zhangxuewen/LLaMA-Efficient-Tuning/src/llmtuner/tuner/core/parser.py", line 104, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args, general_args = parse_train_args(args)
  File "/lpai/zhangxuewen/LLaMA-Efficient-Tuning/src/llmtuner/tuner/core/parser.py", line 74, in parse_train_args
    return _parse_args(parser, args)
  File "/lpai/zhangxuewen/LLaMA-Efficient-Tuning/src/llmtuner/tuner/core/parser.py", line 53, in _parse_args
    return parser.parse_args_into_dataclasses()
  File "/home/jovyan/anaconda3/envs/llama_etuning/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 120, in __init__
  File "/home/jovyan/anaconda3/envs/llama_etuning/lib/python3.10/site-packages/transformers/training_args.py", line 1442, in __post_init__
    raise ValueError(
ValueError: FP16 Mixed precision training with AMP or APEX (--fp16) and FP16 half precision evaluation (--fp16_full_eval) can only be used on CUDA or NPU devices or certain XPU devices (with IPEX).
```


hiyouga commented 1 year ago

Test whether `torch.cuda.is_available()` returns `True`.
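
For reference, a minimal diagnostic sketch, assuming it is run inside the same `llama_etuning` environment (the prints beyond `is_available()` are illustrative extras):

```python
import torch

# If this prints False, PyTorch cannot see a usable GPU, which is exactly
# the condition that makes transformers reject --fp16 in the traceback above.
print(torch.cuda.is_available())

# CUDA version this PyTorch build was compiled against; None means a CPU-only build.
print(torch.version.cuda)

# Number of GPUs visible to this process (respects CUDA_VISIBLE_DEVICES).
print(torch.cuda.device_count())
```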

heiqilin1985 commented 8 months ago

If you install only from requirements, it seems to default to the CPU-only build of PyTorch. If you have an NVIDIA card, you need to reinstall PyTorch yourself. With conda, activate the environment and run: `pip3 install numpy --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu118`

Also, to check the PyTorch version: start the interpreter with `python`, then import the packages and print their versions with `torch.__version__` and `transformers.__version__` (note the double underscores; plain `torch.version` is a submodule, not the version string). See the sketch below.

If the version string contains `cpu`, the install is definitely wrong.
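
As a runnable version of that check, the version strings in the comments are only examples:

```python
import torch
import transformers

# Note the double underscores: torch.version is a submodule,
# torch.__version__ is the version string.
print(torch.__version__)         # e.g. "2.1.0+cu118"; "2.1.0+cpu" would be a CPU-only build
print(transformers.__version__)  # e.g. "4.34.0"
```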