hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Running LLaMA-Factory SFT on Ascend 910B NPUs fails with: Attempted to set the storage of a tensor on device "npu:4" to a storage on different device "npu:0". This is no longer allowed; the devices must match #1600

Closed. Eisenhower closed this issue 10 months ago.

Eisenhower commented 11 months ago

Reminder

Reproduction

```
Traceback (most recent call last):
  File "/mnt/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/mnt/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/mnt/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/mnt/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 67, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1567, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2178, in _load_from_checkpoint
    model.load_adapter(resume_from_checkpoint, model.active_adapter, is_trainable=True)
  File "/usr/local/lib/python3.10/site-packages/peft/peft_model.py", line 629, in load_adapter
    adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
  File "/usr/local/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 224, in load_peft_weights
    adapters_weights = torch.load(filename, map_location=torch.device(device))
  File "/usr/local/lib/python3.10/site-packages/torch_npu/utils/serialization.py", line 176, in load
    return _load(opened_zipfile, map_location, pickle_module, overall_storage=overall_storage, **pickle_load_args)
  File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 1392, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 1366, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 1299, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 359, in _privateuse1_deserialize
    return getattr(obj, backend_name)(device_index)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/backend_registration.py", line 228, in wrap_storage_to
    untyped_storage.copy_(self, non_blocking)
RuntimeError: Attempted to set the storage of a tensor on device "npu:4" to a storage on different device "npu:0". This is no longer allowed; the devices must match.
```

```
/usr/local/lib/python3.10/tempfile.py:837: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpjm10ce_d'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/local/lib/python3.10/tempfile.py:837: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmprc7dq_yk'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/local/lib/python3.10/tempfile.py:837: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp1lhygs4d'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/local/lib/python3.10/tempfile.py:837: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpxyzexgpi'>
  _warnings.warn(warn_message, ResourceWarning)
/usr/local/lib/python3.10/tempfile.py:837: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp05vki2t1'>
  _warnings.warn(warn_message, ResourceWarning)
[2023-11-22 07:22:13,683] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 107901 closing signal SIGTERM
[2023-11-22 07:22:13,683] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 107903 closing signal SIGTERM
[2023-11-22 07:22:16,673] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 107902) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/train_bash.py FAILED
```
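The key frame is `load_peft_weights` calling `torch.load(filename, map_location=torch.device(device))`: the adapter checkpoint pickles its tensors with a saved location of `npu:0`, and the torch_npu deserializer refuses to restore that storage onto a different local device such as `npu:4`. A minimal sketch of the usual workaround, deserializing on CPU first and then moving tensors to the local device (the helper name is illustrative, not LLaMA-Factory or PEFT API):

```python
import torch
import torch_npu  # noqa: F401  -- importing registers the "npu" backend


def load_adapter_weights(filename: str, device: str) -> dict:
    """Deserialize on CPU, then move tensors to the target NPU.

    torch.load with map_location="cpu" sidesteps the npu:0 -> npu:4
    storage-restore path that raises the RuntimeError above.
    """
    state_dict = torch.load(filename, map_location="cpu")
    return {name: tensor.to(device) for name, tensor in state_dict.items()}


# e.g. on the rank that owns npu:4:
# adapters_weights = load_adapter_weights("ckpt/adapter_model.bin", "npu:4")
```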

Expected behavior

The SFT task should run normally.

System Info

Python 3.10.1, torch 2.1.0, torch_npu 2.1.0rc1, OS: Ubuntu 18.04

Others

No response

hiyouga commented 11 months ago

What command are you running?

Eisenhower commented 11 months ago

> What command are you running?

```sh
accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path /mnt/Llama-2-7b-hf \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ./ckpt \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
```

In `accelerate config` I selected multi_npu.
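Worth noting: the crash happens while resuming from a checkpoint (`trainer.train(resume_from_checkpoint=...)`), when each rank reloads the LoRA adapter. `accelerate launch` exports a `LOCAL_RANK` variable to every worker; a minimal sketch of pinning each process to its own NPU before any checkpoint is touched, assuming torch_npu 2.1.x (where importing `torch_npu` registers the `npu` backend and the `torch.npu` namespace):

```python
import os

import torch
import torch_npu  # noqa: F401  -- importing registers the "npu" backend

# Each worker spawned by `accelerate launch` receives its own LOCAL_RANK.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.npu.set_device(local_rank)  # subsequent "npu" allocations land on this card
print(f"rank {local_rank} pinned to npu:{local_rank}")
```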

Eisenhower commented 11 months ago

On a single NPU it currently appears to run normally. [screenshot]

louxingrui commented 11 months ago

How do you run on a single card on the 910B? Do I have to set the device to cpu? It seems to default to CUDA, and it errors out because cuda() is unavailable.
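You shouldn't need to fall back to CPU: with torch_npu the usual pattern is to replace CUDA-only calls like `.cuda()` with `.to(device)`. A minimal sketch, assuming torch_npu 2.1.x (which patches the `torch.npu` namespace in on import):

```python
import torch
import torch.nn as nn
import torch_npu  # noqa: F401  -- importing registers the "npu" backend

device = "npu:0" if torch.npu.is_available() else "cpu"
model = nn.Linear(8, 8).to(device)  # .to(device) instead of CUDA-only .cuda()
x = torch.randn(2, 8, device=device)
print(model(x).device)  # npu:0 if the card is visible, otherwise cpu
```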

hiyouga commented 11 months ago

Update the code and try again.

HIT-Owen commented 10 months ago

> How do you run on a single card on the 910B? Do I have to set the device to cpu? It seems to default to CUDA, and it errors out because cuda() is unavailable.

Hey, did you solve this? I don't want to use MindSpore.

Vinobugme commented 10 months ago

Could you explain how exactly to train models in an NPU environment?

TZJ12 commented 7 months ago

```
Traceback (most recent call last):
  File "/root/dabai/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/root/dabai/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/root/dabai/LLaMA-Factory/src/llmtuner/train/tuner.py", line 32, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/root/dabai/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 32, in run_sft
    dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="sft")
  File "/root/dabai/LLaMA-Factory/src/llmtuner/data/loader.py", line 140, in get_dataset
    for dataset_attr in get_dataset_list(data_args):
  File "/root/dabai/LLaMA-Factory/src/llmtuner/data/parser.py", line 62, in get_dataset_list
    raise ValueError(
ValueError: Cannot open data/dataset_info.json due to Expecting value: line 7 column 1 (char 6).
```

Could someone tell me how to resolve this?
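The `Expecting value: line 7 column 1` message comes from `json.load`, so `data/dataset_info.json` itself has a syntax error (often a trailing comma or a missing value left behind after editing). A quick way to pinpoint it:

```python
import json

# Validate the file LLaMA-Factory's dataset parser is failing on.
with open("data/dataset_info.json", encoding="utf-8") as f:
    try:
        json.load(f)
        print("dataset_info.json is valid JSON")
    except json.JSONDecodeError as e:
        print(f"JSON syntax error at line {e.lineno}, column {e.colno}: {e.msg}")
```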

ginreedcho commented 6 months ago

> ValueError: Cannot open data/dataset_info.json due to Expecting value: line 7 column 1 (char 6).

I'm hitting exactly the same problem. Did you manage to solve it in the end?