What command are you using?
accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path /mnt/Llama-2-7b-hf \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ./ckpt \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
In accelerate config I selected multi_npu.
Training on a single NPU card currently appears to run normally.
How do I run on a single card on the 910B? Do I have to set the device to cpu? It seems to default to cuda, and it errors out because cuda() is not available.
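For reference, a minimal single-card sketch on Ascend, assuming torch_npu is installed and registers the npu device (the tensor below is only illustrative), avoids the cuda default entirely:

import torch
import torch_npu  # Ascend plugin; registers the "npu" device type with PyTorch

# Prefer the first NPU when the backend is usable, otherwise fall back to CPU,
# instead of relying on the default cuda path that fails on a 910B.
if torch.npu.is_available():
    device = torch.device("npu:0")
    torch.npu.set_device(0)
else:
    device = torch.device("cpu")

x = torch.randn(2, 3).to(device)
print(x.device)  # expect npu:0 on a working single-card setup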
Please update the code and try again.
Hey, did you ever solve this? I don't want to use MindSpore.
Could you explain specifically how to train a model in an NPU environment?
Traceback (most recent call last):
  File "/root/dabai/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/root/dabai/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/root/dabai/LLaMA-Factory/src/llmtuner/train/tuner.py", line 32, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/root/dabai/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 32, in run_sft
    dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="sft")
  File "/root/dabai/LLaMA-Factory/src/llmtuner/data/loader.py", line 140, in get_dataset
    for dataset_attr in get_dataset_list(data_args):
  File "/root/dabai/LLaMA-Factory/src/llmtuner/data/parser.py", line 62, in get_dataset_list
    raise ValueError(
ValueError: Cannot open data/dataset_info.json due to Expecting value: line 7 column 1 (char 6).
Could someone tell me how to fix this error?
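Note that this ValueError is a JSON syntax failure in data/dataset_info.json rather than an NPU problem; a quick check along these lines (a sketch using only the path from the traceback) reports the exact offending position:

import json

# Re-parse the file that get_dataset_list() choked on and print the precise
# location of the syntax error (the traceback points at line 7, column 1).
try:
    with open("data/dataset_info.json", encoding="utf-8") as f:
        json.load(f)
    print("dataset_info.json is valid JSON")
except json.JSONDecodeError as err:
    print(f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")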
I ran into exactly the same problem. Did you manage to solve it in the end?
Reminder
Reproduction
Traceback (most recent call last):
File "/mnt/LLaMA-Factory/src/train_bash.py", line 14, in <module>
main()
File "/mnt/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/mnt/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/mnt/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 67, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1567, in train
self._load_from_checkpoint(resume_from_checkpoint)
File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2178, in _load_from_checkpoint
model.load_adapter(resume_from_checkpoint, model.active_adapter, is_trainable=True)
File "/usr/local/lib/python3.10/site-packages/peft/peft_model.py", line 629, in load_adapter
adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
File "/usr/local/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 224, in load_peft_weights
adapters_weights = torch.load(filename, map_location=torch.device(device))
File "/usr/local/lib/python3.10/site-packages/torch_npu/utils/serialization.py", line 176, in load
return _load(opened_zipfile, map_location, pickle_module, overall_storage=overall_storage, **pickle_load_args)
File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 1422, in _load
result = unpickler.load()
File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 1392, in persistent_load
typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 1366, in load_tensor
wrap_storage=restore_location(storage, location),
File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 1299, in restore_location
return default_restore_location(storage, str(map_location))
File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
result = fn(storage, location)
File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 359, in _privateuse1_deserialize
return getattr(obj, backend_name)(device_index)
File "/usr/local/lib/python3.10/site-packages/torch/utils/backend_registration.py", line 228, in wrap_storage_to
untyped_storage.copy_(self, non_blocking)
RuntimeError: Attempted to set the storage of a tensor on device "npu:4" to a storage on different device "npu:0". This is no longer allowed; the devices must match.
/usr/local/lib/python3.10/tempfile.py:837: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpjm10ce_d'>
_warnings.warn(warn_message, ResourceWarning)
/usr/local/lib/python3.10/tempfile.py:837: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmprc7dq_yk'>
_warnings.warn(warn_message, ResourceWarning)
/usr/local/lib/python3.10/tempfile.py:837: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp1lhygs4d'>
_warnings.warn(warn_message, ResourceWarning)
/usr/local/lib/python3.10/tempfile.py:837: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpxyzexgpi'>
_warnings.warn(warn_message, ResourceWarning)
/usr/local/lib/python3.10/tempfile.py:837: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp05vki2t1'>
_warnings.warn(warn_message, ResourceWarning)
[2023-11-22 07:22:13,683] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 107901 closing signal SIGTERM
[2023-11-22 07:22:13,683] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 107903 closing signal SIGTERM
[2023-11-22 07:22:16,673] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 107902) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
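The RuntimeError above occurs while resuming from a checkpoint: the LoRA adapter weights were serialized with storages bound to npu:0, and the rank that owns npu:4 tries to restore them onto a different device. One hedged workaround sketch, not taken from the repo (the file name and the LOCAL_RANK handling are illustrative), is to load the adapter onto CPU first and move it to the local NPU explicitly:

import os
import torch
import torch_npu  # Ascend backend involved in the failing torch.load call

# Illustrative only: remap an adapter checkpoint saved on npu:0 onto the NPU
# owned by this process, instead of letting torch.load restore it in place.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
weights = torch.load("ckpt/adapter_model.bin", map_location="cpu")
weights = {name: tensor.to(f"npu:{local_rank}") for name, tensor in weights.items()}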
Expected behavior
The SFT task should run normally.
System Info
Python 3.10.1, torch 2.1.0, torch_npu 2.1.0rc1, Ubuntu 18.04
Others
No response