What about single-card inference? Which environment variable should be set?
Looking at the code, the environment variable it reads is LOCAL_RANK. I tried that too, with no effect. Manually specifying "npu:0" didn't work either.
It should be CUDA_VISIBLE_DEVICES.
Thanks for the reply. Below is the log after adding the environment variable. Command:
CUDA_VISIBLE_DEVICES=0 python src/cli_demo.py \
--model_name_or_path /mnt/sdc/models/Qwen-1_8B-Chat \
--template qwen
@hiyouga
Following #I6KS6A, I set torch_npu.npu.set_device('npu:0') before model dispatch, but to no avail.
import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_weight_path, device_map="npu:0", torch_dtype=torch.bfloat16, trust_remote_code=True)
Give this a try.
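(As background, and as general torch_npu behavior rather than anything specific to this repo: importing transfer_to_npu automatically redirects torch.cuda.* calls and "cuda" device strings to their NPU equivalents, so CUDA-oriented scripts can run on Ascend mostly unchanged.)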
@70557dzqc Still the same problem:
RuntimeError: allocate:/usr1/02/workspace/j_vqN6BFvg/pytorch/torch_npu/csrc/core/npu/NPUCachingAllocator.cpp:1406 NPU error, error code is 107002
[Error]: The context is empty.
Check whether acl.rt.set_context or acl.rt.set_device is called.
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4290]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Is there any solution to this? mindformers is really painful to use.
By the way, shouldn't the torch code be rewritten in mindspore instead?
Just don't use streaming output. stream_chat spawns a new thread, and you need to call set_device inside that new thread; or simply use chat instead.
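For illustration, a minimal sketch of this workaround, assuming a torch_npu environment where model and gen_kwargs already exist as in the snippets above (run_generate is an illustrative name, not actual LLaMA-Factory code):

import torch
import torch_npu  # registers the torch.npu backend
from threading import Thread

def run_generate(model, gen_kwargs):
    # The NPU device context is per-thread: a freshly spawned thread has
    # none yet, so bind the device here before the first tensor operation.
    torch.npu.set_device("npu:0")
    model.generate(**gen_kwargs)

thread = Thread(target=run_generate, args=(model, gen_kwargs))
thread.start()
thread.join()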
@ZhuoranLyu Thanks for the reply. Following your suggestion, I started the API service with the stream parameter set to false, but the inference error is the same as before. Log below:
@ZhuoranLyu
I went back over the earlier docs: the device does need to be specified manually. For non-streaming inference, I added torch.npu.set_device("npu:0") at line 135 of app.py and inference now works. For streaming inference, I changed line 135 of chat_model.py to:
def generate_with_npu_setting(**kwargs):
    # bind the NPU inside the generation thread before calling generate
    torch.npu.set_device("npu:0")
    self.model.generate(**kwargs)

thread = Thread(target=generate_with_npu_setting, kwargs=gen_kwargs)
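The underlying reason appears to be that the ACL/NPU device context is per-thread: a set_device done in the main thread does not carry over to the thread spawned for streaming generation, which is why the error in the log above surfaces only inside Thread-7.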
Thanks for the reminder. I will close this issue.
Reminder
Reproduction
System and version:
Run arguments:
Log
[INFO|tokenization_utils_base.py:2025] 2024-01-31 06:52:54,128 >> loading file qwen.tiktoken
[INFO|tokenization_utils_base.py:2025] 2024-01-31 06:52:54,128 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-01-31 06:52:54,128 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-01-31 06:52:54,128 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-01-31 06:52:54,128 >> loading file tokenizer.json
[INFO|configuration_utils.py:727] 2024-01-31 06:52:54,786 >> loading configuration file /mnt/sdc/models/Qwen-1_8B-Chat/config.json
[INFO|configuration_utils.py:727] 2024-01-31 06:52:54,788 >> loading configuration file /mnt/sdc/models/Qwen-1_8B-Chat/config.json
[INFO|configuration_utils.py:792] 2024-01-31 06:52:54,789 >> Model config QWenConfig {
  "_name_or_path": "/mnt/sdc/models/Qwen-1_8B-Chat/",
  "architectures": [
    "QWenLMHeadModel"
  ],
  "attn_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_qwen.QWenConfig",
    "AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
  },
  "bf16": false,
  "emb_dropout_prob": 0.0,
  "fp16": false,
  "fp32": false,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "kv_channels": 128,
  "layer_norm_epsilon": 1e-06,
  "max_position_embeddings": 8192,
  "model_type": "qwen",
  "no_bias": true,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "onnx_safe": null,
  "rotary_emb_base": 10000,
  "rotary_pct": 1.0,
  "scale_attn_weights": true,
  "seq_length": 8192,
  "softmax_in_fp32": false,
  "tie_word_embeddings": false,
  "tokenizer_class": "QWenTokenizer",
  "transformers_version": "4.37.2",
  "use_cache": true,
  "use_cache_kernel": false,
  "use_cache_quantization": false,
  "use_dynamic_ntk": true,
  "use_flash_attn": "auto",
  "use_logn_attn": true,
  "vocab_size": 151936
}

[INFO|modeling_utils.py:3473] 2024-01-31 06:52:54,831 >> loading weights file /mnt/sdc/models/Qwen-1_8B-Chat/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-01-31 06:52:54,831 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:826] 2024-01-31 06:52:54,832 >> Generate config GenerationConfig {}
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.63it/s]
[INFO|modeling_utils.py:4350] 2024-01-31 06:52:56,534 >> All model checkpoint weights were used when initializing QWenLMHeadModel.
[INFO|modeling_utils.py:4358] 2024-01-31 06:52:56,534 >> All the weights of QWenLMHeadModel were initialized from the model checkpoint at /mnt/sdc/models/Qwen-1_8B-Chat/. If your task is similar to the task the model of the checkpoint was trained on, you can already use QWenLMHeadModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-01-31 06:52:56,537 >> loading configuration file /mnt/sdc/models/Qwen-1_8B-Chat/generation_config.json
[INFO|configuration_utils.py:826] 2024-01-31 06:52:56,538 >> Generate config GenerationConfig {
  "chat_format": "chatml",
  "do_sample": true,
  "eos_token_id": 151643,
  "max_new_tokens": 512,
  "max_window_size": 6144,
  "pad_token_id": 151643,
  "repetition_penalty": 1.1,
  "top_k": 0,
  "top_p": 0.8
}

01/31/2024 06:52:56 - INFO - llmtuner.model.adapter - Adapter is not found at evaluation, load the base model.
01/31/2024 06:52:56 - INFO - llmtuner.model.loader - trainable params: 0 || all params: 1836828672 || trainable%: 0.0000
01/31/2024 06:52:56 - INFO - llmtuner.model.loader - This IS expected that the trainable params is 0 if you are using model for inference only.
01/31/2024 06:53:07 - INFO - llmtuner.data.template - Add eos token: <|endoftext|>
01/31/2024 06:53:07 - INFO - llmtuner.data.template - Add pad token: <|endoftext|>
01/31/2024 06:53:07 - INFO - llmtuner.data.template - Replace eos token: <|im_end|>
Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.
User: nihao
Assistant: Exception in thread Thread-7:
Traceback (most recent call last):
  File "/root/anaconda3/envs/mind/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/envs/mind/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/mind/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/mind/lib/python3.9/site-packages/transformers/generation/utils.py", line 1349, in generate
    model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
  File "/root/anaconda3/envs/mind/lib/python3.9/site-packages/transformers/generation/utils.py", line 449, in _prepare_attention_mask_for_generation
    is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)
  File "/root/anaconda3/envs/mind/lib/python3.9/site-packages/torch/_tensor.py", line 1059, in __contains__
    return (element == self).any().item()  # type: ignore[union-attr]
RuntimeError: allocate:/usr1/02/workspace/j_vqN6BFvg/pytorch/torch_npu/csrc/core/npu/NPUCachingAllocator.cpp:1406 NPU error, error code is 107002
[Error]: The context is empty.
Check whether acl.rt.set_context or acl.rt.set_device is called.
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4290]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Traceback (most recent call last):
  File "/mnt/sdc/projects/LLaMA-Factory/src/cli_demo.py", line 49, in

Expected behavior
Inference content should be output normally.
The blog post I referred to says the NPU has to be specified manually, but I don't know where to do so. I tried manually specifying the NPU at line 135 of src/llmtuner/extras/misc.py, but it had no effect.
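For reference, a minimal sketch of that kind of manual device selection at process startup, assuming torch_npu is installed; where exactly to place it in LLaMA-Factory is what the discussion above works out:

import torch
import torch_npu  # registers the torch.npu backend

# Bind the current (main) thread to the first NPU before the model loads.
if torch.npu.is_available():
    torch.npu.set_device("npu:0")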
System Info
transformers version: 4.37.2
Others
Single-card training works fine.
Log
01/31/2024 06:50:43 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
[INFO|training_args.py:1828] 2024-01-31 06:50:43,618 >> PyTorch: setting up devices
/root/anaconda3/envs/mind/lib/python3.9/site-packages/transformers/training_args.py:1741: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
01/31/2024 06:50:43 - INFO - llmtuner.hparams.parser - Process rank: 0, device: npu:0, n_gpu: 1 distributed training: True, compute dtype: torch.float16
01/31/2024 06:50:43 - INFO - llmtuner.hparams.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
  _n_gpu=1,
  adafactor=False,
  adam_beta1=0.9,
  adam_beta2=0.999,
  adam_epsilon=1e-08,
  auto_find_batch_size=False,
  bf16=False,
  bf16_full_eval=False,
  data_seed=None,
  dataloader_drop_last=False,
  dataloader_num_workers=0,
  dataloader_persistent_workers=False,
  dataloader_pin_memory=True,
  ddp_backend=None,
  ddp_broadcast_buffers=None,
  ddp_bucket_cap_mb=None,
  ddp_find_unused_parameters=False,
  ddp_timeout=1800,
  debug=[],
  deepspeed=None,
  disable_tqdm=False,
  dispatch_batches=None,
  do_eval=False,
  do_predict=False,
  do_train=True,
  eval_accumulation_steps=None,
  eval_delay=0,
  eval_steps=None,
  evaluation_strategy=no,
  fp16=True,
  fp16_backend=auto,
  fp16_full_eval=False,
  fp16_opt_level=O1,
  fsdp=[],
  fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
  fsdp_min_num_params=0,
  fsdp_transformer_layer_cls_to_wrap=None,
  full_determinism=False,
  generation_config=None,
  generation_max_length=None,
  generation_num_beams=None,
  gradient_accumulation_steps=4,
  gradient_checkpointing=False,
  gradient_checkpointing_kwargs=None,
  greater_is_better=None,
  group_by_length=False,
  half_precision_backend=auto,
  hub_always_push=False,
  hub_model_id=None,
  hub_private_repo=False,
  hub_strategy=every_save,
  hub_token=