What about single-card inference? Which environment variable should be set?
Looking at the code, the environment variable it reads is LOCAL_RANK. I tried that too, with no effect. Manually specifying "npu:0" didn't work either.
It should be CUDA_VISIBLE_DEVICES.
Thanks for the reply. Below is the log after adding the environment variable. Command:
CUDA_VISIBLE_DEVICES=0 python src/cli_demo.py \
--model_name_or_path /mnt/sdc/models/Qwen-1_8B-Chat \
--template qwen
@hiyouga
Following #I6KS6A, I set torch_npu.npu.set_device('npu:0') before model dispatch, but to no avail.
import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_weight_path, device_map="npu:0", torch_dtype=torch.bfloat16, trust_remote_code=True)
Give this a try.
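(As background, and as general torch_npu behavior rather than anything specific to this repo: importing transfer_to_npu automatically redirects torch.cuda.* calls and "cuda" device strings to their NPU equivalents, so CUDA-oriented scripts can run on Ascend mostly unchanged.)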
@70557dzqc Still the same problem:
RuntimeError: allocate:/usr1/02/workspace/j_vqN6BFvg/pytorch/torch_npu/csrc/core/npu/NPUCachingAllocator.cpp:1406 NPU error, error code is 107002
[Error]: The context is empty.
Check whether acl.rt.set_context or acl.rt.set_device is called.
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4290]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Is there any solution to this? mindformers is really painful to use.
By the way, shouldn't the torch code be rewritten in mindspore instead?
Just don't use streaming output. stream_chat spawns a new thread, and you need to call set_device inside that new thread; or simply use chat instead.
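For illustration, a minimal sketch of this workaround, assuming a torch_npu environment where model and gen_kwargs already exist as in the snippets above (run_generate is an illustrative name, not actual LLaMA-Factory code):

import torch
import torch_npu  # registers the torch.npu backend
from threading import Thread

def run_generate(model, gen_kwargs):
    # The NPU device context is per-thread: a freshly spawned thread has
    # none yet, so bind the device here before the first tensor operation.
    torch.npu.set_device("npu:0")
    model.generate(**gen_kwargs)

thread = Thread(target=run_generate, args=(model, gen_kwargs))
thread.start()
thread.join()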
@ZhuoranLyu Thanks for the reply. Following your suggestion, I started the API service with the stream parameter set to false, but the inference error is the same as before. Log below:
@ZhuoranLyu
I went back over the earlier docs: the device does need to be specified manually. For non-streaming inference, I added torch.npu.set_device("npu:0") at line 135 of app.py and inference now works. For streaming inference, I changed line 135 of chat_model.py to:
def generate_with_npu_setting(**kwargs):
    # bind the NPU inside the generation thread before calling generate
    torch.npu.set_device("npu:0")
    self.model.generate(**kwargs)

thread = Thread(target=generate_with_npu_setting, kwargs=gen_kwargs)
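The underlying reason appears to be that the ACL/NPU device context is per-thread: a set_device done in the main thread does not carry over to the thread spawned for streaming generation, which is why the error in the log above surfaces only inside Thread-7.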
Thanks for the reminder. I will close this issue.
Reminder
Reproduction
System and version:
Run arguments:
Log
[INFO|tokenization_utils_base.py:2025] 2024-01-31 06:52:54,128 >> loading file qwen.tiktoken
[INFO|tokenization_utils_base.py:2025] 2024-01-31 06:52:54,128 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-01-31 06:52:54,128 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-01-31 06:52:54,128 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2024-01-31 06:52:54,128 >> loading file tokenizer.json
[INFO|configuration_utils.py:727] 2024-01-31 06:52:54,786 >> loading configuration file /mnt/sdc/models/Qwen-1_8B-Chat/config.json
[INFO|configuration_utils.py:727] 2024-01-31 06:52:54,788 >> loading configuration file /mnt/sdc/models/Qwen-1_8B-Chat/config.json
[INFO|configuration_utils.py:792] 2024-01-31 06:52:54,789 >> Model config QWenConfig {
  "_name_or_path": "/mnt/sdc/models/Qwen-1_8B-Chat/",
  "architectures": [
    "QWenLMHeadModel"
  ],
  "attn_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_qwen.QWenConfig",
    "AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
  },
  "bf16": false,
  "emb_dropout_prob": 0.0,
  "fp16": false,
  "fp32": false,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "kv_channels": 128,
  "layer_norm_epsilon": 1e-06,
  "max_position_embeddings": 8192,
  "model_type": "qwen",
  "no_bias": true,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "onnx_safe": null,
  "rotary_emb_base": 10000,
  "rotary_pct": 1.0,
  "scale_attn_weights": true,
  "seq_length": 8192,
  "softmax_in_fp32": false,
  "tie_word_embeddings": false,
  "tokenizer_class": "QWenTokenizer",
  "transformers_version": "4.37.2",
  "use_cache": true,
  "use_cache_kernel": false,
  "use_cache_quantization": false,
  "use_dynamic_ntk": true,
  "use_flash_attn": "auto",
  "use_logn_attn": true,
  "vocab_size": 151936
}

[INFO|modeling_utils.py:3473] 2024-01-31 06:52:54,831 >> loading weights file /mnt/sdc/models/Qwen-1_8B-Chat/model.safetensors.index.json
[INFO|modeling_utils.py:1426] 2024-01-31 06:52:54,831 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:826] 2024-01-31 06:52:54,832 >> Generate config GenerationConfig {}
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.63it/s]
[INFO|modeling_utils.py:4350] 2024-01-31 06:52:56,534 >> All model checkpoint weights were used when initializing QWenLMHeadModel.
[INFO|modeling_utils.py:4358] 2024-01-31 06:52:56,534 >> All the weights of QWenLMHeadModel were initialized from the model checkpoint at /mnt/sdc/models/Qwen-1_8B-Chat/. If your task is similar to the task the model of the checkpoint was trained on, you can already use QWenLMHeadModel for predictions without further training.
[INFO|configuration_utils.py:779] 2024-01-31 06:52:56,537 >> loading configuration file /mnt/sdc/models/Qwen-1_8B-Chat/generation_config.json
[INFO|configuration_utils.py:826] 2024-01-31 06:52:56,538 >> Generate config GenerationConfig {
  "chat_format": "chatml",
  "do_sample": true,
  "eos_token_id": 151643,
  "max_new_tokens": 512,
  "max_window_size": 6144,
  "pad_token_id": 151643,
  "repetition_penalty": 1.1,
  "top_k": 0,
  "top_p": 0.8
}

01/31/2024 06:52:56 - INFO - llmtuner.model.adapter - Adapter is not found at evaluation, load the base model.
01/31/2024 06:52:56 - INFO - llmtuner.model.loader - trainable params: 0 || all params: 1836828672 || trainable%: 0.0000
01/31/2024 06:52:56 - INFO - llmtuner.model.loader - This IS expected that the trainable params is 0 if you are using model for inference only.
01/31/2024 06:53:07 - INFO - llmtuner.data.template - Add eos token: <|endoftext|>
01/31/2024 06:53:07 - INFO - llmtuner.data.template - Add pad token: <|endoftext|>
01/31/2024 06:53:07 - INFO - llmtuner.data.template - Replace eos token: <|im_end|>
Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.
User: nihao
Assistant: Exception in thread Thread-7:
Traceback (most recent call last):
  File "/root/anaconda3/envs/mind/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/envs/mind/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/mind/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/mind/lib/python3.9/site-packages/transformers/generation/utils.py", line 1349, in generate
    model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
  File "/root/anaconda3/envs/mind/lib/python3.9/site-packages/transformers/generation/utils.py", line 449, in _prepare_attention_mask_for_generation
    is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)
  File "/root/anaconda3/envs/mind/lib/python3.9/site-packages/torch/_tensor.py", line 1059, in __contains__
    return (element == self).any().item()  # type: ignore[union-attr]
RuntimeError: allocate:/usr1/02/workspace/j_vqN6BFvg/pytorch/torch_npu/csrc/core/npu/NPUCachingAllocator.cpp:1406 NPU error, error code is 107002
[Error]: The context is empty.
Check whether acl.rt.set_context or acl.rt.set_device is called.
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4290]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Traceback (most recent call last):
  File "/mnt/sdc/projects/LLaMA-Factory/src/cli_demo.py", line 49, in

Expected behavior
Inference content should be output normally.
The blog post I referred to says the NPU has to be specified manually, but I don't know where to do so. I tried manually specifying the NPU at line 135 of src/llmtuner/extras/misc.py, but it had no effect.
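For reference, a minimal sketch of that kind of manual device selection at process startup, assuming torch_npu is installed; where exactly to place it in LLaMA-Factory is what the discussion above works out:

import torch
import torch_npu  # registers the torch.npu backend

# Bind the current (main) thread to the first NPU before the model loads.
if torch.npu.is_available():
    torch.npu.set_device("npu:0")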
System Info
transformers version: 4.37.2
Others
Single-card training works fine.
Log
01/31/2024 06:50:43 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
[INFO|training_args.py:1828] 2024-01-31 06:50:43,618 >> PyTorch: setting up devices
/root/anaconda3/envs/mind/lib/python3.9/site-packages/transformers/training_args.py:1741: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
01/31/2024 06:50:43 - INFO - llmtuner.hparams.parser - Process rank: 0, device: npu:0, n_gpu: 1 distributed training: True, compute dtype: torch.float16
01/31/2024 06:50:43 - INFO - llmtuner.hparams.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
  _n_gpu=1,
  adafactor=False,
  adam_beta1=0.9,
  adam_beta2=0.999,
  adam_epsilon=1e-08,
  auto_find_batch_size=False,
  bf16=False,
  bf16_full_eval=False,
  data_seed=None,
  dataloader_drop_last=False,
  dataloader_num_workers=0,
  dataloader_persistent_workers=False,
  dataloader_pin_memory=True,
  ddp_backend=None,
  ddp_broadcast_buffers=None,
  ddp_bucket_cap_mb=None,
  ddp_find_unused_parameters=False,
  ddp_timeout=1800,
  debug=[],
  deepspeed=None,
  disable_tqdm=False,
  dispatch_batches=None,
  do_eval=False,
  do_predict=False,
  do_train=True,
  eval_accumulation_steps=None,
  eval_delay=0,
  eval_steps=None,
  evaluation_strategy=no,
  fp16=True,
  fp16_backend=auto,
  fp16_full_eval=False,
  fp16_opt_level=O1,
  fsdp=[],
  fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
  fsdp_min_num_params=0,
  fsdp_transformer_layer_cls_to_wrap=None,
  full_determinism=False,
  generation_config=None,
  generation_max_length=None,
  generation_num_beams=None,
  gradient_accumulation_steps=4,
  gradient_checkpointing=False,
  gradient_checkpointing_kwargs=None,
  greater_is_better=None,
  group_by_length=False,
  half_precision_backend=auto,
  hub_always_push=False,
  hub_model_id=None,
  hub_private_repo=False,
  hub_strategy=every_save,
  hub_token=