modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

RuntimeError: Sync:torch_npu/csrc/framework/OpCommand.cpp:190 NPU error, error code is 507015 (fine-tuning Qwen1.5/Qwen2 fails on Ascend 910B) #1243

Closed: aoyinke closed this issue 4 months ago

aoyinke commented 4 months ago

Environment

Hardware: ModelArts, 8x Ascend 910B, CANN 7.0.0beta1. Software: transformers 4.41.2, ms-swift 2.2.0.dev0 (/home/ma-user/work/gengyichen/swift)

Attempt 1

torch 2.3.1 torch-npu 2.2.0

Command:

```shell
ASCEND_RT_VISIBLE_DEVICES=1 \
swift sft \
    --model_id_or_path /home/ma-user/work/gengyichen/models/Qwen2-7B \
    --model_type qwen2-7b \
    --dataset alpaca-zh#500 alpaca-en#500 kmind-cognition#500 \
    --num_train_epochs 5 \
    --sft_type lora \
    --output_dir output
```

Error:

```
Traceback (most recent call last):
  File "/home/ma-user/work/gengyichen/swift/swift/cli/sft.py", line 2, in <module>
    from swift.llm import sft_main
  File "/home/ma-user/work/gengyichen/swift/swift/llm/__init__.py", line 5, in <module>
    from .utils import *
  File "/home/ma-user/work/gengyichen/swift/swift/llm/utils/__init__.py", line 2, in <module>
    from .argument import (AppUIArguments, DeployArguments, EvalArguments, ExportArguments, InferArguments, RLHFArguments,
  File "/home/ma-user/work/gengyichen/swift/swift/llm/utils/argument.py", line 23, in <module>
    from swift.trainers import Seq2SeqTrainingArguments
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/home/ma-user/work/gengyichen/swift/swift/utils/import_utils.py", line 54, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/ma-user/work/gengyichen/swift/swift/utils/import_utils.py", line 66, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import swift.trainers.arguments because of the following error (look up to see its traceback):
cannot import name 'TorchVariable' from 'torch._dynamo.variables.torch' (/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/_dynamo/variables/torch.py)
```
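The `cannot import name 'TorchVariable'` failure usually indicates that the installed torch and torch_npu builds do not match each other (or the version swift expects), rather than a problem in the training code itself. A small sanity check that may help (these commands are an assumption, not taken from the report) is to print the versions that are actually active in the environment:

```bash
# Assumed diagnostic: run inside the swift-npu conda environment.
# torch and torch_npu are expected to come from the same release line
# (for example 2.1.x with 2.1.x); a mismatch commonly breaks torch._dynamo imports.
python -c "import torch, torch_npu; print('torch:', torch.__version__, 'torch_npu:', torch_npu.__version__)"
pip show ms-swift transformers | grep -E '^(Name|Version)'
```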

Attempt 2

torch 2.1.0 torch-npu 2.1.0

Command:

```shell
ASCEND_RT_VISIBLE_DEVICES=1 \
swift sft \
    --model_id_or_path /home/ma-user/work/gengyichen/models/Qwen2-7B \
    --model_type qwen2-7b \
    --dataset alpaca-zh#500 alpaca-en#500 kmind-cognition#500 \
    --num_train_epochs 5 \
    --sft_type lora \
    --output_dir output
```

Error:

```
Traceback (most recent call last):
  File "/home/ma-user/work/gengyichen/swift/swift/cli/sft.py", line 2, in <module>
    from swift.llm import sft_main
  File "/home/ma-user/work/gengyichen/swift/swift/llm/__init__.py", line 5, in <module>
    from .utils import *
  File "/home/ma-user/work/gengyichen/swift/swift/llm/utils/__init__.py", line 2, in <module>
    from .argument import (AppUIArguments, DeployArguments, EvalArguments, ExportArguments, InferArguments, RLHFArguments,
  File "/home/ma-user/work/gengyichen/swift/swift/llm/utils/argument.py", line 24, in <module>
    from swift.tuners import Swift
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/home/ma-user/work/gengyichen/swift/swift/utils/import_utils.py", line 54, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/ma-user/work/gengyichen/swift/swift/utils/import_utils.py", line 66, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import swift.tuners.base because of the following error (look up to see its traceback):
cannot import name 'packaging' from 'pkg_resources' (/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/pkg_resources/__init__.py)
```
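The `cannot import name 'packaging' from 'pkg_resources'` failure is typically an environment problem rather than an ms-swift one: recent setuptools releases removed the vendored `pkg_resources.packaging` module that some dependencies still import. A commonly suggested workaround (an assumption, not verified in this thread) is to pin an older setuptools:

```bash
# Assumed workaround: setuptools >= 70 dropped pkg_resources.packaging,
# which breaks packages that still do `from pkg_resources import packaging`.
pip install "setuptools<70"
```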

Attempt 3

torch 2.2.0 torch_npu 2.2.0

1. Qwen1.5 and Qwen2 both run inference normally.
2. The original Qwen can also be fine-tuned.
3. But fine-tuning Qwen1.5 and Qwen2 fails with the error below.

Command:

```shell
ASCEND_RT_VISIBLE_DEVICES=1 \
swift sft \
    --model_id_or_path /home/ma-user/work/gengyichen/models/Qwen2-7B \
    --model_type qwen2-7b \
    --dataset alpaca-zh#500 alpaca-en#500 kmind-cognition#500 \
    --num_train_epochs 5 \
    --sft_type lora \
    --output_dir output
```

Error:

```
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
Train: 0%| | 0/465 [00:00<?, ?it/s][W AclInterface.cpp:181] Warning: 0Failed to find function aclrtCreateEventExWithFlag (function operator())
[E OpParamMaker.cpp:273] call aclnnMaskedSelect failed, detail:EZ9999: Inner Error!
EZ9999 The error from device(chipId:0, dieId:0), serial number is 12, there is an fftsplus aivector error exception, core id is 1, error code = 0, dump info: pc start: 0x1245ddfd6a04, current: 0x1245ddfddf80, vec error info: 0x77000000bd, mte error info: 0x50ac22889, ifu error info: 0x3302628b2e880, ccu error info: 0x67b12e0050b223f9, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100344080.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1164]
TraceBack (most recent call last):
The extend info: errcode:(0, 0x800, 0) errorStr: The UB address accessed by the VEC instruction is not aligned. fixp_error0 info: 0xac22889, fixp_error1 info: 0x5 fsmId:0, tslot:3, thread:0, ctxid:0, blk:3, sublk:1, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1176]
Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1677]
AICORE Kernel task happen error, retCode=0x26.[FUNC:GetError][FILE:stream.cc][LINE:1454]
Aicore kernel execute failed, device_id=0, stream_id=2, report_stream_id=2, task_id=29509, flip_num=0, fault kernel_name=FlashAttentionScore_213e8781b9323773f21103b53d6e8517_high_performance_10000000000000203943_mix_aic, program id=25, hash=3610081123233066155.[FUNC:GetError][FILE:stream.cc][LINE:1454]
[AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1454]
rtStreamSynchronize execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_messagemanage.cc][LINE:50]
Assert ((rtStreamSynchronize(stream)) == 0) failed
Assert ((extInfoHandle->UpdateOutputShapeFromExtInfo(outputs_, stream)) == OK) failed
launch failed for MaskedSelect, errno:361001.

[ERROR] 2024-06-27-11:27:18 (PID:2486090, Device:0, RankID:-1) ERR01005 OPS internal error Exception raised from operator() at third_party/op-plugin/op_plugin/ops/base_ops/opapi/MaskedSelectKernelNpuOpApi.cpp:49 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x68 (0xfffef7fa8538 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const, char const, unsigned int, std::string const&) + 0x6c (0xfffef7f558a0 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/lib/libc10.so) frame #2: + 0xa78b90 (0xfffd0bbafb90 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so) frame #3: + 0xe2696c (0xfffd0bf5d96c in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so) frame #4: + 0x56b9f0 (0xfffd0b6a29f0 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so) frame #5: + 0x56be18 (0xfffd0b6a2e18 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so) frame #6: + 0x569e20 (0xfffd0b6a0e20 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so) frame #7: + 0xafe0c (0xfffef7fdae0c in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/lib/libc10.so) frame #8: + 0x7a80 (0xffffba76ba80 in /lib64/libpthread.so.0) frame #9: + 0xe4d0c (0xffffba59fd0c in /lib64/libc.so.6)

Traceback (most recent call last):
  File "/home/ma-user/work/gengyichen/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/home/ma-user/work/gengyichen/swift/swift/utils/run_utils.py", line 27, in x_main
    result = llm_x(args, **kwargs)
  File "/home/ma-user/work/gengyichen/swift/swift/llm/sft.py", line 310, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "/home/ma-user/work/gengyichen/swift/swift/trainers/mixin.py", line 517, in train
    res = super().train(resume_from_checkpoint, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ma-user/work/gengyichen/swift/swift/trainers/trainers.py", line 231, in compute_loss
    acc = (torch.masked_select(preds, masks) == torch.masked_select(labels, masks)).float().mean()
RuntimeError: The Inner error is reported as above. Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[ERROR] 2024-06-27-11:27:18 (PID:2486090, Device:0, RankID:-1) ERR00100 PTA call acl api failed
```

After setting the environment variable `export ASCEND_LAUNCH_BLOCKING=1`, the error is:

```
[2024-06-27 11:39:22,652] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to npu (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
Train: 0%| | 0/465 [00:00<?, ?it/s][W AclInterface.cpp:181] Warning: 0Failed to find function aclrtCreateEventExWithFlag (function operator())
Traceback (most recent call last): File "/home/ma-user/work/gengyichen/swift/swift/cli/sft.py", line 5, in sft_main() File "/home/ma-user/work/gengyichen/swift/swift/utils/run_utils.py", line 27, in x_main result = llm_x(args, kwargs) File "/home/ma-user/work/gengyichen/swift/swift/llm/sft.py", line 310, in llm_sft trainer.train(training_args.resume_from_checkpoint) File "/home/ma-user/work/gengyichen/swift/swift/trainers/mixin.py", line 517, in train res = super().train(resume_from_checkpoint, args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train return inner_training_loop( File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop tr_loss_step = self.training_step(model, inputs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 3238, in training_step loss = self.compute_loss(model, inputs) File "/home/ma-user/work/gengyichen/swift/swift/trainers/trainers.py", line 183, in compute_loss outputs = model(inputs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward return model_forward(*args, *kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in call return convert_to_fp32(self.model_forward(args, kwargs)) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast return func(*args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/peft/peft_model.py", line 1430, in forward return self.base_model( File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 179, in forward return self.model.forward(*args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1149, in forward outputs = self.model( File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File 
"/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1024, in forward layer_outputs = self._gradient_checkpointing_func( File "/home/ma-user/work/gengyichen/swift/swift/llm/utils/model.py", line 4836, in _old_checkpoint(*args, use_reentrant=use_reentrant, kwargs)) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner return torch._dynamo.disable(fn, recursive)(*args, *kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn return fn(args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner return fn(*args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint ret = function(*args, *kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 748, in forward hidden_states, self_attn_weights, present_key_value = self.self_attn( File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, kwargs) File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 679, in forward attn_output = torch.nn.functional.scaled_dot_product_attention( RuntimeError: Sync:torch_npu/csrc/framework/OpCommand.cpp:190 NPU error, error code is 507015 [ERROR] 2024-06-27-11:39:29 (PID:2502349, Device:0, RankID:-1) ERR00100 PTA call acl api failed [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. EZ9999: Inner Error! EZ9999 The error from device(chipId:0, dieId:0), serial number is 14, there is an fftsplus aivector error exception, core id is 40, error code = 0, dump info: pc start: 0x1245ddfd6a04, current: 0x1245ddfddef4, vec error info: 0x77000000bd, mte error info: 0xeeefb3eb7f, ifu error info: 0x671dbfb03d2c0, ccu error info: 0xf041fdd099ae040, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100140080.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1164] TraceBack (most recent call last): The extend info: errcode:(0, 0x800, 0) errorStr: The UB address accessed by the VEC instruction is not aligned. 
fixp_error0 info: 0xfb3eb7f, fixp_error1 info: 0xee fsmId:0, tslot:3, thread:0, ctxid:0, blk:18, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1176] Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1677] AICORE Kernel task happen error, retCode=0x26.[FUNC:GetError][FILE:stream.cc][LINE:1454] Aicore kernel execute failed, device_id=0, stream_id=6, report_stream_id=6, task_id=29509, flip_num=0, fault kernel_name=FlashAttentionScore_213e8781b9323773f21103b53d6e8517_high_performance_10000000000000203943_mix_aic, program id=25, hash=16361232828947559013.[FUNC:GetError][FILE:stream.cc][LINE:1454] [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1454] rtStreamSynchronizeWithTimeout execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50] synchronize stream failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

[W NPUStream.cpp:382] Warning: NPU warning, error code is 507015[Error]: [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. EH9999: Inner Error! rtDeviceSynchronize execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50] EH9999 wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161] TraceBack (most recent call last): (function npuSynchronizeUsedDevices) EH9999: Inner Error! rtEventQueryWaitStatus execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50] EH9999 [Query][Status]query event wait-status failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161] TraceBack (most recent call last):

[W NPUStream.cpp:365] Warning: NPU warning, error code is 507015[Error]: [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. EH9999: Inner Error! rtDeviceSynchronize execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50] EH9999 wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161] TraceBack (most recent call last): (function npuSynchronizeDevice) Train: 0%| | 0/465 [00:09<?, ?it/s]
```
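Both dumps above ultimately blame NPU operators (the aclnnMaskedSelect launch in swift's accuracy computation, and the FlashAttentionScore kernel invoked through scaled_dot_product_attention) rather than Python-level code. A minimal isolation sketch, purely an assumption and not part of the original report, is to run those two operators on their own on the NPU; if they already raise error 507015, the fault lies in the CANN / torch_npu stack rather than in ms-swift:

```bash
# Hypothetical reproduction script (requires a visible Ascend NPU and torch_npu installed).
python - <<'EOF'
import torch
import torch_npu  # registers the "npu" device with PyTorch

# masked_select: the op that fails in swift's compute_loss accuracy calculation
x = torch.randn(4, 8).npu()
mask = x > 0
print("masked_select ok:", torch.masked_select(x, mask).shape)

# scaled_dot_product_attention: the op behind the failing FlashAttentionScore kernel
q = torch.randn(1, 2, 16, 64, dtype=torch.float16).npu()
out = torch.nn.functional.scaled_dot_product_attention(q, q, q)
print("sdpa ok:", out.shape)
EOF
```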

Expected behavior

Qwen1.5 and Qwen2 can be fine-tuned normally.

aoyinke commented 4 months ago

I would also like to know the dependency relationship between the swift version and the torch / torch_npu versions; from what I can see there is clearly a required correspondence between them.

aoyinke commented 4 months ago

[screenshot] After rolling the swift version back to 2.1.0, fine-tuning runs normally, but during fine-tuning both the loss and the learning rate are 0.

Jintao-Huang commented 4 months ago

Qwen2 should be fine-tuned with bf16. You can try Qwen1.5 first.
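For reference, a minimal sketch of what that suggestion could look like on the command line, assuming the installed swift version supports the `--dtype` flag (check `swift sft --help`); this exact command is not from the thread:

```bash
# Assumed example: the reporter's LoRA command with bf16 forced via --dtype.
ASCEND_RT_VISIBLE_DEVICES=1 \
swift sft \
    --model_id_or_path /home/ma-user/work/gengyichen/models/Qwen2-7B \
    --model_type qwen2-7b \
    --dataset alpaca-zh#500 alpaca-en#500 kmind-cognition#500 \
    --num_train_epochs 5 \
    --sft_type lora \
    --dtype bf16 \
    --output_dir output
```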

aoyinke commented 4 months ago

> Qwen2 should be fine-tuned with bf16. You can try Qwen1.5 first.

[screenshot] Hello! During LoRA fine-tuning the loss is always greater than 1. Is that normal?

Jintao-Huang commented 4 months ago

That's normal.

aoyinke commented 4 months ago

> That's normal.

OK, thank you very much for your answer!

cailigao commented 2 months ago

> Qwen2 should be fine-tuned with bf16. You can try Qwen1.5 first.

Has the problem of the loss being 0 when training Qwen2 been fixed by now?