I would also like to know the dependency between the swift version and the torch / torch_npu versions; from what I can see there is clearly a required correspondence between them.
After rolling swift back to 2.1.0 I can run fine-tuning without crashing, but during training both the loss and the learning rate are 0.
qwen2 needs bf16 for fine-tuning; you can try qwen1.5 first.
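Concretely, that means pinning the dtype to bf16 in the sft command. A minimal sketch based on the command used later in this issue (the `--dtype bf16` flag is an assumption about the swift CLI, not a flag quoted elsewhere in this thread):

```shell
ASCEND_RT_VISIBLE_DEVICES=1 \
swift sft \
    --model_id_or_path /home/ma-user/work/gengyichen/models/Qwen2-7B \
    --model_type qwen2-7b \
    --dataset alpaca-zh#500 alpaca-en#500 \
    --sft_type lora \
    --dtype bf16 \
    --output_dir output
```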
Hi! During LoRA fine-tuning the loss is always greater than 1. Is that normal?
That's normal.
Great, thank you very much for the explanation!
Has the issue of the loss being 0 when training qwen2 been fixed yet?
Environment
Hardware: ModelArts, 910B, 8 cards, CANN 7.0.0beta1
Software: transformers 4.41.2, ms-swift 2.2.0.dev0 (/home/ma-user/work/gengyichen/swift)
First version combination
torch 2.3.1 torch-npu 2.2.0
Run the following command:

```shell
ASCEND_RT_VISIBLE_DEVICES=1 \
swift sft \
    --model_id_or_path /home/ma-user/work/gengyichen/models/Qwen2-7B \
    --model_type qwen2-7b \
    --dataset alpaca-zh#500 alpaca-en#500 kmind-cognition#500 \
    --num_train_epochs 5 \
    --sft_type lora \
    --output_dir output
```
The error is as follows:

```
Traceback (most recent call last):
  File "/home/ma-user/work/gengyichen/swift/swift/cli/sft.py", line 2, in <module>
    from swift.llm import sft_main
  File "/home/ma-user/work/gengyichen/swift/swift/llm/__init__.py", line 5, in <module>
    from .utils import *
  File "/home/ma-user/work/gengyichen/swift/swift/llm/utils/__init__.py", line 2, in <module>
    from .argument import (AppUIArguments, DeployArguments, EvalArguments, ExportArguments, InferArguments, RLHFArguments,
  File "/home/ma-user/work/gengyichen/swift/swift/llm/utils/argument.py", line 23, in <module>
    from swift.trainers import Seq2SeqTrainingArguments
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/home/ma-user/work/gengyichen/swift/swift/utils/import_utils.py", line 54, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/ma-user/work/gengyichen/swift/swift/utils/import_utils.py", line 66, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import swift.trainers.arguments because of the following error (look up to see its traceback):
cannot import name 'TorchVariable' from 'torch._dynamo.variables.torch' (/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/_dynamo/variables/torch.py)
```
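For what it's worth, this first failure looks like the torch / torch_npu pair being out of step (torch 2.3.1 with torch-npu 2.2.0): torch_npu builds generally expect the matching torch minor version, and `TorchVariable` no longer exists in newer torch releases. A quick, swift-independent sanity check of what the environment actually has (a sketch, nothing more):

```python
# If torch and torch_npu come from different release lines (e.g. 2.3.x vs 2.2.x),
# the import below can itself fail; the two versions should match.
import torch
import torch_npu

print("torch:", torch.__version__)
print("torch_npu:", torch_npu.__version__)
```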
Second version combination
torch 2.1.0 torch-npu 2.1.0
Run the command:

```shell
ASCEND_RT_VISIBLE_DEVICES=1 \
swift sft \
    --model_id_or_path /home/ma-user/work/gengyichen/models/Qwen2-7B \
    --model_type qwen2-7b \
    --dataset alpaca-zh#500 alpaca-en#500 kmind-cognition#500 \
    --num_train_epochs 5 \
    --sft_type lora \
    --output_dir output
```
The error is as follows:

```
Traceback (most recent call last):
  File "/home/ma-user/work/gengyichen/swift/swift/cli/sft.py", line 2, in <module>
    from swift.llm import sft_main
  File "/home/ma-user/work/gengyichen/swift/swift/llm/__init__.py", line 5, in <module>
    from .utils import *
  File "/home/ma-user/work/gengyichen/swift/swift/llm/utils/__init__.py", line 2, in <module>
    from .argument import (AppUIArguments, DeployArguments, EvalArguments, ExportArguments, InferArguments, RLHFArguments,
  File "/home/ma-user/work/gengyichen/swift/swift/llm/utils/argument.py", line 24, in <module>
    from swift.tuners import Swift
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/home/ma-user/work/gengyichen/swift/swift/utils/import_utils.py", line 54, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/ma-user/work/gengyichen/swift/swift/utils/import_utils.py", line 66, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import swift.tuners.base because of the following error (look up to see its traceback):
cannot import name 'packaging' from 'pkg_resources' (/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/pkg_resources/__init__.py)
```
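The second failure ("cannot import name 'packaging' from 'pkg_resources'") is usually not a torch issue at all: recent setuptools releases removed the vendored `pkg_resources.packaging` module that an older dependency in this stack still imports. A possible workaround, offered as an assumption rather than something verified on this machine:

```shell
# pkg_resources.packaging was dropped from newer setuptools releases;
# pinning setuptools below 70 restores the old vendored module.
pip install "setuptools<70"
```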
Third case
torch 2.2.0 torch_npu 2.2.0
Run the command:

```shell
ASCEND_RT_VISIBLE_DEVICES=1 \
swift sft \
    --model_id_or_path /home/ma-user/work/gengyichen/models/Qwen2-7B \
    --model_type qwen2-7b \
    --dataset alpaca-zh#500 alpaca-en#500 kmind-cognition#500 \
    --num_train_epochs 5 \
    --sft_type lora \
    --output_dir output
```
The error is as follows:

```
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
Train: 0%| | 0/465 [00:00<?, ?it/s]
[W AclInterface.cpp:181] Warning: 0Failed to find function aclrtCreateEventExWithFlag (function operator())
[E OpParamMaker.cpp:273] call aclnnMaskedSelect failed, detail:EZ9999: Inner Error!
EZ9999 The error from device(chipId:0, dieId:0), serial number is 12, there is an fftsplus aivector error exception, core id is 1, error code = 0, dump info: pc start: 0x1245ddfd6a04, current: 0x1245ddfddf80, vec error info: 0x77000000bd, mte error info: 0x50ac22889, ifu error info: 0x3302628b2e880, ccu error info: 0x67b12e0050b223f9, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100344080.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1164]
TraceBack (most recent call last):
The extend info: errcode:(0, 0x800, 0) errorStr: The UB address accessed by the VEC instruction is not aligned. fixp_error0 info: 0xac22889, fixp_error1 info: 0x5 fsmId:0, tslot:3, thread:0, ctxid:0, blk:3, sublk:1, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1176]
Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1677]
AICORE Kernel task happen error, retCode=0x26.[FUNC:GetError][FILE:stream.cc][LINE:1454]
Aicore kernel execute failed, device_id=0, stream_id=2, report_stream_id=2, task_id=29509, flip_num=0, fault kernel_name=FlashAttentionScore_213e8781b9323773f21103b53d6e8517_high_performance_10000000000000203943_mix_aic, program id=25, hash=3610081123233066155.[FUNC:GetError][FILE:stream.cc][LINE:1454]
[AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1454]
rtStreamSynchronize execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
Assert ((rtStreamSynchronize(stream)) == 0) failed
Assert ((extInfoHandle->UpdateOutputShapeFromExtInfo(outputs_, stream)) == OK) failed
launch failed for MaskedSelect, errno:361001.
[ERROR] 2024-06-27-11:27:18 (PID:2486090, Device:0, RankID:-1) ERR01005 OPS internal error
Exception raised from operator() at third_party/op-plugin/op_plugin/ops/base_ops/opapi/MaskedSelectKernelNpuOpApi.cpp:49 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x68 (0xfffef7fa8538 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x6c (0xfffef7f558a0 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0xa78b90 (0xfffd0bbafb90 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: <unknown function> + 0xe2696c (0xfffd0bf5d96c in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: <unknown function> + 0x56b9f0 (0xfffd0b6a29f0 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: <unknown function> + 0x56be18 (0xfffd0b6a2e18 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: <unknown function> + 0x569e20 (0xfffd0b6a0e20 in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: <unknown function> + 0xafe0c (0xfffef7fdae0c in /home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x7a80 (0xffffba76ba80 in /lib64/libpthread.so.0)
frame #9: <unknown function> + 0xe4d0c (0xffffba59fd0c in /lib64/libc.so.6)
Traceback (most recent call last):
File "/home/ma-user/work/gengyichen/swift/swift/cli/sft.py", line 5, in <module>
sft_main()
File "/home/ma-user/work/gengyichen/swift/swift/utils/run_utils.py", line 27, in x_main
result = llm_x(args, **kwargs)
File "/home/ma-user/work/gengyichen/swift/swift/llm/sft.py", line 310, in llm_sft
trainer.train(training_args.resume_from_checkpoint)
File "/home/ma-user/work/gengyichen/swift/swift/trainers/mixin.py", line 517, in train
res = super().train(resume_from_checkpoint, *args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 3238, in training_step
loss = self.compute_loss(model, inputs)
File "/home/ma-user/work/gengyichen/swift/swift/trainers/trainers.py", line 231, in compute_loss
acc = (torch.masked_select(preds, masks) == torch.masked_select(labels, masks)).float().mean()
RuntimeError: The Inner error is reported as above.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[ERROR] 2024-06-27-11:27:18 (PID:2486090, Device:0, RankID:-1) ERR00100 PTA call acl api failed
```
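The op that aborts here is `aclnnMaskedSelect`, reached from the accuracy line of `compute_loss` quoted at the bottom of the traceback. Purely as an illustration (not a fix shipped by swift), the same accuracy value can be computed from element-wise ops so the masked-select NPU kernel is never launched:

```python
import torch

def masked_accuracy(preds: torch.Tensor, labels: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    # Same value as (masked_select(preds, masks) == masked_select(labels, masks)).float().mean(),
    # assuming masks is a boolean tensor with at least one True element.
    correct = ((preds == labels) & masks).sum()
    return correct.float() / masks.sum().float()
```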
After adding the following environment variable:

```shell
export ASCEND_LAUNCH_BLOCKING=1
```
The error is as follows:

```
[2024-06-27 11:39:22,652] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to npu (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
Train: 0%| | 0/465 [00:00<?, ?it/s]
[W AclInterface.cpp:181] Warning: 0Failed to find function aclrtCreateEventExWithFlag (function operator())
Traceback (most recent call last):
File "/home/ma-user/work/gengyichen/swift/swift/cli/sft.py", line 5, in <module>
sft_main()
File "/home/ma-user/work/gengyichen/swift/swift/utils/run_utils.py", line 27, in x_main
result = llm_x(args, **kwargs)
File "/home/ma-user/work/gengyichen/swift/swift/llm/sft.py", line 310, in llm_sft
trainer.train(training_args.resume_from_checkpoint)
File "/home/ma-user/work/gengyichen/swift/swift/trainers/mixin.py", line 517, in train
res = super().train(resume_from_checkpoint, *args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/trainer.py", line 3238, in training_step
loss = self.compute_loss(model, inputs)
File "/home/ma-user/work/gengyichen/swift/swift/trainers/trainers.py", line 183, in compute_loss
outputs = model(**inputs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
return model_forward(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/peft/peft_model.py", line 1430, in forward
return self.base_model(
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 179, in forward
return self.model.forward(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1149, in forward
outputs = self.model(
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1024, in forward
layer_outputs = self._gradient_checkpointing_func(
File "/home/ma-user/work/gengyichen/swift/swift/llm/utils/model.py", line 4836, in
_old_checkpoint(*args, use_reentrant=use_reentrant, kwargs))
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, *kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(args, kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
ret = function(*args, *kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 748, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 679, in forward
attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: Sync:torch_npu/csrc/framework/OpCommand.cpp:190 NPU error, error code is 507015
[ERROR] 2024-06-27-11:39:29 (PID:2502349, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[Error]: The aicore execution is abnormal.
Rectify the fault based on the error information in the ascend log.
EZ9999: Inner Error!
EZ9999 The error from device(chipId:0, dieId:0), serial number is 14, there is an fftsplus aivector error exception, core id is 40, error code = 0, dump info: pc start: 0x1245ddfd6a04, current: 0x1245ddfddef4, vec error info: 0x77000000bd, mte error info: 0xeeefb3eb7f, ifu error info: 0x671dbfb03d2c0, ccu error info: 0xf041fdd099ae040, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100140080.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1164]
TraceBack (most recent call last):
The extend info: errcode:(0, 0x800, 0) errorStr: The UB address accessed by the VEC instruction is not aligned. fixp_error0 info: 0xfb3eb7f, fixp_error1 info: 0xee fsmId:0, tslot:3, thread:0, ctxid:0, blk:18, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1176]
Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1677]
AICORE Kernel task happen error, retCode=0x26.[FUNC:GetError][FILE:stream.cc][LINE:1454]
Aicore kernel execute failed, device_id=0, stream_id=6, report_stream_id=6, task_id=29509, flip_num=0, fault kernel_name=FlashAttentionScore_213e8781b9323773f21103b53d6e8517_high_performance_10000000000000203943_mix_aic, program id=25, hash=16361232828947559013.[FUNC:GetError][FILE:stream.cc][LINE:1454]
[AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1454]
rtStreamSynchronizeWithTimeout execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
synchronize stream failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
[W NPUStream.cpp:382] Warning: NPU warning, error code is 507015
[Error]: [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
EH9999 wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeUsedDevices)
EH9999: Inner Error!
rtEventQueryWaitStatus execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
EH9999 [Query][Status]query event wait-status failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
[W NPUStream.cpp:365] Warning: NPU warning, error code is 507015
[Error]: [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
EH9999 wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
Train: 0%| | 0/465 [00:09<?, ?it/s]
```
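Both failing runs die inside the NPU `FlashAttentionScore` kernel that backs `torch.nn.functional.scaled_dot_product_attention` (see the `fault kernel_name` lines in the logs above). As a diagnostic only, loading the model with the plain eager attention implementation bypasses that kernel; this sketch uses transformers directly rather than the swift CLI, and the bf16 dtype follows the recommendation earlier in the thread:

```python
import torch
from transformers import AutoModelForCausalLM

# Diagnostic sketch: force the eager attention path so the NPU SDPA / FlashAttentionScore
# kernel from the traceback is never invoked. This is not a swift option; plain transformers.
model = AutoModelForCausalLM.from_pretrained(
    "/home/ma-user/work/gengyichen/models/Qwen2-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)
```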
Desired outcome
Both qwen1.5 and qwen2 can be fine-tuned normally.