两张V100微调qwen2-7b，单卡微调正常，双卡微调出现RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

整体报错如下： \Users\Administrator\miniconda3\envs\python39\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py:580: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.) attn_output = torch.nn.functional.scaled_dot_product_attention( Traceback (most recent call last):

File ~\miniconda3\envs\python39\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec exec(code, globals, locals)

File d:\ljh\data\未命名0.py:44 result = sft_main(sft_args)

File ~\miniconda3\envs\python39\lib\site-packages\swift\utils\run_utils.py:32 in x_main result = llm_x(args, **kwargs)

File ~\miniconda3\envs\python39\lib\site-packages\swift\llm\sft.py:417 in llm_sft trainer.train(training_args.resume_from_checkpoint)

File ~\miniconda3\envs\python39\lib\site-packages\swift\trainers\mixin.py:552 in train res = super().train(resume_from_checkpoint, *args, **kwargs)

File ~\miniconda3\envs\python39\lib\site-packages\transformers\trainer.py:1938 in train return inner_training_loop(

File ~\miniconda3\envs\python39\lib\site-packages\transformers\trainer.py:2279 in _inner_training_loop tr_loss_step = self.training_step(model, inputs)

File ~\miniconda3\envs\python39\lib\site-packages\transformers\trainer.py:3318 in training_step loss = self.compute_loss(model, inputs)

File ~\miniconda3\envs\python39\lib\site-packages\swift\trainers\trainers.py:165 in compute_loss outputs = model(**inputs)

File ~\miniconda3\envs\python39\lib\site-packages\torch\nn\modules\module.py:1553 in _wrapped_call_impl return self._call_impl(*args, **kwargs)

File ~\miniconda3\envs\python39\lib\site-packages\torch\nn\modules\module.py:1603 in _call_impl result = forward_call(*args, **kwargs)

File ~\miniconda3\envs\python39\lib\site-packages\accelerate\utils\operations.py:819 in forward return model_forward(*args, **kwargs)

File ~\miniconda3\envs\python39\lib\site-packages\accelerate\utils\operations.py:807 in call return convert_to_fp32(self.model_forward(*args, **kwargs))

File ~\miniconda3\envs\python39\lib\site-packages\torch\amp\autocast_mode.py:43 in decorate_autocast return func(*args, **kwargs)

File ~\miniconda3\envs\python39\lib\site-packages\peft\peft_model.py:1577 in forward return self.base_model(

File ~\miniconda3\envs\python39\lib\site-packages\torch\nn\modules\module.py:1553 in _wrapped_call_impl return self._call_impl(*args, **kwargs)

File ~\miniconda3\envs\python39\lib\site-packages\torch\nn\modules\module.py:1562 in _call_impl return forward_call(*args, **kwargs)

File ~\miniconda3\envs\python39\lib\site-packages\peft\tuners\tuners_utils.py:188 in forward return self.model.forward(*args, **kwargs)

File ~\miniconda3\envs\python39\lib\site-packages\accelerate\hooks.py:170 in new_forward return module._hf_hook.post_forward(module, output)

File ~\miniconda3\envs\python39\lib\site-packages\accelerate\hooks.py:387 in post_forward output = send_to_device(output, self.input_device, skip_keys=self.skip_keys)

File ~\miniconda3\envs\python39\lib\site-packages\accelerate\utils\operations.py:183 in send_to_device {

File ~\miniconda3\envs\python39\lib\site-packages\accelerate\utils\operations.py:184 in k: t if k in skip_keys else send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys)

File ~\miniconda3\envs\python39\lib\site-packages\accelerate\utils\operations.py:155 in send_to_device return tensor.to(device, non_blocking=non_blocking)

RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [4,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [6,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [8,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [10,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [11,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [12,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [13,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [14,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [15,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [16,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [17,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [18,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [19,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [20,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [21,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [22,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [23,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [24,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [25,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [26,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [27,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [28,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [29,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [30,0,0] Assertion t >= 0 && t < n_classes failed. C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Loss.cu:250: block: [0,0,0], thread: [31,0,0] Assertion t >= 0 && t < n_classes failed.

modelscope / ms-swift

两张V100微调qwen2-7b，单卡微调正常，双卡微调出现RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. #1882