OS platform and distribution (e.g., Linux Ubuntu 20.04):
MindNLP version (master branch latest):
Describe the current behavior
# clone mindnlp and install
git clone git@github.com:mindspore-lab/mindnlp.git
cd mindnlp
scripts/build_and_reinstall.sh
# run the UT that calls BatchMatMul
export RUN_SLOW=1
pytest tests/ut/transformers/models/gpt_bigcode/test_modeling_gpt_bigcode.py::GPTBigCodeModelLanguageGenerationTest
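For context, the failing test exercises a batched matrix multiply (BatchMatMul): the leading dimension is the batch and the trailing two are the matrix dimensions. A minimal NumPy sketch of the operation's shape contract (shapes here are illustrative only, not taken from the failing test):

```python
import numpy as np

# Batched matmul: for each batch index i, c[i] = a[i] @ b[i].
# Shapes are illustrative, not from the GPT-BigCode test.
a = np.random.rand(2, 3, 4)   # (batch, m, k)
b = np.random.rand(2, 4, 5)   # (batch, k, n)
c = np.matmul(a, b)           # (batch, m, n)
print(c.shape)                # (2, 3, 5)
```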
Runtime error:
FAILED tests/ut/transformers/models/gpt_bigcode/test_modeling_gpt_bigcode.py::GPTBigCodeModelLanguageGenerationTest::test_generate_batched - RuntimeError:
FAILED tests/ut/transformers/models/gpt_bigcode/test_modeling_gpt_bigcode.py::GPTBigCodeModelLanguageGenerationTest::test_generate_simple - RuntimeError: SyncHostToDevice failed
================================================================== 2 failed, 3 warnings in 65.51s (0:01:05) ===================================================================
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.157.404 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:200] SyncStream] cudaStreamSynchronize failed, ret[710], device-side assert triggered
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.157.427 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:202] SyncStream] The kernel name and backtrace in log might be incorrect, since CUDA error might be asynchronously reported at some other function call. Please exporting CUDA_LAUNCH_BLOCKING=1 for more accurate error positioning.
[ERROR] ME(3109200,7f02ee036680,python):2023-12-07-19:25:55.157.441 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:540] WaitTaskFinishOnDevice] SyncStream failed
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.085 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:191] DestroyStream] cudaStreamDestroy failed, ret[710], device-side assert triggered
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.108 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:67] ReleaseDevice] Op Error: Failed to destroy CUDA stream. | Error Number: 0
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.642 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:74] ReleaseDevice] cuDNN Error: Failed to destroy cuDNN handle | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.847 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.858 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:488] operator()] Free device memory[0x7efd1e000000] error.
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.871 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.878 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:488] operator()] Free device memory[0x7efc38000000] error.
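The log's own suggestion is worth following: setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the device-side assert is attributed to the kernel that actually triggered it rather than a later synchronization point. A minimal sketch (the pytest path is the one from the repro above):

```shell
# Force synchronous CUDA kernel launches for accurate error attribution,
# as suggested by the MindSpore error log itself.
export CUDA_LAUNCH_BLOCKING=1
# then re-run the failing test, e.g.:
# pytest tests/ut/transformers/models/gpt_bigcode/test_modeling_gpt_bigcode.py::GPTBigCodeModelLanguageGenerationTest
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
```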
Environment
Hardware Environment (Ascend/GPU/CPU): GPU
Software Environment:
Log file
Please refer to the attached file fail.log.