mindspore-ai / mindspore

MindSpore is an open-source deep learning training/inference framework that can be used in mobile, edge and cloud scenarios.
https://gitee.com/mindspore/mindspore
Apache License 2.0

`BatchMatMul` kernel runtime error #256

Closed: neoming closed this issue 10 months ago

neoming commented 10 months ago

Environment

Hardware Environment (Ascend/GPU/CPU):

/device gpu

Software Environment:

Describe the current behavior

# clone mindnlp and install it
git clone git@github.com:mindspore-lab/mindnlp.git
cd mindnlp
scripts/build_and_reinstall.sh

# run the unit tests that call BatchMatMul
export RUN_SLOW=1
pytest tests/ut/transformers/models/gpt_bigcode/test_modeling_gpt_bigcode.py::GPTBigCodeModelLanguageGenerationTest
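For reference, below is a minimal standalone sketch of the kind of `BatchMatMul` call these tests exercise on the GPU backend. The shapes, dtype and context settings are illustrative assumptions, not taken from the failing GPTBigCode test, so this sketch is not guaranteed to trigger the same device-side assert.

```python
# Minimal BatchMatMul sketch on the GPU backend (illustrative only; the actual
# failing call comes from mindnlp's GPTBigCode attention and may use different
# shapes/dtypes).
import numpy as np
import mindspore as ms
from mindspore import Tensor, ops

ms.set_context(device_target="GPU")

x = Tensor(np.random.randn(2, 8, 16, 64).astype(np.float16))
y = Tensor(np.random.randn(2, 8, 64, 16).astype(np.float16))

batch_matmul = ops.BatchMatMul()
out = batch_matmul(x, y)  # expected output shape: (2, 8, 16, 16)
print(out.shape)
```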

Runtime error

FAILED tests/ut/transformers/models/gpt_bigcode/test_modeling_gpt_bigcode.py::GPTBigCodeModelLanguageGenerationTest::test_generate_batched - RuntimeError: 
FAILED tests/ut/transformers/models/gpt_bigcode/test_modeling_gpt_bigcode.py::GPTBigCodeModelLanguageGenerationTest::test_generate_simple - RuntimeError: SyncHostToDevice failed
================================================================== 2 failed, 3 warnings in 65.51s (0:01:05) ===================================================================
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.157.404 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:200] SyncStream] cudaStreamSynchronize failed, ret[710], device-side assert triggered
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.157.427 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:202] SyncStream] The kernel name and backtrace in log might be incorrect, since CUDA error might be asynchronously reported at some other function call. Please exporting CUDA_LAUNCH_BLOCKING=1 for more accurate error positioning.
[ERROR] ME(3109200,7f02ee036680,python):2023-12-07-19:25:55.157.441 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:540] WaitTaskFinishOnDevice] SyncStream failed
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.085 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:191] DestroyStream] cudaStreamDestroy failed, ret[710], device-side assert triggered
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.108 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:67] ReleaseDevice] Op Error: Failed to destroy CUDA stream. | Error Number: 0
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.642 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:74] ReleaseDevice] cuDNN Error: Failed to destroy cuDNN handle | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.847 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.858 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:488] operator()] Free device memory[0x7efd1e000000] error.
[ERROR] DEVICE(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.871 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(3109200,7f02ee036680,python):2023-12-07-19:25:55.170.878 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:488] operator()] Free device memory[0x7efc38000000] error.
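The SyncStream message above suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so the device-side assert is reported at the offending kernel instead of at a later stream synchronization. A hedged sketch of doing this from Python is shown below (the variable can equally be exported in the shell before invoking pytest):

```python
# Sketch: force synchronous CUDA kernel launches for more accurate error
# positioning. CUDA_LAUNCH_BLOCKING is a standard CUDA environment variable;
# it must be set before the GPU device context is initialized.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import mindspore as ms
ms.set_context(device_target="GPU")
# ... then re-run the failing GPTBigCode generation test or a repro script.
```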

Log file

Please refer to the attached fail.log below.