microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License
14.85k stars 2.94k forks source link

CUDA Graph Error - CUDA failure 900: operation not permitted when stream is capturing #15002

Open tianleiwu opened 1 year ago

tianleiwu commented 1 year ago

Describe the issue

During cuda graph catpure, ORT will trigger cudaStreamSynchronize, which is not allowed in CUDA graph catpure. Call stack is like the following:

libonnxruntime_providers_cuda.so!onnxruntime::CudaStream::CleanUpOnRunEnd(onnxruntime::CudaStream * const this) git\onnxruntime\onnxruntime\core\providers\cuda\cuda_stream_handle.cc:141)
onnxruntime_pybind11_state.so!onnxruntime::DeviceStreamCollectionImpl::CleanUp(onnxruntime::DeviceStreamCollectionImpl * const this, bool sync_streams) git\onnxruntime\onnxruntime\core\framework\device_stream_collection.cc:30)
onnxruntime_pybind11_state.so!onnxruntime::DeviceStreamCollection::CleanUp(onnxruntime::DeviceStreamCollection * const this, bool sync_streams) git\onnxruntime\onnxruntime\core\framework\device_stream_collection.cc:113)
onnxruntime_pybind11_state.so!onnxruntime::utils::ExecuteGraph(const onnxruntime::SessionState & session_state, onnxruntime::FeedsFetchesManager & feeds_fetches_manager, gsl::span<OrtValue const, 18446744073709551615> feeds, std::vector<OrtValue, std::allocator<OrtValue> > & fetches, ExecutionMode execution_mode, const bool & terminate_flag, const onnxruntime::logging::Logger & logger, bool sync_execution_provider, bool only_execute_path_to_fetches, onnxruntime::Stream * parent_stream) git\onnxruntime\onnxruntime\core\framework\utils.cc:782)
onnxruntime_pybind11_state.so!onnxruntime::utils::ExecuteGraph(const onnxruntime::SessionState & session_state, onnxruntime::FeedsFetchesManager & feeds_fetches_manager, gsl::span<OrtValue const, 18446744073709551615> feeds, std::vector<OrtValue, std::allocator<OrtValue> > & fetches, ExecutionMode execution_mode, const onnxruntime::RunOptions & run_options, const onnxruntime::logging::Logger & logger) git\onnxruntime\onnxruntime\core\framework\utils.cc:817)
onnxruntime_pybind11_state.so!onnxruntime::InferenceSession::Run(onnxruntime::InferenceSession * const this, const onnxruntime::RunOptions & run_options, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615> feed_names, gsl::span<OrtValue const, 18446744073709551615> feeds, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615> output_names, std::vector<OrtValue, std::allocator<OrtValue> > * p_fetches, const std::vector<OrtDevice, std::allocator<OrtDevice> > * p_fetches_device_info) git\onnxruntime\onnxruntime\core\session\inference_session.cc:2001)
onnxruntime_pybind11_state.so!onnxruntime::InferenceSession::Run(onnxruntime::InferenceSession * const this, const onnxruntime::RunOptions & run_options, onnxruntime::IOBinding & io_binding) git\onnxruntime\onnxruntime\core\session\inference_session.cc:2155)
onnxruntime_pybind11_state.so!onnxruntime::InferenceSession::Run(onnxruntime::InferenceSession * const this, onnxruntime::IOBinding & io_binding) git\onnxruntime\onnxruntime\core\session\inference_session.cc:2160)
onnxruntime_pybind11_state.so!onnxruntime::python::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)>::operator()(onnxruntime::python::PyInferenceSession *, onnxruntime::SessionIOBinding &, onnxruntime::RunOptions *) const(const onnxruntime::python::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)> * const __closure, onnxruntime::python::PyInferenceSession * sess, onnxruntime::SessionIOBinding & io_binding, onnxruntime::RunOptions * run_options) git\onnxruntime\onnxruntime\python\onnxruntime_pybind_state.cc:1668)
onnxruntime_pybind11_state.so!pybind11::detail::argument_loader<onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, OrtRunOptions*>::call_impl<void, onnxruntime::python::addObjectMethods(pybind11::module&, onnxruntime::Environment&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)>&, 0, 1, 2, pybind11::detail::void_type>(onnxruntime::python::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)> &, std::index_sequence, pybind11::detail::void_type &&)

Error is like the following (I added file and line):

2023-03-10 11:24:05.061767687 [E:onnxruntime:Default, cuda_call.cc:116 CudaCall] CUDA failure 900: operation not permitted when stream is capturing ; GPU=0 ; hostname=??; file=/work/tlwu/git/onnxruntime/onnxruntime/core/providers/cuda/cuda_stream_handle.cc ; line=141 ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(GetHandle()));

To reproduce

The error is not always triggered with small model. But with larger model like unet, it can always reproduce.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

hariharans29 commented 1 year ago

Hi @feihugis - I recall you saying that the model your team flighted also used CUDA Graph. Did you run into issues like the above while trying to capture the graph ? AFAIK - Cuda stream synchronize has always existed in the code. I wonder why we didn't see something like this while testing your model.

hariharans29 commented 1 year ago

@tianleiwu - Could it be that in the "large" unet model, it is using a kernel that internally uses cudaStreamSynchronize() ? This may be one of the cases where we can't use CUDA Graphs unfortunately.

For the "small" model, it may be that the stream synchronize using op/kernel doesn't kick-in? If you look at the CUDA EP setup that captures the graph, we first finish capturing the graph in OnRunEnd() here - https://github.com/microsoft/onnxruntime/blob/a8ad0edbeb45a1733d5b062acc13c6b3ad08731b/onnxruntime/core/providers/cuda/cuda_execution_provider.cc#L387 and only then do the stream sync here - https://github.com/microsoft/onnxruntime/blob/a8ad0edbeb45a1733d5b062acc13c6b3ad08731b/onnxruntime/core/providers/cuda/cuda_execution_provider.cc#L397 before returning control back to the caller.

Unfortunately, if one of the intermediate kernels it encounters between graph capture begin and graph capture end contains synchronization logic, it cannot be captured.

feihugis commented 1 year ago

Hi @feihugis - I recall you saying that the model your team flighted also used CUDA Graph. Did you run into issues like the above while trying to capture the graph ? AFAIK - Cuda stream synchronize has always existed in the code. I wonder why we didn't see something like this while testing your model.

Hi @hariharans29 and @tianleiwu sorry for the late response. I did not see this message and suddenly saw it when I search my email for something else.

Yes, the model we had mainstreamed around one year ago did not meet any issue when capturing the CUDA graph.

Recently when I tried GPT2+Beam Search, I met similar issues. After making some codes changes (https://github.com/feihugis/onnxruntime/commit/de67b88bb775e7700f9a685511f0fab391c24cd6), CUDA Graph capturing can work, but as some of ops are not on GPU, the outputs are not correct.

Please feel free to ping me on Team if I missed your comments.

snnn commented 4 months ago

I still see this error when running multiple models in parallel. You can reproduce the error by running:

./onnx_test_runner -e cuda /data/onnx 

The folder /data/onnx holds test models and their input/output data from https://github.com/onnx/onnx

snnn commented 4 months ago

2024-07-23 16:30:08.420038342 [E:onnxruntime:Default, dataitem_request.cc:32 operator()] argmin_default_axis_random:Non-zero status code returned while running ArgMin node. Name:'' Status Message: CUDA error cudaErrorStreamCaptureUnsupported:operation not permitted when stream is capturing 2024-07-23 16:30:08.889316320 [E:onnxruntime:clip, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running Clip node. Name:'' Status Message: CUDA error cudaErrorStreamCaptureUnsupported:operation not permitted when stream is capturing

tianleiwu commented 4 months ago

@snnn, this issue is for cuda graph error in single thread. Your reported error is another issue of multi-threading.

Stream capturing error shall not appear when cuda graph is not enabled. If you see that error in onnx test runner, that basically means ORT has some code is not thread-safe, which cause buffer overrun and mess up the call stack.