microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Multi-threaded GPU inferencing failing with whisper-small: Non-zero status code returned while running DecoderMaskedMultiHeadAttention node #21413

Open david-sitsky opened 3 months ago

david-sitsky commented 3 months ago

Describe the issue

I created a Whisper ONNX model using https://github.com/microsoft/Olive/blob/main/examples/whisper/README.md, specifically using these commands on a machine with a GPU:

cd Olive/examples/whisper
python3 prepare_whisper_configs.py --model_name openai/whisper-small --multilingual --no_audio_decoder

olive run --config whisper_gpu_fp32.json --setup
olive run --config whisper_gpu_fp32.json

In my application, I break up large audio files into chunks and then execute each chunk against the Whisper model for transcription using threads.

On a CPU machine (with a model generated on a non-GPU machine) this works fine. However, on a GPU machine it fails.

To reproduce

On a g5.4xlarge instance in AWS, running Ubuntu 22.04 with CUDA 11.8 and an NVIDIA A10G GPU, the following program reproduces the error:

from multiprocessing.pool import ThreadPool
import onnxruntime
from onnxruntime_extensions import get_library_path as _lib_path
import numpy as np
import librosa

ort_session_options = onnxruntime.SessionOptions()
ort_session_options.register_custom_ops_library(_lib_path())

ort_session = onnxruntime.InferenceSession("whisper_small.onnx", ort_session_options, providers=['CUDAExecutionProvider'])

def infer(i):
    audio_blob, _ = librosa.load("speech.wav")
    inputs = {
        "audio_pcm": np.expand_dims(audio_blob, axis=0),
        "max_length": np.asarray([200], dtype=np.int32),
        "min_length": np.asarray([0], dtype=np.int32),
        "num_beams": np.asarray([2], dtype=np.int32),
        "num_return_sequences": np.asarray([1], dtype=np.int32),
        "length_penalty": np.asarray([1.0], dtype=np.float32),
        "repetition_penalty": np.asarray([1.0], dtype=np.float32),
        "decoder_input_ids": np.expand_dims(np.asarray([50258, 50360], dtype=np.int32), axis=0)
    }

    return ort_session.run(None, inputs)[0]

THREAD_NUMBER = 10
pool = ThreadPool(THREAD_NUMBER)

# Run several rounds of concurrent inference against the shared session.
for i in range(10):
    pool.map(infer, range(THREAD_NUMBER))
    print(f"Iteration: {i} done")

pool.close()
pool.join()

An example of an error run:

$ python3 tryme.py 
2024-07-18 23:30:42.174926508 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 3 Memcpy nodes are added to the graph main_graph_beam-search-test_og_BpeDecoder_2 for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-07-18 23:30:42.176827822 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 4 Memcpy nodes are added to the graph decoder subgraph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-07-18 23:30:42.182986135 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-07-18 23:30:42.183008286 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-07-18 23:30:46.346349145 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running DecoderMaskedMultiHeadAttention node. Name:'Attention_0' Status Message: Input sequence length should be 1 to use DecoderMaskedMultiHeadAttention. Actual length is 413
2024-07-18 23:30:46.346977146 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running DecoderMaskedMultiHeadAttention node. Name:'Attention_0' Status Message: Input sequence length should be 1 to use DecoderMaskedMultiHeadAttention. Actual length is 413
Traceback (most recent call last):
  File "/home/ubuntu/ugh/tryme.py", line 33, in <module>
    pool.map(infer, range(THREAD_NUMBER))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/ubuntu/ugh/tryme.py", line 27, in infer
    ort_session.run(None, inputs)[0]
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running DecoderMaskedMultiHeadAttention node. Name:'Attention_0' Status Message: Input sequence length should be 1 to use DecoderMaskedMultiHeadAttention. Actual length is 413
Exception ignored in: <function Pool.__del__ at 0x7ae33d8b57e0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 271, in __del__
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 371, in put
AttributeError: 'NoneType' object has no attribute 'dumps'

The program works fine when THREAD_NUMBER=1.

The speech.wav file can be downloaded from https://resources.djl.ai/audios/speech.wav.

If it helps, I can try to put the model I generated somewhere, but it is 1.1 GB in size.

Urgency

This is a blocker for deploying our application, so it is urgent. We are actually using the Java bindings of onnxruntime with https://github.com/deepjavalibrary/djl-serving, which hits this very issue when it receives concurrent requests. I wrote the Python program so the problem is easier to reproduce, but it is exactly the same error message.

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu-1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8

david-sitsky commented 3 months ago

To save time, the whisper_small.onnx file can be downloaded from here: https://drive.google.com/file/d/10yDz-VI-iKsszNgyOjbRqYTfwUHap_a-/view?usp=sharing.

tianleiwu commented 3 months ago

It is a known issue that some operators are not thread-safe; for example, the Attention and MultiHeadAttention operators used in the Whisper encoder are not thread-safe. You may try setting the environment variable ORT_DISABLE_FUSED_ATTENTION=1. However, that will increase latency since some fused attention kernels are disabled.
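
A minimal sketch of setting that variable from Python (assuming it needs to be in the process environment before the CUDA session is created; alternatively export it in the shell before launching the process):

import os

# Assumption: ORT_DISABLE_FUSED_ATTENTION is read when the session is created,
# so set it before constructing the InferenceSession.
os.environ["ORT_DISABLE_FUSED_ATTENTION"] = "1"

import onnxruntime
from onnxruntime_extensions import get_library_path as _lib_path

ort_session_options = onnxruntime.SessionOptions()
ort_session_options.register_custom_ops_library(_lib_path())
ort_session = onnxruntime.InferenceSession(
    "whisper_small.onnx", ort_session_options, providers=["CUDAExecutionProvider"]
)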

Another workaround is to use a different session per thread, which avoids the thread-safety issue, but it probably won't help performance since multiple sessions compete for the same GPU resources.
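
A rough sketch of the session-per-thread workaround, where each worker lazily creates its own InferenceSession (at the cost of loading the model once per thread):

import threading

import onnxruntime
from onnxruntime_extensions import get_library_path as _lib_path

_tls = threading.local()

def get_session():
    # Lazily create one InferenceSession per worker thread so no session
    # is shared across threads. Each session loads its own copy of the model.
    if not hasattr(_tls, "session"):
        opts = onnxruntime.SessionOptions()
        opts.register_custom_ops_library(_lib_path())
        _tls.session = onnxruntime.InferenceSession(
            "whisper_small.onnx", opts, providers=["CUDAExecutionProvider"]
        )
    return _tls.session

def infer(inputs):
    return get_session().run(None, inputs)[0]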

What is the reason for using multiple threads with the same session? It usually has no performance benefit. Instead, you can try increasing the batch size to see whether that improves throughput.
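
For batching, something along these lines may work (a sketch only: it assumes the exported model accepts a batch dimension on audio_pcm and decoder_input_ids, and it zero-pads shorter chunks to a common length):

import numpy as np

def make_batched_feed(audio_chunks):
    # audio_chunks: list of 1-D float arrays, one per audio segment.
    max_len = max(len(a) for a in audio_chunks)
    audio = np.stack(
        [np.pad(a, (0, max_len - len(a))) for a in audio_chunks]
    ).astype(np.float32)
    n = len(audio_chunks)
    return {
        "audio_pcm": audio,
        "max_length": np.asarray([200], dtype=np.int32),
        "min_length": np.asarray([0], dtype=np.int32),
        "num_beams": np.asarray([2], dtype=np.int32),
        "num_return_sequences": np.asarray([1], dtype=np.int32),
        "length_penalty": np.asarray([1.0], dtype=np.float32),
        "repetition_penalty": np.asarray([1.0], dtype=np.float32),
        # One decoder prompt per batch row.
        "decoder_input_ids": np.tile(
            np.asarray([[50258, 50360]], dtype=np.int32), (n, 1)
        ),
    }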

david-sitsky commented 3 months ago

Sadly, setting the environment variable did not help. I still see similar errors, but also new ones that look even worse:

2024-07-19 05:32:03.173536756 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running MatMul node. Name:'/whisper_decoder_init/proj_out/MatMul' Status Message: CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=312 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop, UseTF32()); 
Aborted (core dumped)

I definitely want to use the same session so the model is only loaded once into GPU memory. It is true that batching could be used here instead, and I'll look into that.

I was initially looking at threading because the Whisper model's pre-processing, which reads the audio data and converts it into the appropriate format (padding, then log-mel spectrogram conversion), is CPU-only work that could be parallelised. With a batching approach, this pre-processing work will be serialised.

The documentation I read indicated that an ORT session is thread-safe. Since that is evidently not the case, is there a documented list of the operators that are not?

tianleiwu commented 3 months ago

@david-sitsky, you may try using multiple threads for the CPU pre-processing to see whether that helps.
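
A sketch of that idea (chunk_files is hypothetical and stands in for the real chunking logic): decode the audio on a thread pool, then run the GPU session from a single thread so no operator is executed concurrently:

from multiprocessing.pool import ThreadPool

import librosa
import numpy as np

# Hypothetical list of pre-split audio chunks.
chunk_files = [f"chunk_{i}.wav" for i in range(10)]

def preprocess(path):
    # CPU-only work: decode the audio and build the same input feed as in the
    # repro script above.
    audio_blob, _ = librosa.load(path)
    return {
        "audio_pcm": np.expand_dims(audio_blob, axis=0),
        "max_length": np.asarray([200], dtype=np.int32),
        "min_length": np.asarray([0], dtype=np.int32),
        "num_beams": np.asarray([2], dtype=np.int32),
        "num_return_sequences": np.asarray([1], dtype=np.int32),
        "length_penalty": np.asarray([1.0], dtype=np.float32),
        "repetition_penalty": np.asarray([1.0], dtype=np.float32),
        "decoder_input_ids": np.expand_dims(
            np.asarray([50258, 50360], dtype=np.int32), axis=0
        ),
    }

with ThreadPool(8) as pool:
    feeds = pool.map(preprocess, chunk_files)

# ort_session is the single InferenceSession created as in the repro script;
# inference stays on one thread.
results = [ort_session.run(None, feed)[0] for feed in feeds]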

The new error indicates that other places in the CUDA provider also have thread-safety issues. It may take time to nail down the root cause.

A related older issue: https://github.com/microsoft/onnxruntime/issues/18806

david-sitsky commented 2 months ago

@tianleiwu - any ideas on next steps for the thread-safety issue? I am using djl-serving, so the server sometimes receives multiple unrelated requests from different clients against the same Whisper model, and I see the issue being hit pretty easily. I am using ONNX Runtime 1.17.3.

2024-08-06 04:54:13.828882272 [E:onnxruntime:ort-java, cuda_call.cc:116 CudaCall] CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=324 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop); 

2024-08-06 04:54:13.828924164 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running MatMul node. Name:'/whisper_decoder_init/proj_out/MatMul' Status Message: CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=324 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop); 

2024-08-06 04:54:13.829005579 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running WhisperBeamSearch node. Name:'BeamSearch_node' Status Message: Non-zero status code returned while running MatMul node. Name:'/whisper_decoder_init/proj_out/MatMul' Status Message: CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=ip-172-31-31-23 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/math/matmul.cc ; line=324 ; expr=cublasGemmHelper( GetCublasHandle(ctx), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->MutableData<T>()), ldc, device_prop); 

terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'

tianleiwu commented 2 months ago

@david-sitsky, the BeamSearch operator is not thread-safe because it has internal state.

One approach for serving is to keep a queue of users' requests and batch them into a single onnxruntime inference call, which avoids multi-threaded Run calls. I am not sure whether djl-serving supports that.
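
A bare-bones sketch of that pattern, using a single worker thread that owns the session (this only serialises the requests; actually batching the queued requests together would be a further step and would also require padding the audio to a common length):

import queue
import threading

request_queue = queue.Queue()

def inference_worker(ort_session):
    # A single thread owns the session, so WhisperBeamSearch never runs
    # concurrently even when many clients submit requests at once.
    while True:
        feed, reply = request_queue.get()
        try:
            reply.put(ort_session.run(None, feed)[0])
        except Exception as exc:  # hand the error back to the caller
            reply.put(exc)

def submit(feed):
    # Called from any request-handling thread.
    reply = queue.Queue(maxsize=1)
    request_queue.put((feed, reply))
    result = reply.get()
    if isinstance(result, Exception):
        raise result
    return result

# threading.Thread(target=inference_worker, args=(ort_session,), daemon=True).start()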

I could take a look at a design change to make it thread-safe. That might be targeted for the 1.20 release (3+ months away).

david-sitsky commented 2 months ago

Many thanks @tianleiwu - that would be great!