k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

Have trouble using sherpa-onnx-offline-websocket-server with cuda provider #1053

Closed. Vergissmeinicht closed this issue 2 months ago.

Vergissmeinicht commented 2 months ago

I followed the instructions at https://k2-fsa.github.io/sherpa/onnx/websocket/offline-websocket.html to start a non-streaming websocket server with a transducer model. It works well with a single client. But when I run the client in multiple threads, i.e. several threads each using a websocket client to recognize wav files one by one at the same time, the server raises a CUDA error:

 2024-06-24 09:47:01.083093543 [E:onnxruntime:, cuda_call.cc:116 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=a2d9f82c2221 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=408 ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(stream_));
 2024-06-24 09:47:01.083005575 [E:onnxruntime:, cuda_call.cc:116 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=a2d9f82c2221 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/gpu_data_transfer.cc ; line=73 ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyDeviceToHost, static_cast<cudaStream_t>(stream.GetHandle()));
 terminate called after throwing an instance of 'Ort::Exception'
   what(): CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=a2d9f82c2221 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=408 ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(stream_));
 Aborted

My server runs on a GeForce RTX 4090 / driver 535.104.05 / CUDA 12.2.

Glad to have your help.

csukuangfj commented 2 months ago

Does the server work fine when you use CPU?

Vergissmeinicht commented 2 months ago

Yes, it works fine when using the CPU provider.

csukuangfj commented 2 months ago

Could you tell us how you start the server? Please post the full command.

Vergissmeinicht commented 2 months ago

CUDA_VISIBLE_DEVICES=2 ./bin/sherpa-onnx-offline-websocket-server \
  --provider=cuda \
  --port=6006 \
  --num-work-threads=10 \
  --tokens=sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt \
  --encoder=sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.onnx \
  --decoder=sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx \
  --joiner=sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.onnx \
  --log-file=./log.txt \
  --max-batch-size=5

csukuangfj commented 2 months ago

Could you change https://github.com/k2-fsa/sherpa-onnx/blob/1f95bff719c869a46be04d1f4481a4e4b0eaeb2a/sherpa-onnx/csrc/offline-websocket-server-impl.cc#L95-L98 to

 recognizer_.DecodeStreams(p_ss.data(), size);
 lock.unlock();

recompile, and retry?

Vergissmeinicht commented 2 months ago

It works fine now. So is it a bug here?

csukuangfj commented 2 months ago

> It works fine now. So is it a bug here?

I think it is a bug in onnxruntime.

When using the CPU provider, the onnxruntime session is thread-safe. However, it is not thread-safe when using the CUDA provider.

Please see https://github.com/microsoft/onnxruntime/issues/114

manickavela29 commented 2 months ago

Hi @csukuangfj,

Is the issue occurring because @Vergissmeinicht is using a local onnxruntime? With the onnxruntime from sherpa-onnx (onnxruntime 1.17.1), it is stable on my machines.

Vergissmeinicht commented 2 months ago

> Hi @csukuangfj,
>
> Is the issue occurring because @Vergissmeinicht is using a local onnxruntime? With the onnxruntime from sherpa-onnx (onnxruntime 1.17.1), it is stable on my machines.

I built sherpa-onnx with no local onnxruntime. The onnxruntime installation is the one provided by CMake.

Vergissmeinicht commented 2 months ago

@csukuangfj The server has been running and has recognized 200k wav files; everything works fine except that memory consumption seems to have increased by nearly 3 GB. There are no other modifications to the source code. Is it possible that a memory leak is happening?

csukuangfj commented 2 months ago

Is it CPU RAM or GPU RAM that increased by 3 GB?

Do you mean 20 000 wavs or just 200 wav files?

csukuangfj commented 2 months ago

> Hi @csukuangfj,
>
> Is the issue occurring because @Vergissmeinicht is using a local onnxruntime? With the onnxruntime from sherpa-onnx (onnxruntime 1.17.1), it is stable on my machines.

@Vergissmeinicht Could you look into this comment?

Vergissmeinicht commented 2 months ago

> Is it CPU RAM or GPU RAM that increased by 3 GB?
>
> Do you mean 20 000 wavs or just 200 wav files?

The server has been running for 2 days now and the memory consumption stays stable. No more worry about a memory leak! :)

Vergissmeinicht commented 2 months ago

> Hi @csukuangfj, is the issue occurring because @Vergissmeinicht is using a local onnxruntime? With the onnxruntime from sherpa-onnx (onnxruntime 1.17.1), it is stable on my machines.

> @Vergissmeinicht Could you look into this comment?

I have replied to this comment already. I built the whole project inside a Docker container without any onnxruntime installed.

csukuangfj commented 2 months ago

Are you also running sherpa-onnx inside the docker container?

Vergissmeinicht commented 2 months ago

> Are you also running sherpa-onnx inside the docker container?

Yes. I use nvidia/cuda:11.1.1-cudnn8-devel-ubuntu20.04 as my base Docker image.

csukuangfj commented 2 months ago

Can it be closed now?

Vergissmeinicht commented 2 months ago

> Can it be closed now?

So does it make a difference whether the recognizer decodes before or after the unlock?

csukuangfj commented 2 months ago

For the CUDA provider, since the onnxruntime session is not thread-safe, we have to decode first and then unlock.

For the CPU provider, the onnxruntime session is thread-safe, so we can unlock first and then decode.