Closed: tn-17 closed this issue 5 months ago.
I installed onnxruntime-gpu specifically for CUDA 12.x, following the instructions from https://onnxruntime.ai/docs/install.
That won't affect the onnxruntime used in sherpa-onnx.
Could you try CUDA 11.8, since we are using onnxruntime 1.17.1 in sherpa-onnx?
I have uninstalled CUDA 12.4 and installed CUDA 11.8. Then I ran python setup.py install again to rebuild and install sherpa-onnx for Nvidia GPU, and ran the offline-tts-play.py example again. This got past the onnxruntime_providers_cuda.dll error. However, a new error appeared:
Could not locate cublasLt64_12.dll. Please make sure it is in your library path!
After some Google searching, I think this means that I need to update CUDA to a newer version?
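One way to confirm which cublasLt DLL the loader can actually see, before reinstalling anything, is to scan the PATH in order, since the first matching directory wins. This is a stdlib-only sketch; find_dll_in_path is a hypothetical helper, not part of sherpa-onnx or onnxruntime:

```python
import os

def find_dll_in_path(dll_name: str) -> list:
    """Return every directory on PATH that contains dll_name, in search order.

    The first hit is the copy Windows will actually load; later hits are
    shadowed copies that can hint at version conflicts between installs.
    """
    hits = []
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, dll_name)):
            hits.append(directory)
    return hits

if __name__ == "__main__":
    for name in ("cublasLt64_11.dll", "cublasLt64_12.dll"):
        print(name, "->", find_dll_in_path(name) or "NOT FOUND on PATH")
```

If cublasLt64_12.dll is not found anywhere while onnxruntime asks for it, the installed CUDA toolkit and the onnxruntime build were made for different CUDA major versions.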
python piper_stream_example.py --vits-model=./en_US-libritts_r-medium.onnx --vits-tokens=./tokens.txt --vits-data-dir=./espeak-ng-data --output-filename=./test.wav --provider cuda --debug True 'This is a test'
Namespace(vits_model='./en_US-libritts_r-medium.onnx', vits_lexicon='', vits_tokens='./tokens.txt', vits_data_dir='./espeak-ng-data', vits_dict_dir='', tts_rule_fsts='', output_filename='./test.wav', sid=0, debug=True, provider='cuda', num_threads=1, speed=1.0, text='This is a test')
2024-05-15 00:12:24,597 INFO [piper_stream_example.py:320] Loading model ...
2024-05-15 00:12:26.0113635 [W:onnxruntime:, transformer_memcpy.cc:74 onnxruntime::MemcpyTransformer::ApplyImpl] 28 Memcpy nodes are added to the graph torch_jit for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-05-15 00:12:26.0302070 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-15 00:12:26.0345325 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Could not locate cublasLt64_12.dll. Please make sure it is in your library path!
CUDA 11.8 contains cublasLt64_11.dll, so I uninstalled 11.8 and installed 12.2. I did not rebuild and reinstall sherpa-onnx for GPU.
I tried running the offline-tts-play.py example and encountered the onnxruntime_providers_cuda.dll error again.
Next, I reinstalled 11.8, keeping 12.2 as well, since it is possible to have multiple installations. I updated the PATH back to 11.8. I retried the example again and it got further; this time, it produced an error about zlibwapi.dll. I got zlibwapi.dll from http://www.winimage.com/zLibDll/ as per the NVIDIA CUDA and cuDNN installation instructions.
python piper_stream_example.py --vits-model=./en_US-libritts_r-medium.onnx --vits-tokens=./tokens.txt --vits-data-dir=./espeak-ng-data --output-filename=./test.wav --provider cuda --debug True 'This is a test'
Namespace(vits_model='./en_US-libritts_r-medium.onnx', vits_lexicon='', vits_tokens='./tokens.txt', vits_data_dir='./espeak-ng-data', vits_dict_dir='', tts_rule_fsts='', output_filename='./test.wav', sid=0, debug=True, provider='cuda', num_threads=1, speed=1.0, text='This is a test')
2024-05-15 00:32:10,848 INFO [piper_stream_example.py:320] Loading model ...
2024-05-15 00:32:11.9375753 [W:onnxruntime:, transformer_memcpy.cc:74 onnxruntime::MemcpyTransformer::ApplyImpl] 28 Memcpy nodes are added to the graph torch_jit for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-05-15 00:32:11.9569339 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-15 00:32:11.9607816 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
C:\Users\T\Desktop\Code\ai\stella\sherpa-onnx\sherpa-onnx\csrc\offline-tts-vits-model.cc:Init:79 ---vits model---
model_type=vits
comment=piper
has_espeak=1
language=English
voice=en-us
n_speakers=904
sample_rate=22050
----------input names----------
0 input
1 input_lengths
2 scales
3 sid
----------output names----------
0 output
2024-05-15 00:32:12,393 INFO [piper_stream_example.py:322] Loading model done.
2024-05-15 00:32:12,394 INFO [piper_stream_example.py:330] Start generating ...
C:\Users\T\Desktop\Code\ai\stella\sherpa-onnx\sherpa-onnx/csrc/offline-tts-vits-impl.h:Generate:165 Raw text: This is a test
Could not load library zlibwapi.dll. Error code 193. Please verify that the library is built correctly for your processor architecture (32-bit, 64-bit)
I was using the precompiled DLLs for 32-bit. I downloaded the correct 64-bit ones from http://www.winimage.com/zLibDll/ and now there are no errors when running the offline-tts-play.py example.
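Error code 193 (ERROR_BAD_EXE_FORMAT) almost always means an architecture mismatch like this 32-bit/64-bit one. When it is unclear which variant of a DLL you have, the machine field of its PE header tells you. This is a stdlib-only sketch; dll_architecture is a hypothetical helper name:

```python
import struct

# IMAGE_FILE_HEADER.Machine values from the PE/COFF format.
MACHINE = {0x014C: "x86 (32-bit)", 0x8664: "x64 (64-bit)", 0xAA64: "ARM64"}

def dll_architecture(path: str) -> str:
    """Report the target architecture recorded in a DLL/EXE's PE header.

    Loading a 32-bit DLL from a 64-bit process fails with Windows error
    code 193, which matches the zlibwapi.dll error above.
    """
    with open(path, "rb") as f:
        if f.read(2) != b"MZ":
            raise ValueError("not a PE file (missing MZ header)")
        f.seek(0x3C)                      # offset of the PE header pointer
        (pe_offset,) = struct.unpack("<I", f.read(4))
        f.seek(pe_offset)
        if f.read(4) != b"PE\x00\x00":
            raise ValueError("not a PE file (missing PE signature)")
        (machine,) = struct.unpack("<H", f.read(2))
    return MACHINE.get(machine, hex(machine))
```

Running dll_architecture("zlibwapi.dll") from the folder containing the DLL would have flagged the 32-bit copy immediately.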
Thank you for the help! @csukuangfj
By the way, is there a performance issue with onnxruntime gpu?
I am finding that cpu is faster than gpu when measuring the "time in seconds to receive the first message" for generating the tts audio.
My GPU is an RTX 3090. My CPU is an i9-14900K.
2024-05-15 00:50:46.3377129 [W:onnxruntime:, transformer_memcpy.cc:74 onnxruntime::MemcpyTransformer::ApplyImpl] 28 Memcpy nodes are added to the graph torch_jit for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-05-15 00:50:46.3555852 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-15 00:50:46.3603249 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Glad to hear that you finally managed to run sherpa-onnx with GPU on Windows.
I am finding that cpu is faster than gpu when measuring the "time in seconds to receive the first message" for generating the tts audio.
GPU needs warmup. Also, the advantage of GPU is parallel processing.
Moving data between CPU and GPU also takes time. In other words, GPU is not necessarily faster than CPU if you want to synthesize a single utterance.
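Because of that one-time warmup cost, "time to first message" should be measured after at least one throwaway run, or the GPU number mostly reflects kernel initialization rather than inference speed. A minimal timing harness, assuming fn is a zero-argument wrapper around your TTS generate call (the helper name is mine, not a sherpa-onnx API):

```python
import time

def first_call_latency(fn, warmup_runs=0):
    """Time a single call to fn after an optional number of warmup calls.

    CUDA kernels and cuBLAS handles are initialized lazily on first use,
    so warmup_runs=1 or more separates that one-time cost from steady-state
    latency when comparing CPU and GPU providers.
    """
    for _ in range(warmup_runs):
        fn()                      # absorb one-time initialization cost
    start = time.perf_counter()
    fn()                          # the call we actually measure
    return time.perf_counter() - start
```

Comparing first_call_latency(gpu_fn, warmup_runs=0) against warmup_runs=1 typically shows a large gap on GPU and almost none on CPU.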
Since www.winimage.com is unreachable now, I have uploaded the dll here: zlib123dllx64.zip (downloaded from http://www.winimage.com/zLibDll/zlib123dllx64.zip).
You can try placing the zlibwapi.dll file into the CUDA bin directory (such as C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin).
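If you prefer to script that copy step, a minimal sketch using only the standard library (the helper name and both paths are examples, not fixed locations):

```python
import shutil
from pathlib import Path

def install_zlibwapi(dll_path: str, cuda_bin: str) -> str:
    """Copy zlibwapi.dll into the CUDA bin directory, which is already on
    PATH, so cuDNN can locate it at load time. Returns the destination path.
    """
    src = Path(dll_path)
    dst_dir = Path(cuda_bin)
    if not src.is_file():
        raise FileNotFoundError(src)
    if not dst_dir.is_dir():
        raise NotADirectoryError(dst_dir)
    # copy2 preserves timestamps in addition to the file contents
    return shutil.copy2(src, dst_dir / src.name)
```

For example, install_zlibwapi(r"C:\Downloads\zlibwapi.dll", r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin") would place the DLL next to the CUDA runtime DLLs.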
I am unsure if this is an issue with the sherpa-onnx GPU installation or the onnxruntime-gpu installation.
I am using Windows 11 with Python 3.10.11.
I have CUDA 12.4, cuDNN 8.9.2.26, and zlib 1.3.1 installed and added to the PATH.
I followed the requirement guidelines from: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements
"ONNX Runtime built with CUDA 11.8 should be compatible with any CUDA 11.x version; ONNX Runtime built with CUDA 12.2 should be compatible with any CUDA 12.x version.
ONNX Runtime built with cuDNN 8.x are not compatible with cuDNN 9.x."
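The quoted rule boils down to matching the CUDA major version between the onnxruntime-gpu build and the installed toolkit. A tiny sketch of that check (hypothetical helper, not an official onnxruntime API):

```python
def ort_cuda_compatible(ort_cuda_build: str, installed_cuda: str) -> bool:
    """Apply the ONNX Runtime compatibility rule quoted above: a build
    against CUDA 11.8 works with any 11.x, a build against 12.2 with any
    12.x. In other words, only the major version has to match.
    """
    return ort_cuda_build.split(".")[0] == installed_cuda.split(".")[0]
```

So an onnxruntime 1.17.1 wheel built against CUDA 11.8 with CUDA 12.4 installed fails this check, which is consistent with the cublasLt64 errors above.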
I installed onnxruntime-gpu specifically for CUDA 12.x, following the instructions from https://onnxruntime.ai/docs/install.
I ran python setup.py install from the sherpa-onnx repo directory, following the Method 2 Nvidia GPU (CUDA) install instructions at https://k2-fsa.github.io/sherpa/onnx/python/install.html.
The installation succeeds, but running the offline-tts-play.py example (https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/offline-tts-play.py) with --provider cuda results in the following error.
Running offline-tts-play.py example code. Logs:
Some errors that appear in the build logs are:
-- Failed to find all ICU components (missing: ICU_INCLUDE_DIR ICU_LIBRARY _ICU_REQUIRED_LIBS_FOUND)
-- Could NOT find ZLIB (missing: ZLIB_INCLUDE_DIR)
-- Could NOT find ASIOSDK (missing: ASIOSDK_ROOT_DIR ASIOSDK_INCLUDE_DIR)
python setup.py install Logs: