snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector

Export silero-vad ONNX to TensorRT #208

Closed · EnlNovius closed this issue 2 years ago

EnlNovius commented 2 years ago


I am trying to convert the supplied ONNX network (files/silero_vad.onnx) to TensorRT (.trt). I tried two tools: trtexec and onnx2trt.

With both tools, I get the following error:

$ trtexec --onnx=files/silero_vad.onnx
...
[08/03/2022-14:44:43] [E] Error[4]: [shuffleNode.cpp::symbolicExecute::392] Error Code 4: Internal Error (Reshape_17: IShuffleLayer applied to shape tensor must have 0 or 1 reshape dimensions: dimensions were [-1,2])
[08/03/2022-14:44:43] [E] [TRT] ModelImporter.cpp:773: While parsing node number 27 [Pad -> "152"]:
[08/03/2022-14:44:43] [E] [TRT] ModelImporter.cpp:774: --- Begin node ---
[08/03/2022-14:44:43] [E] [TRT] ModelImporter.cpp:775: input: "129"
...
[08/03/2022-14:44:43] [E] [TRT] ModelImporter.cpp:776: --- End node ---
[08/03/2022-14:44:43] [E] [TRT] ModelImporter.cpp:779: ERROR: ModelImporter.cpp:180 In function parseGraph:
[6] Invalid Node - Pad_27
[shuffleNode.cpp::symbolicExecute::392] Error Code 4: Internal Error (Reshape_17: IShuffleLayer applied to shape tensor must have 0 or 1 reshape dimensions: dimensions were [-1,2])
[08/03/2022-14:44:43] [E] Failed to parse onnx file

From what I could find, the problem comes from the fact that TensorRT does not currently support 2D shape tensors (https://forums.developer.nvidia.com/t/ishufflelayer-applied-to-shape-tensor-must-have-0-or-1-reshape-dimensions-dimensions-were-1-2/200183). A solution proposed in the replies is to sanitize the model with polygraphy surgeon:

$  polygraphy surgeon sanitize --fold-constants files/silero_vad.onnx -o files/silero_vad_surgeon.onnx

Now if I run either of the two tools, the first problem seems to be solved, but another one arises further on:

$ trtexec --onnx=files/silero_vad_surgeon.onnx
...
[08/03/2022-14:46:22] [E] Error[4]: If_33_OutputLayer: IIfConditionalOutputLayer inputs must have the same shape.
[08/03/2022-14:46:22] [E] [TRT] ModelImporter.cpp:773: While parsing node number 10 [If -> "158"]:
[08/03/2022-14:46:22] [E] [TRT] ModelImporter.cpp:774: --- Begin node ---
[08/03/2022-14:46:22] [E] [TRT] ModelImporter.cpp:775: input: "157"
...
[08/03/2022-14:46:22] [E] [TRT] ModelImporter.cpp:776: --- End node ---
[08/03/2022-14:46:22] [E] [TRT] ModelImporter.cpp:779: ERROR: ModelImporter.cpp:180 In function parseGraph:
[6] Invalid Node - If_33
If_33_OutputLayer: IIfConditionalOutputLayer inputs must have the same shape.
[08/03/2022-14:46:22] [E] Failed to parse onnx file

I haven't found a solution to this yet. Does anyone have an idea how to solve it?

snakers4 commented 2 years ago

> TensorRT does not currently support 2D shape tensors

Interesting, 95% of the network consists of either 1D convolutions or Linear layers. The network contains an internal normalization layer (with padding), which is most likely the cause of the problem, since it has always given us grief during exports. This is one of the reasons we decided to stick to the PyTorch and ONNX formats.

> I am trying to convert the supplied ONNX network (files/silero_vad.onnx) to TensorRT (.trt).

But the main question is, why?

EnlNovius commented 2 years ago

> But the main question is, why?

I'm trying to use silero-vad in real time on an embedded system that already runs several TensorRT neural networks. Switching from a PyTorch model to a TensorRT model normally enables optimization at the inference level (https://developer.nvidia.com/blog/speeding-up-deep-learning-inference-using-tensorflow-onnx-and-tensorrt/).

EnlNovius commented 2 years ago

I just tried onnx-simplifier (https://github.com/daquexian/onnx-simplifier). Simplifying the graph solves both the first and the second problem.

$ onnxsim files/silero_vad.onnx files/silero_vad_onnxsim.onnx

However, a new problem arises:

$ trtexec --onnx=files/silero_vad_onnxsim.onnx
...
[08/03/2022-16:00:23] [E] Error[4]: [graphShapeAnalyzer.cpp::processCheck::587] Error Code 4: Internal Error (Conv_81: spatial dimension of convolution output cannot be negative (build-time output dimension of axis 2 is -5))
[08/03/2022-16:00:23] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
...
EnlNovius commented 2 years ago

Problem solved:

Using onnx2trt, I got the following error:

[2022-08-04 09:00:13   ERROR] 4: [network.cpp::validate::2965] Error Code 4: Internal Error (Network has dynamic or shape inputs, but no optimization profile has been defined.)

I couldn't solve this problem with either of the two tools, so I went back to the NVIDIA documentation (https://developer.nvidia.com/blog/speeding-up-deep-learning-inference-using-tensorflow-onnx-and-tensorrt/ and https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes).

Solution

First, simplify the ONNX file:

$ onnxsim files/silero_vad.onnx files/silero_vad_onnxsim.onnx
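
The same simplification can also be done from Python (a sketch, not from the original post; it assumes the onnxsim package that provides the CLI used above):

```python
import onnx
from onnxsim import simplify

# Load the original model, simplify the graph, and save the result;
# equivalent to the onnxsim command line above.
model = onnx.load("files/silero_vad.onnx")
model_simplified, check = simplify(model)
assert check, "simplified ONNX model could not be validated"
onnx.save(model_simplified, "files/silero_vad_onnxsim.onnx")
```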

Then convert the file to TensorRT using Python (or C++):

```python
import tensorrt as trt
from onnx import ModelProto

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)


def build_engine(onnx_path, shape):
    """
    This is the function to create the TensorRT engine
    Args:
        onnx_path : Path to onnx_file.
        shape : Shape of the input of the ONNX file.
    """
    with trt.Builder(TRT_LOGGER) as builder, \
            builder.create_network(1) as network, \
            builder.create_builder_config() as config, \
            trt.OnnxParser(network, TRT_LOGGER) as parser:
        config.max_workspace_size = (256 << 20)
        profile = builder.create_optimization_profile()
        # Or (512,), (1024,), (1536,) if we want something flexible
        profile.set_shape("input", (1536,), (1536,), (1536,))
        config.add_optimization_profile(profile)
        with open(onnx_path, 'rb') as model:
            parser.parse(model.read())
        network.get_input(0).shape = shape
        engine = builder.build_engine(network, config)
        return engine


def save_engine(engine, file_name):
    buf = engine.serialize()
    with open(file_name, 'wb') as f:
        f.write(buf)


onnx_path = "files/silero_vad_onnxsim.onnx"
model = ModelProto()
with open(onnx_path, "rb") as f:
    model.ParseFromString(f.read())

# The value here does not matter, it just has to be large enough
# to avoid the appearance of a negative dimension.
shape = [1536]
engine = build_engine(onnx_path, shape=shape)
save_engine(engine, "files/silero_vad.engine")
```
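
Optionally, as a quick sanity check (my addition, not part of the original recipe), the engine file can be deserialized again and its bindings listed, reusing the same TensorRT runtime API as above:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine we just wrote and print each binding's name and shape.
with open("files/silero_vad.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for idx in range(engine.num_bindings):
    print(engine.get_binding_name(idx), engine.get_binding_shape(idx))
```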

Code to test with silero:

New class `TrtWrapper` in `utils_vad.py`:

```python
# NB: os, numpy and torch may already be imported in utils_vad.py;
# they are listed here for completeness.
import os
from collections import OrderedDict
from typing import Optional

import numpy as np
import torch
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # You need this to init cuda


def swap_on_key(d: OrderedDict, key1, key2):
    tmp = d[key1]
    d[key1] = d[key2]
    d[key2] = tmp


class TrtWrapper:
    def __init__(self, path):
        trt_logger = trt.Logger(trt.ILogger.Severity.INFO)
        assert os.path.exists(path)
        print("Reading engine from file {}".format(path))
        with open(path, "rb") as f, trt.Runtime(trt_logger) as runtime:
            self.engine: Optional[trt.ICudaEngine] = runtime.deserialize_cuda_engine(f.read())
        self.context: Optional[trt.IExecutionContext] = None
        self.memories: Optional[OrderedDict] = None
        self.stream: Optional[cuda.Stream] = None
        self.output_buffer = None

    def start(self, chunk_duration):
        self.context = self.engine.create_execution_context()
        self.context.set_binding_shape(self.engine.get_binding_index("input"), (chunk_duration,))
        self.memories = OrderedDict()
        for binding in self.engine:
            binding_idx = self.engine.get_binding_index(binding)
            size = trt.volume(self.context.get_binding_shape(binding_idx))
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            # TODO better solution to compute size?
            memory = cuda.mem_alloc(np.zeros(size, dtype=dtype).nbytes)
            self.memories[self.engine.get_binding_name(binding_idx)] = memory
            if self.engine.get_binding_name(binding_idx) == 'output':
                self.output_buffer = cuda.pagelocked_empty(size, dtype)
        assert all(self.memories.values())
        assert self.output_buffer is not None
        self.stream = cuda.Stream()
        self.reset_states()

    def swap_buffers(self):
        swap_on_key(self.memories, 'h0', 'hn')
        swap_on_key(self.memories, 'c0', 'cn')

    def reset_states(self):
        cuda.memcpy_htod_async(self.memories['h0'], np.zeros((2, 1, 64), dtype=np.float32), self.stream)
        cuda.memcpy_htod_async(self.memories['c0'], np.zeros((2, 1, 64), dtype=np.float32), self.stream)

    def __call__(self, x, sr: int):
        x = np.ascontiguousarray(x)
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(self.memories['input'], x, self.stream)
        # Run inference
        self.context.execute_async_v2(bindings=list(self.memories.values()), stream_handle=self.stream.handle)
        # Swap input/output buffers for the recurrent states h and c
        self.swap_buffers()
        # Transfer prediction output from the GPU.
        cuda.memcpy_dtoh_async(self.output_buffer, self.memories['output'], self.stream)
        # Synchronize the stream
        self.stream.synchronize()
        out = torch.tensor(self.output_buffer)[1]
        return out

    def close(self):
        self.context = None
        self.memories = None
        self.stream = None
        self.output_buffer = None
```
Code for test:

```python
import time
from pprint import pprint

from utils_vad import get_speech_timestamps, read_audio, TrtWrapper

SAMPLING_RATE = 16000
chunk_duration = 1536

# model = OnnxWrapper('files/silero_vad.onnx')
# model = OnnxWrapper('files/silero_vad_onnxsim.onnx')
model = TrtWrapper('files/silero_vad.engine')
model.start(chunk_duration)  # Use start to init buffer and cuda

file = 'my_file.wav'
wav = read_audio(file, sampling_rate=SAMPLING_RATE)

# get speech timestamps from full audio file
t0 = time.time()
speech_timestamps = get_speech_timestamps(wav, model,
                                          sampling_rate=SAMPLING_RATE,
                                          window_size_samples=chunk_duration,
                                          min_silence_duration_ms=0)
print(time.time() - t0)
pprint(speech_timestamps)
```

Execution time

Using the Code for test above on a 4-minute audio file, the average time of the get_speech_timestamps function (averaged over 100 calls):

| Model | Average time to process a 4 min audio file |
| --- | --- |
| `torch.jit.load('files/silero_vad.jit')` | 4.34 s |
| `OnnxWrapper('files/silero_vad.onnx')` | 1.11 s |
| `OnnxWrapper('files/silero_vad_onnxsim.onnx')` | 0.93 s |
| `TrtWrapper('files/silero_vad.engine')` | 1.03 s |
snakers4 commented 2 years ago

> on a 4-minute audio file

One audio chunk should take ~1 ms on one CPU thread. ONNX was similar.

By a simple calculation, a 4-minute audio file should take about 2.5 s: 240 s of audio split into 1536-sample chunks at 16 kHz is ~2500 chunks, at ~1 ms each.

The fact that this is ~40 times slower is strange.
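
For reference, a minimal sketch of that calculation (my arithmetic, based on the 1536-sample window and 16 kHz sampling rate from the test code above):

```python
# Back-of-envelope: expected time to process 4 minutes of 16 kHz audio
# in 1536-sample windows at ~1 ms per chunk.
sampling_rate = 16000
chunk_size = 1536
audio_seconds = 4 * 60
n_chunks = audio_seconds * sampling_rate // chunk_size  # 2500 chunks
expected_seconds = n_chunks * 0.001                     # ~2.5 s at 1 ms per chunk
print(n_chunks, expected_seconds)
```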

> import pycuda.autoinit  # You need this to init cuda

There is very little reason to run VAD on CUDA, because there is very little to gain. The model is very fast; most likely it will just incur overhead from copying to and from the GPU.

EnlNovius commented 2 years ago

> The fact that this is ~40 times slower is strange.

I wasn't clear in my results; I have edited the post to correct that. The times originally displayed were for running the network 100 times over the 4 min of audio, so we are at ~0.5 ms per frame.

> There is very little reason to run VAD on CUDA, because there is very little to gain. The model is very fast; most likely it will just incur overhead from copying to and from the GPU.

For my project, the audio data is already on the GPU, since it goes through other networks. According to the documentation, "Using batching or GPU can also improve performance considerably." With the data already on the GPU, and the GPU expected to speed up processing a bit more, I wanted to see what it could give, even though Silero's CPU performance is already remarkable.

Since I managed to run silero on TensorRT, I think this issue is complete. Thanks for the answers and for the very good work on silero-vad :smiley:

snakers4 commented 2 years ago

> so we are at ~0.5 ms per frame.

Seems in line with our benchmarks, albeit we did not test on GPU.

"Using batching or GPU can also improve performance considerably.. Having already the data on GPU and the GPU being supposed to accelerate the processing times a bit more,

Well, batching works better for handling multiple streams at the same time. You can find more details in the discussion via this link; basically, each batch element is one separate stream.
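
For illustration (my sketch, not from the thread), batching over streams just means stacking the current chunk of each stream into one input, assuming a model that accepts input of shape (batch, num_samples):

```python
import torch

# Each batch element carries the current chunk of one independent audio
# stream, so N streams are processed in a single forward pass.
num_streams, chunk_size = 4, 1536
chunks = [torch.randn(chunk_size) for _ in range(num_streams)]  # stand-ins for real audio
batch = torch.stack(chunks)      # shape: (4, 1536), one row per stream
# probs = model(batch, 16000)    # would yield one speech probability per stream
print(batch.shape)
```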

In any case, many thanks for your input on this conversion. Hope someone finds it useful for their use case.

> Since I managed to run silero on TensorRT

Another question is whether the model outputs are similar.
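
A minimal sketch of such a check (my addition; it assumes `OnnxWrapper` exists in `utils_vad` with the same call signature as the `TrtWrapper` above):

```python
import torch

# Feed the same chunk through the ONNX and TensorRT wrappers and compare
# the resulting speech probabilities.
from utils_vad import OnnxWrapper, TrtWrapper

chunk = torch.randn(1536)  # stand-in for a real 1536-sample audio chunk

onnx_model = OnnxWrapper('files/silero_vad_onnxsim.onnx')
trt_model = TrtWrapper('files/silero_vad.engine')
trt_model.start(1536)

p_onnx = float(onnx_model(chunk, 16000))
p_trt = float(trt_model(chunk, 16000))
print(p_onnx, p_trt, abs(p_onnx - p_trt))
```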

I will create a copy of this ticket as a discussion.