k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

Slow Model Initialization #211

Closed. w11wo closed this issue 5 months ago.

w11wo commented 1 year ago

Hi, I've just been playing with sherpa-onnx and I find that model initialization is quite slow compared to sherpa-ncnn. I noticed this on several desktop OSes (tested on macOS, Linux, and Windows). Oddly enough, I didn't notice slow initialization on my iOS device. I wonder if we could somehow speed it up?

A minimal reproducible example is as follows, in Python.

import sherpa_onnx
from pathlib import Path

model_path = Path("/root/sherpa-onnx-streaming-zipformer-en-2023-06-26")
recognizer = sherpa_onnx.OnlineRecognizer(
    tokens=str(model_path / "tokens.txt"),
    encoder=str(model_path / "encoder-epoch-99-avg-1-chunk-16-left-64.onnx"),
    decoder=str(model_path / "decoder-epoch-99-avg-1-chunk-16-left-64.onnx"),
    joiner=str(model_path / "joiner-epoch-99-avg-1-chunk-16-left-64.onnx"),
    num_threads=1,
    enable_endpoint_detection=False,
    rule1_min_trailing_silence=2.4,
    rule2_min_trailing_silence=1.2,
    rule3_min_utterance_length=30,
    decoding_method="modified_beam_search",
    max_active_paths=4,
    provider="cpu",
)

This sample code takes about 4-5 seconds to initialize the model.
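For reference, the 4-5 second figure comes from simply timing the constructor call; a minimal sketch is below (only the timing wrapper is new, the arguments are the same as in the repro above):

import time
import sherpa_onnx
from pathlib import Path

model_path = Path("/root/sherpa-onnx-streaming-zipformer-en-2023-06-26")

start = time.perf_counter()
recognizer = sherpa_onnx.OnlineRecognizer(
    tokens=str(model_path / "tokens.txt"),
    encoder=str(model_path / "encoder-epoch-99-avg-1-chunk-16-left-64.onnx"),
    decoder=str(model_path / "decoder-epoch-99-avg-1-chunk-16-left-64.onnx"),
    joiner=str(model_path / "joiner-epoch-99-avg-1-chunk-16-left-64.onnx"),
    num_threads=1,
    decoding_method="modified_beam_search",
)
print(f"Initialization took {time.perf_counter() - start:.2f} s")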

The model is csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26.

Thanks!

jingzhaoou commented 1 year ago

I noticed similar delays during model initialization. sherpa-onnx itself does very little processing during warm-up, so this is probably just an ONNX Runtime thing, in my opinion. There may be graph optimizations, mapping of operators to different execution providers, etc. that take time. When I added the TensorRT execution provider to a session, model initialization took much, much longer.
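To see how much of this is ONNX Runtime's own session setup, one can time bare InferenceSession creation at different graph optimization levels (a sketch using the standard onnxruntime Python API; the encoder path is the one from the repro above):

import time
import onnxruntime as ort

model = "/root/sherpa-onnx-streaming-zipformer-en-2023-06-26/encoder-epoch-99-avg-1-chunk-16-left-64.onnx"

for level in (ort.GraphOptimizationLevel.ORT_DISABLE_ALL,
              ort.GraphOptimizationLevel.ORT_ENABLE_ALL):
    opts = ort.SessionOptions()
    opts.graph_optimization_level = level
    start = time.perf_counter()
    # Session creation is where graph optimization and provider assignment happen
    ort.InferenceSession(model, sess_options=opts, providers=["CPUExecutionProvider"])
    print(level, f"{time.perf_counter() - start:.2f} s")

If most of the time disappears with optimizations disabled, the cost is in ONNX Runtime's graph optimization rather than in sherpa-onnx itself.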

w11wo commented 1 year ago

Hi @jingzhaoou, thanks for the info. I agree that it might be just an ONNX thing and not a sherpa-onnx thing.

I noticed that initialization of the int8 quantized model is slightly faster, so it's probably related to the model graph and its operators. I'm still trying to find out whether there's a way to speed it up. Cheers!

csukuangfj commented 1 year ago

There is an easy way to reduce the initialization time by half.

I just fixed the initialization time for non-streaming models in https://github.com/k2-fsa/sherpa-onnx/pull/213. Something similar can be done for streaming models.

You can change https://github.com/k2-fsa/sherpa-onnx/blob/fe0630fe1fba9ff8a7fe72cc7553596688d5d79b/sherpa-onnx/csrc/online-transducer-model.cc#L82-L86

If we pass the model type from the command line instead of reading it from the model file, we can halve the model loading time.
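As an illustration only, here is a hypothetical sketch of how that could look from the Python wrapper once such an option is exposed; the model_type keyword below is an assumption, not the current constructor signature:

import sherpa_onnx
from pathlib import Path

model_path = Path("/root/sherpa-onnx-streaming-zipformer-en-2023-06-26")
recognizer = sherpa_onnx.OnlineRecognizer(
    tokens=str(model_path / "tokens.txt"),
    encoder=str(model_path / "encoder-epoch-99-avg-1-chunk-16-left-64.onnx"),
    decoder=str(model_path / "decoder-epoch-99-avg-1-chunk-16-left-64.onnx"),
    joiner=str(model_path / "joiner-epoch-99-avg-1-chunk-16-left-64.onnx"),
    num_threads=1,
    decoding_method="modified_beam_search",
    # Assumed keyword: tells the loader the architecture up front so it does
    # not have to probe the model files to figure it out.
    model_type="zipformer",
)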

Contributions from the community to fix this are appreciated.

w11wo commented 1 year ago

I was able to convert the models to ORT format, which you can find at w11wo/sherpa-onnx-ort-streaming-zipformer-en-2023-06-26. After conversion, the ORT fp32 model took 1.6 s and the int8 model took only 0.6 s to initialize.

These initialization times are much closer to those of sherpa-ncnn. And surprisingly, the ORT models worked out of the box with the sherpa-onnx frontends in Python and iOS!

csukuangfj commented 1 year ago

@w11wo

That is great!

Could you describe how you convert it?

w11wo commented 1 year ago

@csukuangfj I have documented it in the README of the repo, but it's essentially just a one-liner that looks like this:

python -m onnxruntime.tools.convert_onnx_models_to_ort --optimization_style=Fixed {model_path}.onnx

This tool comes with the installation of onnxruntime via pip.

I opted to set the optimization_style to Fixed instead of Runtime, as the latter is intended mainly for NNAPI/CoreML support. Since the models are seemingly not yet fully supported by CoreML (as discussed here), I went with the former.
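For completeness, a sketch of loading the converted files with the same Python frontend (assuming the converter wrote .ort files next to the originals, which is its default behavior, and that the filenames mirror the .onnx ones; the directory name here follows the converted repo above):

import sherpa_onnx
from pathlib import Path

model_path = Path("/root/sherpa-onnx-ort-streaming-zipformer-en-2023-06-26")
recognizer = sherpa_onnx.OnlineRecognizer(
    tokens=str(model_path / "tokens.txt"),
    encoder=str(model_path / "encoder-epoch-99-avg-1-chunk-16-left-64.ort"),
    decoder=str(model_path / "decoder-epoch-99-avg-1-chunk-16-left-64.ort"),
    joiner=str(model_path / "joiner-epoch-99-avg-1-chunk-16-left-64.ort"),
    num_threads=1,
    decoding_method="modified_beam_search",
)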

There are a handful of things I want to test out with this. I will start by creating my own custom iOS lib, which I'll track here. Cheers!