k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

[Feature] Handling onnxrt execution provider config for various models #1098

Closed · manickavela29 closed this issue 2 months ago

manickavela29 commented 2 months ago

Hi @csukuangfj,

With https://github.com/k2-fsa/sherpa-onnx/pull/992, configs for the backends are handled as arguments, and that part is done.

But there is an additional issue with arguments and models: the suggested configs are not specific to individual models, e.g. in the case of zipformer.

Solution: for the zipformer case, add encoder_config, decoder_config and joiner_config to https://github.com/k2-fsa/sherpa-onnx/blob/3e4307e2fb88d4b1b648211c14f2fff6db11bca4/sherpa-onnx/csrc/online-transducer-model-config.h#L14-L17 and a new overloaded function for providers and sessions. All the config values would be hard-coded for a specific model rather than passed as arguments when starting sherpa.

But if you have any better suggestions, let me know.

csukuangfj commented 2 months ago

Adding encoder_config, decoder_config and joiner_config

That would introduce too many command-line arguments.


How about creating separate sess_opts_ for the decoder and the joiner and hard-coding the config values if tensorrt is used?
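The suggestion above could be sketched roughly as follows. This is a minimal illustration using plain Python dictionaries to stand in for onnxruntime session options; the function name `make_session_config` and the split between user-supplied and hard-coded values are illustrative assumptions, not the actual sherpa-onnx API, though `trt_max_workspace_size` and `trt_fp16_enable` are real TensorRT EP option names.

```python
# Sketch: build per-component session configs so the decoder and joiner
# get hard-coded values instead of user-supplied command-line arguments.
# All function/field names here are illustrative, not the sherpa-onnx API.

def make_session_config(component: str, user_trt_opts: dict) -> dict:
    """Return execution-provider options for one transducer component.

    Only the encoder honours the user-supplied TensorRT options; the
    decoder and joiner use fixed, known-good defaults.
    """
    if component == "encoder":
        # The encoder is the heavy model, so its TensorRT settings
        # come from the CLI arguments added in PR #992.
        return {"provider": "TensorrtExecutionProvider", **user_trt_opts}
    # Hard-coded for the small decoder/joiner models.
    return {
        "provider": "TensorrtExecutionProvider",
        "trt_max_workspace_size": 1 << 28,  # small fixed workspace
        "trt_fp16_enable": True,
    }

# Example: one user-provided config, three per-component session configs.
user_opts = {"trt_max_workspace_size": 1 << 31, "trt_fp16_enable": True}
encoder_cfg = make_session_config("encoder", user_opts)
decoder_cfg = make_session_config("decoder", user_opts)
joiner_cfg = make_session_config("joiner", user_opts)
```

The point of the split is that only one set of options needs to be exposed on the command line, while the decoder and joiner still get their own session options internally.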

manickavela29 commented 2 months ago

Yes, I meant something along similar lines.

By adding separate configs for the encoder, decoder and joiner, I actually meant that they would be hard-coded and not exposed as arguments.

And as you suggested, they will have separate sess_opts, which will be built from their hard-coded custom configs.

manickavela29 commented 2 months ago

Actually, given the model size of the decoder and joiner, we can just as well run them with the CUDA EP itself, since the encoder is the only heavy lifter here.
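That trade-off amounts to picking the execution provider per component: TensorRT only for the heavy encoder, plain CUDA for the light decoder and joiner. A small sketch of that selection logic, assuming onnxruntime-style provider priority lists (the provider strings are the real onnxruntime names; the `providers_for` helper is hypothetical):

```python
# Sketch: choose an execution-provider priority list per transducer
# component. Only the encoder goes through TensorRT; the small decoder
# and joiner skip TensorRT engine building and run on the CUDA EP.
# The helper function is illustrative, not part of sherpa-onnx.

def providers_for(component: str) -> list:
    """Return an onnxruntime-style provider priority list for a component."""
    if component == "encoder":
        # TensorRT first, with CUDA and CPU as fallbacks.
        return [
            "TensorrtExecutionProvider",
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ]
    # decoder / joiner: avoid TensorRT engine-build overhead entirely.
    return ["CUDAExecutionProvider", "CPUExecutionProvider"]

encoder_providers = providers_for("encoder")
decoder_providers = providers_for("decoder")
joiner_providers = providers_for("joiner")
```

With real onnxruntime these lists would be passed as the `providers` argument when creating each `InferenceSession`, so the small models never pay TensorRT's engine compilation cost.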