k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

RTFX and Latency numbers for streaming pruned transducer stateless X #306

Open raikarsagar opened 1 year ago

raikarsagar commented 1 year ago

Hi, do we have standard RTFX and latency numbers for the streaming and non-streaming pruned transducer stateless X models? I am configuring Triton perf benchmarking. Let me know if there are any specific steps to follow for benchmarking apart from the perf benchmarker.

csukuangfj commented 1 year ago

@yuekaizhang Could you please have a look?

yuekaizhang commented 1 year ago

Hi, you can use sherpa/triton/client/decode_manifest.py to decode a whole dataset. This is a reference for benchmarking on a Chinese dataset: https://k2-fsa.github.io/sherpa/triton/client/index.html#decode-manifests.

Once the server is launched, you can use the pre-built Docker image soar97/triton-k2:22.12.1 for the client. You also need to prepare the dataset yourself. This is a reference: https://colab.research.google.com/drive/1JX5Ph2onYm1ZjNP_94eGqZ-DIRMLlIca?usp=sharing.
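
In case it helps with the original RTFX question: once decode_manifest.py (or any client) has run over the whole dataset, RTF/RTFX can be derived from the total audio duration and the wall-clock decoding time. Here is a minimal sketch, assuming a lhotse cuts manifest is available; the manifest path is just a placeholder:

```python
# Rough sketch of RTF / RTFX bookkeeping around a full-dataset decode.
# The manifest path is a placeholder; adjust it to your own cut set.
import time

from lhotse import load_manifest

cuts = load_manifest("data/fbank/cuts_test.jsonl.gz")   # placeholder path
total_audio_s = sum(c.duration for c in cuts)            # seconds of speech in the test set

start = time.time()
# ... run the actual decode here, e.g. by sending every cut to the Triton server ...
elapsed_s = time.time() - start

rtf = elapsed_s / total_audio_s    # lower is better
rtfx = total_audio_s / elapsed_s   # higher is better; RTFX is simply 1 / RTF
print(f"decoded {total_audio_s:.1f} s of audio in {elapsed_s:.1f} s "
      f"-> RTF {rtf:.3f}, RTFX {rtfx:.1f}")
```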

uni-sagar-raikar commented 1 year ago

@yuekaizhang @csukuangfj Hi, I was able to set up the Triton server with the streaming zipformer model successfully. But there seems to be a disconnect between the RTF numbers we are able to achieve using client.py with a custom cutset vs. the perf_analyzer throughput numbers we are seeing. Here are the initial throughput, RTF, and latency numbers we are able to achieve:

I have some questions:

  1. What is the difference between Throughput in perf_analyzer vs. RTF in client.py? Can we compare the perf_analyzer throughput with RTFX?
  2. If throughput and RTFX are comparable, then why are we seeing such a difference across the two setups? Am I missing some setting here, or does the client have to be modified in some way?

Thanks in advance

yuekaizhang commented 1 year ago
  1. Throughput is not RTFx. The throughput computation is a little bit complex (see the sketch after this list).
  2. Difference between perf_analyzer and https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py: perf_analyzer --streaming uses a single wav file, whereas client.py can use a whole dataset. Also, with the --simulate-streaming option in client.py, you can send audio chunks paced according to the chunk time.
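
To illustrate why the two numbers are not directly comparable (point 1), here is a back-of-the-envelope sketch; the figures are made-up placeholders, and it assumes every inference carries a fixed chunk of audio, which is a simplification of how perf_analyzer actually accounts for streaming requests:

```python
# Made-up numbers, only to show the units involved.
infer_per_sec = 120.0        # perf_analyzer "Throughput": inferences per second
audio_sec_per_infer = 0.64   # audio carried by one request, e.g. one 0.64 s chunk

# Only under the simplifying assumption that each inference processes a fixed
# amount of audio can throughput be converted into an RTFX-like number;
# real streaming accounting (state, padding, dynamic batching) is more involved.
approx_rtfx = infer_per_sec * audio_sec_per_infer
print(f"~{approx_rtfx:.1f}x real time, assuming {audio_sec_per_infer} s of audio per inference")
```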

For Triton ASR benchmarking, I strongly recommend you use https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py.

perf_analyzer is useful if you are interested in the detailed per-module costs (e.g. encoder, decoder, queue time, infer time).

Would you mind sharing the stats_summary.txt generated by client.py here also?

uni-sagar-raikar commented 1 year ago

Understood, I am sharing the stats summary here. This summary is for a run with num_workers set to 100. Looking at it, it seems like the initial inferences are taking more time. Would you recommend an explicit warmup? stats_summary.txt

yuekaizhang commented 1 year ago

The Triton config has a warmup option. For benchmarking, you may discard the initial results and use the stable execution runs.
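
If you also want an explicit client-side warmup in addition to Triton's config-level option, a minimal sketch is below; the model name, input names, and shapes are assumptions for illustration only, so check your model repository's config.pbtxt for the real ones:

```python
# Send a few throw-away requests before benchmarking so the measured runs hit
# an already-warm server. Model/input names below are assumed, not the real ones.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

dummy_audio = np.zeros((1, 16000), dtype=np.float32)  # 1 s of silence at 16 kHz
wav = grpcclient.InferInput("WAV", list(dummy_audio.shape), "FP32")   # assumed input name
wav.set_data_from_numpy(dummy_audio)
wav_lens = grpcclient.InferInput("WAV_LENS", [1, 1], "INT32")         # assumed input name
wav_lens.set_data_from_numpy(np.array([[16000]], dtype=np.int32))

for _ in range(5):  # a handful of dry runs; discard these results
    client.infer(model_name="transducer", inputs=[wav, wav_lens])     # assumed model name
```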

uni-sagar-raikar commented 1 year ago

Does warmup have to be configured in the model repo dir, or in a specific config.pbtxt? It would be great if you could point me to it. Also, in the stats file I see batch sizes varying from 1 to N; is this profiling for all the various batch sizes, or for each inference run?

yuekaizhang commented 1 year ago

For warmup https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#model-warmup

For stats.json https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md

For stats_summary.txt, I just converted it from stats.json.

e.g. "batch_size 19, 18 times, infer 7875.14 ms, avg 437.51 ms, 23.03 ms input 47.86 ms, avg 2.66 ms, output 35.48 ms, avg 1.97 ms "

Since the service was started, a total of 18 executions were conducted with batch_size 19. These 18 executions took 7875.14 ms to finish. The first avg is 7875.14/18 (per execution) and the second is 7875.14/18/19 (per item in the batch); input and output are the host-to-device and device-to-host times.
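
To make the arithmetic concrete, a quick check of the averages in the quoted line:

```python
# Reproducing the averages from the quoted stats_summary.txt line.
total_infer_ms = 7875.14   # total inference time spent at this batch size
num_exec = 18              # number of executions with batch_size 19
batch_size = 19

avg_per_exec_ms = total_infer_ms / num_exec                 # ~437.51 ms per execution
avg_per_item_ms = total_infer_ms / num_exec / batch_size    # ~23.03 ms per item in the batch
print(f"{avg_per_exec_ms:.2f} ms per execution, {avg_per_item_ms:.2f} ms per item")
```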

uni-sagar-raikar commented 1 year ago

@yuekaizhang Do you have any standard benchmark tests that were done for the conformer-transducer/zipformer-transducer models? I am not seeing any improvement with warmup or other configurations. In fact, with sherpa the non-Triton setup is working far better than the Triton one.

yuekaizhang commented 1 year ago

We have not started benchmarking and profiling yet. How did you configure your warmup setting? Also, we will later support a TensorRT backend, which should take less time to warm up compared with ONNX.

uni-sagar-raikar commented 1 year ago

@yuekaizhang Since the streaming zipformer model is sequential, I just warmed it up with some dry runs. Also, I wanted to ask about the logs in the GitHub client repo, where the RTF numbers look good. May I know which model that is? We would like to reproduce those results on a GPU instance.

yuekaizhang commented 1 year ago

It's from this model_repo: https://huggingface.co/yuekai/model_repo_streaming_conformer_wenetspeech_icefall/tree/main. To reproduce, you may try:

```
git-lfs install
git clone ...
```

However, it used the aishell1 test set, which is Chinese.