raikarsagar opened 1 year ago
@yuekaizhang Could you please have a look?
Hi, you could use sherpa/triton/client/decode_manifest.py to decode a whole dataset. This is a reference for benchmarking a Chinese dataset: https://k2-fsa.github.io/sherpa/triton/client/index.html#decode-manifests.
Once the server is launched, you could use the pre-built docker image soar97/triton-k2:22.12.1 for the client. You also need to prepare the dataset yourself. This is a reference: https://colab.research.google.com/drive/1JX5Ph2onYm1ZjNP_94eGqZ-DIRMLlIca?usp=sharing.
@yuekaizhang @csukuangfj Hi, I was able to set up the Triton server with the zipformer streaming model successfully. But there seems to be a disconnect between the RTF numbers we achieve using client.py with a custom cutset vs. the throughput numbers we see from perf_analyzer. Here are the initial throughput, RTF, and latency numbers we are able to achieve:
Triton - perf-analyzer:
Using client.py:
RTF: 0.0082 -> RTFX=121.95
total_duration: 70140.002 seconds (19.48 hours)
processing time: 574.573 seconds (0.16 hours)
latency_variance: 55.60
latency_50_percentile: 1058.63
latency_90_percentile: 1144.97
latency_99_percentile: 1280.65
average_latency_ms: 1029.45
NOTE: num_tasks was set to 200, which is comparable to concurrency 200 in the above perf_analyzer test.
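For reference, the RTF and RTFX above can be recomputed from the reported durations. This is just an arithmetic sanity check using the numbers copied from this thread (note the thread's RTFX=121.95 comes from inverting the rounded RTF, 1/0.0082):

```python
# Recompute RTF/RTFX from the client.py totals reported above
# (values copied from this thread; purely an arithmetic check).
total_audio_s = 70140.002   # total duration of decoded audio (19.48 h)
processing_s = 574.573      # wall-clock processing time (0.16 h)

rtf = processing_s / total_audio_s   # real-time factor: compute time per second of audio
rtfx = 1.0 / rtf                     # inverse: seconds of audio decoded per second of compute

print(f"RTF  = {rtf:.4f}")
print(f"RTFX = {rtfx:.2f}")
```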
I have some questions:
Thanks in advance
For Triton ASR benchmarking, I strongly recommend using https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py.
Perf_analyzer is useful if you are interested in detailed per-module costs (e.g. encoder, decoder, queue time, infer time).
Would you mind sharing the stats_summary.txt generated by client.py here also?
Understood, I am sharing the stats_summary here. This summary is for a run with num_workers set to 100. Looking at it, it seems the initial inferences take more time. Would you recommend explicit warmup? stats_summary.txt
Triton config has a warmup option. For benchmarking, you may discard the initial results and use the stable execution runs.
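For reference, Triton's warmup is configured per model in its config.pbtxt via the `model_warmup` field. A minimal sketch, assuming placeholder input name, dtype, and dims (replace them with your model's actual inputs):

```
# Hypothetical warmup block for a config.pbtxt; input name/dims are placeholders.
model_warmup [
  {
    name: "zero_input_warmup"   # arbitrary label
    batch_size: 1
    count: 3                    # run the sample 3 times at startup
    inputs: {
      key: "x"                  # placeholder input name
      value: {
        data_type: TYPE_FP32
        dims: [ 100, 80 ]       # placeholder dims (e.g. frames x features)
        zero_data: true         # send all-zero tensors
      }
    }
  }
]
```

With this in place the backend executes the warmup samples before the model is marked ready, so the first real requests no longer pay the initialization cost.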
Does warmup have to be configured in the model repo dir, or in a specific config.pbtxt? It would be great if you could point me to it. Also, in the stats file I see batch sizes varying from 1 to N. Is this profiling across the various batch sizes, or per inference run?
For stats.json, see https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md
For stats_summary.txt, I just converted it from stats.json.
e.g. "batch_size 19, 18 times, infer 7875.14 ms, avg 437.51 ms, 23.03 ms input 47.86 ms, avg 2.66 ms, output 35.48 ms, avg 1.97 ms "
Since the service started, a total of 18 executions were conducted with batch_size 19; those 18 executions took 7875.14 ms to finish. The first avg is 7875.14/18 (per execution), the second is 7875.14/18/19 (per utterance). Input and output are the host-to-device and device-to-host transfer times.
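To make the line format concrete, here is a small sketch (assuming the exact line layout quoted above) that parses such a summary line and re-derives the two averages:

```python
import re

# Parse one stats_summary.txt line of the form quoted above and
# re-derive the per-execution and per-utterance averages.
line = ("batch_size 19, 18 times, infer 7875.14 ms, avg 437.51 ms, "
        "23.03 ms input 47.86 ms, avg 2.66 ms, output 35.48 ms, avg 1.97 ms")

m = re.match(r"batch_size (\d+), (\d+) times, infer ([\d.]+) ms", line)
batch_size, times, infer_ms = int(m.group(1)), int(m.group(2)), float(m.group(3))

avg_per_exec = infer_ms / times               # total infer time / number of executions
avg_per_utt = infer_ms / times / batch_size   # further divided by the batch size

print(f"avg/exec: {avg_per_exec:.2f} ms, avg/utt: {avg_per_utt:.2f} ms")
```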
@yuekaizhang Do you have any standard benchmark tests that were done for conformer-transducer/zipformer-transducer models? I am not seeing any improvement with warmup or other configurations. In fact, with sherpa the non-Triton setup is working far better than Triton.
We have not started benchmarking and profiling yet. How did you configure your warmup setting? Also, later we will support the TensorRT backend, which should need less time to warm up compared with ONNX.
@yuekaizhang Since the zipformer streaming model is sequential, I just warmed it up with some dry runs. Also, I wanted to ask about the logs in the GitHub client repo, where the RTF numbers look good. May I know which model that is? We would like to reproduce those results on a GPU instance.
It's from this model_repo: https://huggingface.co/yuekai/model_repo_streaming_conformer_wenetspeech_icefall/tree/main. To reproduce, you may try

```
git-lfs install
git clone ...
```

However, it used the aishell1 test set, which is Chinese.
Hi, do we have standard RTFX & latency numbers for streaming & non-streaming pruned transducer stateless X models? I am configuring Triton perf benchmarking. Let me know if any specific steps should be followed for benchmarking apart from perf_analyzer.