triton-inference-server / client

Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.
BSD 3-Clause "New" or "Revised" License

tensorrtllm and vllm backend results are different using genai-perf #779

Open upskyy opened 2 months ago

upskyy commented 2 months ago

Thank you for releasing a great project.

I ran genai-perf against the rtzr/ko-gemma-2-9b-it model (a gemma-2-9b-it fine-tune) served with both the tritonserver vllm backend and the tritonserver tensorrt_llm backend. However, the two backends report different Output sequence length metrics, which in turn makes the Output token throughput (per sec) differ.

Since --output-tokens-mean was set to 100, vllm reports 100 as expected, but tensorrtllm appears to report the input sequence length plus 100.
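The numbers below bear this out: with `--synthetic-input-tokens-mean 200` and `--output-tokens-mean 100`, the TensorRT-LLM runs report an average output sequence length of roughly 300 (input + output), while the vLLM runs report roughly 100. A quick sanity check using the avg values from the concurrency-1 tables below:

```python
# Averages reported by genai-perf for the two backends (taken from the
# concurrency-1 tables below).
input_tokens_mean = 200    # --synthetic-input-tokens-mean
output_tokens_mean = 100   # --output-tokens-mean

vllm_osl = 100.00    # avg "Output sequence length", vllm backend
trtllm_osl = 299.50  # avg "Output sequence length", tensorrt_llm backend

# vLLM matches the requested output length directly...
assert abs(vllm_osl - output_tokens_mean) < 5
# ...while TensorRT-LLM matches input + output, i.e. the input tokens
# appear to be echoed back as part of the output sequence.
assert abs(trtllm_osl - (input_tokens_mean + output_tokens_mean)) < 5
```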

I ran genai-perf inside the nvcr.io/nvidia/tritonserver:24.07-py3-sdk Docker image.

Please let me know if I misconfigured something or if anything needs to be corrected. The script and the results are attached below.

- **tensorrtllm**

concurrency: 1

```
                                 LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Statistic              ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ Request latency (ms)   │ 1,606.41 │ 1,593.64 │ 1,617.31 │ 1,617.19 │ 1,616.06 │ 1,610.55 │
│ Output sequence length │   299.50 │   298.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│ Input sequence length  │   199.75 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 186.43
Request throughput (per sec): 0.62
2024-09-04 09:48 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export_genai_perf.json
2024-09-04 09:48 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export_genai_perf.csv
```

concurrency: 4

```
                                 LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Statistic              ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ Request latency (ms)   │ 1,781.95 │ 1,740.25 │ 2,142.17 │ 2,103.83 │ 1,777.44 │ 1,765.17 │
│ Output sequence length │   299.77 │   298.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│ Input sequence length  │   199.84 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 649.28
Request throughput (per sec): 2.17
2024-09-04 09:51 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency4/profile_export_genai_perf.json
2024-09-04 09:51 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency4/profile_export_genai_perf.csv
```

concurrency: 8

```
                                 LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Statistic              ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ Request latency (ms)   │ 2,091.10 │ 1,970.12 │ 2,943.90 │ 2,881.30 │ 2,313.61 │ 2,029.94 │
│ Output sequence length │   299.64 │   297.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│ Input sequence length  │   199.90 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 1054.81
Request throughput (per sec): 3.52
2024-09-04 09:53 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency8/profile_export_genai_perf.json
2024-09-04 09:53 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency8/profile_export_genai_perf.csv
```


- **vllm**

```shell
genai-perf \
  -m rtzr_gemma2 \
  --service-kind triton \
  --backend vllm \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --tokenizer rtzr/ko-gemma-2-9b-it \
  --concurrency 1 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url localhost:18001
```

concurrency: 1

```
                                 LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Statistic              ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ Request latency (ms)   │ 3,792.74 │ 3,781.30 │ 3,812.85 │ 3,812.27 │ 3,807.09 │ 3,798.46 │
│ Output sequence length │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│ Input sequence length  │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 26.37
Request throughput (per sec): 0.26
2024-09-05 04:01 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency1/profile_export_genai_perf.json
2024-09-05 04:01 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency1/profile_export_genai_perf.csv
```

concurrency: 4

```
                                 LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Statistic              ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ Request latency (ms)   │ 3,996.60 │ 3,990.91 │ 4,007.69 │ 4,007.69 │ 4,007.66 │ 4,007.18 │
│ Output sequence length │    99.67 │    96.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│ Input sequence length  │   199.75 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 99.75
Request throughput (per sec): 1.00
2024-09-05 04:02 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency4/profile_export_genai_perf.json
2024-09-05 04:02 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency4/profile_export_genai_perf.csv
```

concurrency: 8

```
                                 LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Statistic              ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ Request latency (ms)   │ 4,125.69 │ 4,090.61 │ 4,192.69 │ 4,192.68 │ 4,192.45 │ 4,191.99 │
│ Output sequence length │    99.92 │    98.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│ Input sequence length  │   199.88 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 193.71
Request throughput (per sec): 1.94
2024-09-05 04:04 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency8/profile_export_genai_perf.json
2024-09-05 04:04 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency8/profile_export_genai_perf.csv
```

dyastremsky commented 2 weeks ago

Apologies for the delayed response. For TensorRT-LLM, you need to set `exclude_output_in_input` to true in the model config so that the input tokens are not echoed in the output.
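A sketch of what that change looks like in the tensorrt_llm model's `config.pbtxt`. Note that the key name below follows the config templates shipped with recent `tensorrtllm_backend` releases, where this parameter appears as `exclude_input_in_output`; verify the exact key against the template for your backend version:

```
# Fragment of the tensorrt_llm model's config.pbtxt (sketch, not a full config).
# Recent tensorrtllm_backend templates name this parameter
# "exclude_input_in_output"; check the template for your version.
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
```

After updating the config and reloading the model (or restarting tritonserver), the reported Output sequence length should drop back to the requested `--output-tokens-mean`.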

There was a limitation in TensorRT-LLM that prevented GenAI-Perf from setting this value automatically. That limitation may have been lifted recently; we have it in our queue to investigate whether GenAI-Perf can now handle this for you.