xwentian2020 opened this issue 4 months ago
I ran a test on an A800 GPU and used the command below to collect a profile:

```
python benchmark_latency.py --batch-size 128 --model /ssd/hf_models/llama-2-7b-hf/ --dtype float16 --device cuda --input-len 128 --output-len 512 --profile --profile-result-dir /path-to-vllm/benchmarks/benchmark_latency_results/
```
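For orientation, here is a minimal sketch of how such a Chrome trace is typically collected with torch.profiler; the linear layer and input below are placeholders, not vLLM's actual model, and the exact wiring inside benchmark_latency.py may differ:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input; benchmark_latency.py profiles the real LLM run.
model = torch.nn.Linear(4096, 4096).half().cuda()
x = torch.randn(128, 4096, dtype=torch.float16, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(4):  # a few forward passes stand in for prefill + decode
        model(x)
    torch.cuda.synchronize()

# Export in the Chrome trace format that chrome://tracing / Perfetto can open.
prof.export_chrome_trace("trace.json")
```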
The resulting JSON file was about 0.5 GiB in size. In it, no instances of paged_attention_v1_kernel() or paged_attention_v2_kernel() could be found, while instances of other kernels were present, such as flash_fwd_kernel(), rotary_embedding_kernel(), vllm::reshape_and_cache_flash_kernel(), and so on.
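One way to double-check whether those kernels are really absent is to scan the trace JSON directly instead of relying on a viewer. A small sketch; the file name and the kernel-name substrings are examples:

```python
import json
from collections import Counter

# Example path; point this at the trace produced by --profile.
with open("trace.json") as f:
    trace = json.load(f)

# GPU kernel launches are usually tagged with cat == "kernel" in
# Chrome traces exported by torch.profiler / kineto.
counts = Counter()
for ev in trace.get("traceEvents", []):
    if ev.get("cat", "").lower() == "kernel":
        counts[ev.get("name", "")] += 1

# The substrings searched for are examples; adjust to the kernels of interest.
for needle in ("paged_attention_v1", "paged_attention_v2", "flash_fwd"):
    hits = sum(n for name, n in counts.items() if needle in name)
    print(f"{needle}: {hits} kernel instances")
```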
In the collected kernel information I found 3 + 512 = 515 iterations in total. The 512 iterations appear to be related to decoding: 512 steps, each generating one token. However, I know little about the first three iterations and hope to get hints about their purpose; all I know is that one of them is the prefill pass.
Incidentally, each of the first three iterations launched 363 kernel instances, while each of the later 512 iterations launched 397 kernel instances in total.
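For what it's worth, a rough way to reproduce such per-iteration counts from the trace is to sort the GPU kernel events by timestamp and split them wherever there is a large idle gap between consecutive kernels. This is only a heuristic sketch; the 1 ms gap threshold is an assumption that may need tuning for a given trace:

```python
import json

with open("trace.json") as f:  # example path
    trace = json.load(f)

# Collect GPU kernel events and sort them by start timestamp (microseconds).
kernels = sorted(
    (ev for ev in trace.get("traceEvents", [])
     if ev.get("cat", "").lower() == "kernel" and "ts" in ev),
    key=lambda ev: ev["ts"],
)

GAP_US = 1000  # assumed idle gap (1 ms) separating iterations; tune as needed
iterations, current, prev_end = [], [], None
for ev in kernels:
    if prev_end is not None and ev["ts"] - prev_end > GAP_US and current:
        iterations.append(current)
        current = []
    current.append(ev)
    prev_end = max(prev_end or 0, ev["ts"] + ev.get("dur", 0))
if current:
    iterations.append(current)

print(f"{len(iterations)} iterations found")
for i, it in enumerate(iterations[:5]):
    print(f"iteration {i}: {len(it)} kernel instances")
```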
I do not have much experience with LLMs, but I am interested in these performance questions. Feedback from the developers of this project would be appreciated.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
How would you like to use vllm
I tried to run benchmark_latency.py for profiling and collected a large JSON file (on the order of a GiB). Even after reducing the number of iterations from the default value (30) to a small number, the JSON file was still large. Such a large file is not convenient to analyze in chrome://tracing or similar tools. Could I get some suggestions for decreasing the size of the JSON file? A recommendation for an analysis tool would be welcome as well. Thanks.
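As far as I understand, --profile collects the trace via torch.profiler, so one way to shrink the file is to record only a few representative steps with a profiling schedule rather than the whole run. A minimal sketch with a placeholder model:

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

model = torch.nn.Linear(4096, 4096).half().cuda()  # placeholder model
x = torch.randn(128, 4096, dtype=torch.float16, device="cuda")

def save_trace(prof):
    # Called once the active window below has been recorded.
    prof.export_chrome_trace("small_trace.json")

# Skip 1 step, warm up for 1, then record only 3 steps: the exported trace
# covers 3 iterations instead of the whole run (e.g. all 515 steps).
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=save_trace,
) as prof:
    for _ in range(8):
        model(x)
        torch.cuda.synchronize()
        prof.step()  # advance the profiler schedule once per iteration
```

Whether benchmark_latency.py exposes such a schedule is something the maintainers would have to confirm; if not, the profiler call could be patched locally. As a viewer, Perfetto (https://ui.perfetto.dev) generally handles large traces better than chrome://tracing does.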