Motivation.
vLLM currently has a usage reporting feature (https://docs.vllm.ai/en/stable/serving/usage_stats.html) that tells us which features can be safely deprecated and which hardware we should prioritize for performance work.
As of v0.5.0, vLLM has several features that are still being tested (chunked prefill, prefix caching, speculative decoding, fp8, and VLM support). We would like to start gathering statistics on how these features are used across different hardware and model types, so we know which combinations we are actually being tested on.
Proposed Change.
Add the following data to usage_lib:
--enable-chunked-prefill
--enable-prefix-caching
speculative_model (we need the draft model architecture/size, or [ngram])

Another value missing from the current data is the size of the model, which makes it hard to compare, for example, Llama 3 8B vs. 70B. This may require some creative way of determining the model size without capturing too much information. A rough sketch of how these fields could be reported follows below.
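As a rough illustration only (this is not vLLM's actual usage_lib API; the helper names, bucket boundaries, and field names below are hypothetical), the proposed fields could be collected into the extra key-value dict that the usage reporter already sends, with the exact parameter count coarsened into buckets so the report can distinguish an 8B from a 70B model without capturing anything finer:

```python
# Hypothetical sketch of the proposed extra usage fields; not vLLM's real API.
from typing import Any, Dict, Optional


def bucket_model_size(num_parameters: int) -> str:
    """Map an exact parameter count to a coarse bucket (assumed boundaries)."""
    buckets = [
        (1e9, "<1B"),
        (5e9, "1B-5B"),
        (10e9, "5B-10B"),
        (40e9, "10B-40B"),
        (100e9, "40B-100B"),
    ]
    for upper, label in buckets:
        if num_parameters < upper:
            return label
    return ">=100B"


def collect_feature_usage(
    enable_chunked_prefill: bool,
    enable_prefix_caching: bool,
    speculative_model: Optional[str],
    num_parameters: int,
) -> Dict[str, Any]:
    """Build the extra key-values that would be attached to the usage report."""
    return {
        "enable_chunked_prefill": enable_chunked_prefill,
        "enable_prefix_caching": enable_prefix_caching,
        # Report only whether spec decode is enabled and whether it is
        # ngram-based, not the exact draft model name.
        "speculative_decoding": speculative_model is not None,
        "speculative_method": (
            "ngram" if speculative_model == "[ngram]" else
            "draft_model" if speculative_model else None
        ),
        "model_size_bucket": bucket_model_size(num_parameters),
    }


if __name__ == "__main__":
    # Example: a Llama-3-8B-style server with chunked prefill and ngram spec decode.
    extra_kvs = collect_feature_usage(
        enable_chunked_prefill=True,
        enable_prefix_caching=False,
        speculative_model="[ngram]",
        num_parameters=8_030_000_000,
    )
    print(extra_kvs)  # would be merged into the existing usage report payload
```

Bucketing keeps the report coarse enough to avoid capturing too much information about the deployment while still letting us split usage statistics by model scale.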
Any other suggestions are welcome.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response