vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC]: Usage Data Enhancement for v0.5.* #5520

Open simon-mo opened 3 weeks ago

simon-mo commented 3 weeks ago

Motivation.

vLLM currently has a usage reporting feature (https://docs.vllm.ai/en/stable/serving/usage_stats.html) that tells us which features can be safely deprecated and which hardware we should prioritize for performance work.

As of v0.5.0, vLLM has several features under active testing (chunked prefill, prefix caching, speculative decoding, fp8, and VLM support). We would like to start gathering statistics on how these features are used across different hardware and model types, so we know which combinations we are effectively tested on.

Proposed Change.

Add the following data to usage_lib
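As a rough sketch of the kind of key-value pairs this could mean (the field names and config attributes below are illustrative assumptions, not a final schema):

```python
# Sketch only: candidate fields for the usage report. The exact field
# names, and how they are read from the engine configuration, are
# assumptions for illustration rather than the proposed schema.

def collect_feature_usage(engine_config) -> dict:
    """Gather which experimental features are enabled for this run."""
    return {
        # Experimental features introduced around v0.5.0
        "enable_chunked_prefill": bool(engine_config.scheduler_config.chunked_prefill_enabled),
        "enable_prefix_caching": bool(engine_config.cache_config.enable_prefix_caching),
        "enable_spec_decode": engine_config.speculative_config is not None,
        "quantization": engine_config.model_config.quantization,  # e.g. "fp8", "awq", or None
        "kv_cache_dtype": str(engine_config.cache_config.cache_dtype),
    }
```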

Another value missing from the previous data is model size, which makes it hard to compare, for example, Llama 3 8B vs. 70B. This might require some creative way to determine model size without capturing too much information.
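One low-information option would be to report a coarse size bucket rather than an exact parameter count; a minimal sketch, with bucket boundaries chosen arbitrarily for illustration:

```python
# Sketch: map a raw parameter count to a coarse size band so the stat
# stays useful for 8B-vs-70B comparisons without capturing fine-grained
# information. Bucket boundaries here are illustrative assumptions.

def bucket_model_size(num_params: int) -> str:
    """Map a parameter count to a coarse size band."""
    buckets = [
        (2e9, "<2B"),
        (10e9, "2B-10B"),
        (40e9, "10B-40B"),
        (100e9, "40B-100B"),
    ]
    for upper, label in buckets:
        if num_params < upper:
            return label
    return ">=100B"


# num_params could come from summing parameter counts after load,
# e.g. sum(p.numel() for p in model.parameters()).
```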

Any other suggestions are welcome.

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

robertgshaw2-neuralmagic commented 2 weeks ago

I would love to see whether any quantization methods are being used.

robertgshaw2-neuralmagic commented 2 weeks ago

Re: model size

We already capture the size of the model during profiling (GPU memory)

RobertFischer commented 2 weeks ago

Where can I pull that information from?
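For context, a minimal sketch of how a model's GPU memory footprint can be estimated around load time with plain PyTorch; this illustrates the general idea only and is not necessarily where vLLM records it internally:

```python
import torch


def measure_model_gpu_memory(load_model_fn, device: int = 0) -> int:
    """Return the approximate GPU memory (bytes) consumed by loading a model.

    Sketch only: compares free device memory before and after loading,
    similar in spirit to profiling memory before sizing the KV cache.
    """
    torch.cuda.synchronize(device)
    free_before, _total = torch.cuda.mem_get_info(device)
    model = load_model_fn()  # hypothetical callable that builds the model on `device`
    torch.cuda.synchronize(device)
    free_after, _total = torch.cuda.mem_get_info(device)
    return free_before - free_after
```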