vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC]: Usage Data Enhancement for v0.5.* #5520

Open simon-mo opened 3 weeks ago

simon-mo commented 3 weeks ago

Motivation.

vLLM currently has a usage reporting feature (https://docs.vllm.ai/en/stable/serving/usage_stats.html) that tells us which features can be safely deprecated and which hardware we should prioritize for performance work.

As of v0.5.0, vLLM has several features under active testing (chunked prefill, prefix caching, speculative decoding, fp8, and VLM support). We would like to start gathering statistics on how these features are used across different hardware and model types, so we know which combinations we are effectively tested on.

Proposed Change.

Add the following data to usage_lib
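As a rough sketch of the kind of key-value pairs this could mean (the field names and config attributes below are illustrative assumptions, not a final schema):

```python
# Sketch only: candidate fields for the usage report. The exact field
# names, and how they are read from the engine configuration, are
# assumptions for illustration rather than the proposed schema.

def collect_feature_usage(engine_config) -> dict:
    """Gather which experimental features are enabled for this run."""
    return {
        # Experimental features introduced around v0.5.0
        "enable_chunked_prefill": bool(engine_config.scheduler_config.chunked_prefill_enabled),
        "enable_prefix_caching": bool(engine_config.cache_config.enable_prefix_caching),
        "enable_spec_decode": engine_config.speculative_config is not None,
        "quantization": engine_config.model_config.quantization,  # e.g. "fp8", "awq", or None
        "kv_cache_dtype": str(engine_config.cache_config.cache_dtype),
    }
```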

Another value missing from the previous data is model size, which makes it hard to compare, for example, Llama 3 8B vs. 70B. This might require some creative way to determine model size without capturing too much information.
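One low-information option would be to report a coarse size bucket rather than an exact parameter count; a minimal sketch, with bucket boundaries chosen arbitrarily for illustration:

```python
# Sketch: map a raw parameter count to a coarse size band so the stat
# stays useful for 8B-vs-70B comparisons without capturing fine-grained
# information. Bucket boundaries here are illustrative assumptions.

def bucket_model_size(num_params: int) -> str:
    """Map a parameter count to a coarse size band."""
    buckets = [
        (2e9, "<2B"),
        (10e9, "2B-10B"),
        (40e9, "10B-40B"),
        (100e9, "40B-100B"),
    ]
    for upper, label in buckets:
        if num_params < upper:
            return label
    return ">=100B"


# num_params could come from summing parameter counts after load,
# e.g. sum(p.numel() for p in model.parameters()).
```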

Any other suggestions are welcome.

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

robertgshaw2-neuralmagic commented 2 weeks ago

I would love to see whether any quantization methods are being used.

robertgshaw2-neuralmagic commented 2 weeks ago

Re: model size

We already capture the size of the model during profiling (GPU memory)

RobertFischer commented 2 weeks ago

Where can I pull that information from?
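For context, a minimal sketch of how a model's GPU memory footprint can be estimated around load time with plain PyTorch; this illustrates the general idea only and is not necessarily where vLLM records it internally:

```python
import torch


def measure_model_gpu_memory(load_model_fn, device: int = 0) -> int:
    """Return the approximate GPU memory (bytes) consumed by loading a model.

    Sketch only: compares free device memory before and after loading,
    similar in spirit to profiling memory before sizing the KV cache.
    """
    torch.cuda.synchronize(device)
    free_before, _total = torch.cuda.mem_get_info(device)
    model = load_model_fn()  # hypothetical callable that builds the model on `device`
    torch.cuda.synchronize(device)
    free_after, _total = torch.cuda.mem_get_info(device)
    return free_before - free_after
```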