
[RFC]: Postmerge performance suite #4926

Open · simon-mo opened this issue 1 month ago

simon-mo commented 1 month ago

Motivation.

We want to start tracking vLLM's performance numbers on more realistic workloads. Thanks to our sponsors (#4925), we are getting a pool of hardware resources ready to run the tests on.

The goals of this test suite are to

  1. Track regressions
  2. Track our progress on optimizations

Proposed Change.

We will start by running the following benchmarks:

We will run with the following parameters:

We will run with the following tests:

We will also compare with TGI and TRT-LLM.
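
For reference, a minimal throughput measurement with vLLM's offline API could look like the sketch below (the model name, prompt set, and request count are placeholder assumptions, not the finalized benchmark):

```python
import time

from vllm import LLM, SamplingParams

# Placeholder model and prompts; the real suite will use the
# benchmark datasets and models listed above.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.0, max_tokens=128)
prompts = [f"Summarize item {i} in one sentence." for i in range(100)]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(prompts) / elapsed:.2f} req/s, {gen_tokens / elapsed:.1f} gen tok/s")
```

The actual suite would more likely drive an OpenAI-compatible server (e.g. via benchmarks/benchmark_serving.py) so that serving overheads the offline API hides are also measured.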

Feedback Period.

Step 1: Ensure hardware availability
Step 2: Set up a pipeline for Llama 8B on H100 as a proof of concept
Step 3: Monitor the results and build a dashboard
Step 4: Scale to other tests as resources come online

CC List.

No response

Any Other Things.

Suggestions welcome.

simon-mo commented 1 month ago

Other notes:

AaronFriel commented 1 month ago

For prefix caching, I think few-shot evals where each eval shares the same prompt but the Q&A pairs are distinct would be appropriate. Evaluating requests/second or generated tokens/second will give users an idea of how quickly an agent guided by a static prompt will run.

I haven't dug into all of the eval frameworks, but I understand many of them are few-shot. MMLU, for example, I think uses the same 5 few-shot examples for every question.

MMLUs/second would be an amusing (if hard to use) measurement.
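
A minimal sketch of such a prefix-caching benchmark, assuming a synthetic five-shot prefix and placeholder questions:

```python
from vllm import LLM, SamplingParams

# Hypothetical static few-shot prefix shared by all requests,
# with a distinct Q&A suffix per request.
FEW_SHOT_PREFIX = "\n\n".join(
    f"Q: example question {i}\nA: example answer {i}" for i in range(5)
)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,  # reuse the KV cache of the shared prefix
)
prompts = [f"{FEW_SHOT_PREFIX}\n\nQ: new question {i}\nA:" for i in range(100)]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=32))
```

Running the same workload with enable_prefix_caching=False would isolate how much of the throughput comes from prefix reuse.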

zifeitong commented 1 month ago

Benchmark serving with 1000 prompts (ShareGPT)

It would be great to have one more dataset skewed towards larger prompts / prefilling, e.g. RAG.
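
A prefill-heavy, RAG-like workload along those lines could be approximated as in this sketch (context length, question set, and model are placeholder assumptions):

```python
from vllm import LLM, SamplingParams

# Each prompt carries a long "retrieved" context and asks for a short
# answer, so prefill dominates the run.
context = "lorem ipsum dolor sit amet " * 400  # stand-in for retrieved documents
prompts = [
    f"Context:\n{context}\nQuestion: what does item {i} say?\nAnswer:"
    for i in range(32)
]

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
# Short generations keep the benchmark skewed toward prefill cost.
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=16))
```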

We will run with the following parameters:

  • chunked prefill enabled
  • fp8

Add a 4-bit quantization method (Marlin, GPTQ)?
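
One way to sweep these parameters is through engine arguments; the flag names below exist in recent vLLM releases, but which combinations to test is exactly the open question in this thread:

```python
from vllm import LLM

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder

configs = {
    "chunked_prefill": dict(enable_chunked_prefill=True),
    "fp8_kv_cache": dict(kv_cache_dtype="fp8"),  # "fp8" could also mean quantization="fp8" weights
    "gptq_4bit": dict(quantization="gptq"),  # requires a GPTQ-quantized checkpoint, not MODEL as-is
}

for name, kwargs in configs.items():
    # In practice, run one engine at a time (each instance claims GPU memory).
    llm = LLM(model=MODEL, **kwargs)
    # ... run the shared benchmark workload and record metrics under `name` ...
```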

simon-mo commented 1 month ago

@youkaichao mentioned we should also test the multiprocessing backend variant.
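
A sketch of how that comparison could look, assuming a 2-GPU tensor-parallel setup (the distributed_executor_backend engine argument accepts "mp" and "ray" in recent releases):

```python
from vllm import LLM

for backend in ("mp", "ray"):
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
        tensor_parallel_size=2,
        distributed_executor_backend=backend,  # multiprocessing vs. Ray workers
    )
    # ... run the shared benchmark workload and compare throughput/latency ...
```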