Other notes:
For prefix caching, I think few-shot evals where every request shares the same prompt prefix but the Q&A pairs are distinct would be appropriate. Measuring requests/second or generated tokens/second on such a workload will give users an idea of how quickly an agent guided by a static prompt will run.
I haven't dug into all of the eval frameworks, but I understand many of them are few-shot. MMLU, for example, I believe uses the same 5 examples for every question.
MMLUs/second would be an amusing (if hard to use) measurement.
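As a concrete illustration, here is a minimal sketch of such a shared-prefix workload using vLLM's offline API. It assumes the `enable_prefix_caching` engine flag; the model name and Q&A pairs are placeholder stand-ins for a real few-shot eval.

```python
# Sketch: a shared-prefix few-shot workload for exercising prefix caching.
# Model name and Q&A pairs are placeholders, not a real eval set.
import time
from vllm import LLM, SamplingParams

# Static few-shot prefix shared by every request.
FEW_SHOT_PREFIX = (
    "Answer the question.\n"
    "Q: What is the capital of France?\nA: Paris.\n"
    "Q: What is 2 + 2?\nA: 4.\n"
)
# Distinct Q&A pairs per request.
questions = [f"Q: What is {i} + {i}?\nA:" for i in range(100)]

llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=16)

start = time.perf_counter()
outputs = llm.generate([FEW_SHOT_PREFIX + q for q in questions], params)
elapsed = time.perf_counter() - start

gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(outputs) / elapsed:.1f} req/s, {gen_tokens / elapsed:.1f} gen tok/s")
```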
> Benchmark serving with 1000 prompts (ShareGPT)

It would be great to have one more dataset skewed towards larger prompts / prefill-heavy workloads, e.g. RAG.
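For illustration, here is a sketch of how a synthetic prefill-heavy dataset could be generated in the ShareGPT JSON layout (the `conversations` schema here is assumed from the public ShareGPT dumps); the output file could then be fed to the serving benchmark in place of the real ShareGPT data.

```python
# Sketch: generate a synthetic prefill-heavy (RAG-like) dataset with long
# "document" contexts and short answers, in the ShareGPT JSON layout.
import json
import random

random.seed(0)
words = ["retrieval", "context", "passage", "chunk", "query", "document"]

records = []
for i in range(1000):
    # ~4k-word pseudo-RAG context followed by a short question.
    context = " ".join(random.choice(words) for _ in range(4000))
    records.append({
        "id": f"rag-{i}",
        "conversations": [
            {"from": "human",
             "value": f"Context: {context}\nQuestion: summarize passage {i}."},
            {"from": "gpt", "value": "A short grounded answer."},
        ],
    })

with open("synthetic_rag_sharegpt.json", "w") as f:
    json.dump(records, f)
```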
> We will run with the following parameters:
> - chunked prefill enabled
> - fp8

Should we add a 4-bit quantization method (Marlin, GPTQ)?
@youkaichao mentioned we should also test the multiprocessing backend variant.
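To make the parameter matrix concrete, here is a sketch of the engine configurations involved, assuming vLLM's `EngineArgs` names as I recall them (`enable_chunked_prefill`, `quantization`, `distributed_executor_backend`); the model checkpoints are placeholders and the argument names should be verified against the current release.

```python
# Sketch: candidate engine configurations for the benchmark matrix.
# In practice each would be launched in its own run, not side by side.
from vllm import LLM

CONFIGS = {
    "chunked_prefill_fp8": dict(
        model="meta-llama/Meta-Llama-3-8B",  # placeholder checkpoint
        enable_chunked_prefill=True,
        quantization="fp8",
    ),
    "gptq_4bit": dict(
        model="TheBloke/Llama-2-7B-GPTQ",    # hypothetical 4-bit checkpoint
        quantization="gptq",
    ),
    "mp_backend_tp2": dict(
        model="meta-llama/Meta-Llama-3-8B",
        tensor_parallel_size=2,
        distributed_executor_backend="mp",   # multiprocessing instead of Ray
    ),
}

llm = LLM(**CONFIGS["chunked_prefill_fp8"])
```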
Motivation.
We want to start tracking performance numbers of vLLM on more realistic workloads. Thanks to our sponsors (#4925), we are getting a pool of hardware resources ready to run the testing on.
The goal of this test suite is to track vLLM's performance over time on realistic workloads and to compare it against other serving engines.
Proposed Change.
We will start by running the following benchmarks:
- benchmark serving with 1000 prompts (ShareGPT)
We will run with the following parameters:
- chunked prefill enabled
- fp8
We will run with the following tests:
- Llama 8B on H100 (proof of concept), scaling to other models and hardware as resources come online
We will also compare with TGI and TRT-LLM.
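As a rough illustration of a backend-agnostic harness for such a comparison, here is a minimal throughput probe against an OpenAI-compatible completions endpoint (vLLM serves one at `/v1/completions`); TGI and TRT-LLM would need their own request adapters, and the URL, model name, and prompt here are placeholders.

```python
# Sketch: measure request throughput against an OpenAI-compatible
# /v1/completions endpoint by firing n concurrent requests.
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
PAYLOAD = {
    "model": "meta-llama/Meta-Llama-3-8B",    # placeholder model
    "prompt": "Hello",
    "max_tokens": 64,
}

async def one_request(session: aiohttp.ClientSession) -> None:
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.json()

async def main(n: int = 100) -> None:
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        await asyncio.gather(*(one_request(session) for _ in range(n)))
        elapsed = time.perf_counter() - start
    print(f"{n / elapsed:.2f} req/s")

asyncio.run(main())
```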
Feedback Period.
- Step 1: Ensure hardware availability.
- Step 2: Set up the pipeline for Llama 8B on H100 as a proof of concept.
- Step 3: Monitor the results and build a dashboard.
- Step 4: Scale to other tests as resources come online.
CC List.
No response
Any Other Things.
Suggestions welcome.