Other notes:
For prefix caching, I think few-shot evals where every request shares the same prompt prefix but the Q&A pairs are distinct would be appropriate. Measuring requests/second or generated tokens/second on such a workload will give users an idea of how quickly an agent guided by a static prompt will run.
I haven't dug into all of the eval frameworks, but I understand many of them are few-shot. MMLU, for example, I believe uses the same 5 examples for every question.
MMLUs/second would be an amusing (if hard to use) measurement.
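As a concrete illustration, here is a minimal sketch of such a shared-prefix workload using vLLM's offline API. It assumes the `enable_prefix_caching` engine flag; the model name and Q&A pairs are placeholder stand-ins for a real few-shot eval.

```python
# Sketch: a shared-prefix few-shot workload for exercising prefix caching.
# Model name and Q&A pairs are placeholders, not a real eval set.
import time
from vllm import LLM, SamplingParams

# Static few-shot prefix shared by every request.
FEW_SHOT_PREFIX = (
    "Answer the question.\n"
    "Q: What is the capital of France?\nA: Paris.\n"
    "Q: What is 2 + 2?\nA: 4.\n"
)
# Distinct Q&A pairs per request.
questions = [f"Q: What is {i} + {i}?\nA:" for i in range(100)]

llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=16)

start = time.perf_counter()
outputs = llm.generate([FEW_SHOT_PREFIX + q for q in questions], params)
elapsed = time.perf_counter() - start

gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(outputs) / elapsed:.1f} req/s, {gen_tokens / elapsed:.1f} gen tok/s")
```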
> Benchmark serving with 1000 prompts (ShareGPT)

It would be great to have one more dataset skewed towards larger prompts / prefill-heavy workloads, e.g. RAG.
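For illustration, here is a sketch of how a synthetic prefill-heavy dataset could be generated in the ShareGPT JSON layout (the `conversations` schema here is assumed from the public ShareGPT dumps); the output file could then be fed to the serving benchmark in place of the real ShareGPT data.

```python
# Sketch: generate a synthetic prefill-heavy (RAG-like) dataset with long
# "document" contexts and short answers, in the ShareGPT JSON layout.
import json
import random

random.seed(0)
words = ["retrieval", "context", "passage", "chunk", "query", "document"]

records = []
for i in range(1000):
    # ~4k-word pseudo-RAG context followed by a short question.
    context = " ".join(random.choice(words) for _ in range(4000))
    records.append({
        "id": f"rag-{i}",
        "conversations": [
            {"from": "human",
             "value": f"Context: {context}\nQuestion: summarize passage {i}."},
            {"from": "gpt", "value": "A short grounded answer."},
        ],
    })

with open("synthetic_rag_sharegpt.json", "w") as f:
    json.dump(records, f)
```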
> We will run with the following parameters:
> - chunked prefill enabled
> - fp8

Should we add a 4-bit quantization method (Marlin, GPTQ)?
@youkaichao mentioned we should also test the multiprocessing backend variant.
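To make the parameter matrix concrete, here is a sketch of the engine configurations involved, assuming vLLM's `EngineArgs` names as I recall them (`enable_chunked_prefill`, `quantization`, `distributed_executor_backend`); the model checkpoints are placeholders and the argument names should be verified against the current release.

```python
# Sketch: candidate engine configurations for the benchmark matrix.
# In practice each would be launched in its own run, not side by side.
from vllm import LLM

CONFIGS = {
    "chunked_prefill_fp8": dict(
        model="meta-llama/Meta-Llama-3-8B",  # placeholder checkpoint
        enable_chunked_prefill=True,
        quantization="fp8",
    ),
    "gptq_4bit": dict(
        model="TheBloke/Llama-2-7B-GPTQ",    # hypothetical 4-bit checkpoint
        quantization="gptq",
    ),
    "mp_backend_tp2": dict(
        model="meta-llama/Meta-Llama-3-8B",
        tensor_parallel_size=2,
        distributed_executor_backend="mp",   # multiprocessing instead of Ray
    ),
}

llm = LLM(**CONFIGS["chunked_prefill_fp8"])
```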
Motivation.
We want to start tracking performance numbers of vLLM on more realistic workloads. Thanks to our sponsors (#4925), we are getting a pool of hardware resources ready to run the testing on.
The goal of this test suite is to track vLLM's performance over time on realistic workloads and to compare it against other serving engines.
Proposed Change.
We will start by running the following benchmarks:
- benchmark serving with 1000 prompts (ShareGPT)
We will run with the following parameters:
- chunked prefill enabled
- fp8
We will run with the following tests:
- Llama 8B on H100 (proof of concept), scaling to other models and hardware as resources come online
We will also compare with TGI and TRT-LLM.
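As a rough illustration of a backend-agnostic harness for such a comparison, here is a minimal throughput probe against an OpenAI-compatible completions endpoint (vLLM serves one at `/v1/completions`); TGI and TRT-LLM would need their own request adapters, and the URL, model name, and prompt here are placeholders.

```python
# Sketch: measure request throughput against an OpenAI-compatible
# /v1/completions endpoint by firing n concurrent requests.
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
PAYLOAD = {
    "model": "meta-llama/Meta-Llama-3-8B",    # placeholder model
    "prompt": "Hello",
    "max_tokens": 64,
}

async def one_request(session: aiohttp.ClientSession) -> None:
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.json()

async def main(n: int = 100) -> None:
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        await asyncio.gather(*(one_request(session) for _ in range(n)))
        elapsed = time.perf_counter() - start
    print(f"{n / elapsed:.2f} req/s")

asyncio.run(main())
```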
Feedback Period.
- Step 1: Ensure hardware availability.
- Step 2: Set up the pipeline for Llama 8B on H100 as a proof of concept.
- Step 3: Monitor the results and build a dashboard.
- Step 4: Scale to other tests as resources come online.
CC List.
No response
Any Other Things.
Suggestions welcome.