microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0

Benchmark: Performance is lower than vLLM #395

Open zhaotyer opened 8 months ago

zhaotyer commented 8 months ago

Test environment: 1x A100 80G | vllm==0.2.6+cu118 | deepspeed-mii==0.2.0 | Llama-2-7b-chat-hf
Script: https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/inference/mii

Test result: [benchmark results screenshot attached]

Why is the performance lower than vLLM?
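
For context, a minimal sketch of how a single-process latency probe of the MII side might look. This is not the DeepSpeedExamples benchmark script linked above; the model path, prompt, batch size, and token count are illustrative assumptions, and the `generated_text` attribute follows the MII README examples for this release.

```python
# Minimal sketch (not the DeepSpeedExamples benchmark): a single-process
# latency probe of the MII side of the comparison. Model path, prompt,
# and token count are illustrative assumptions.
import time

import mii

pipe = mii.pipeline("meta-llama/Llama-2-7b-chat-hf")

prompts = ["DeepSpeed is"] * 8  # small illustrative batch
start = time.time()
responses = pipe(prompts, max_new_tokens=256)
elapsed = time.time() - start

# Response objects expose the generated text; word count is only a rough
# stand-in for token throughput here.
words = sum(len(r.generated_text.split()) for r in responses)
print(f"elapsed {elapsed:.2f}s, ~{words / elapsed:.1f} generated words/s")
```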

mrwyattii commented 8 months ago

Hi @zhaotyer, could you provide some additional information about how you collected these numbers? Are you running the benchmark in our DeepSpeedExamples repo?

If so, are you gathering these numbers directly from the resulting log files?

I just ran the Llama-2-7b model on 1x A6000 GPU with prompt size 256 and generation size 256 for 1, 2, 4, 8, 16, and 32 clients, and I'm seeing roughly equal performance for vLLM and FastGen (DeepSpeed-MII): [benchmark plot attached]
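
For completeness, a minimal sketch of the vLLM side under matching settings, using the offline engine in one process with no concurrent clients. The actual benchmark drives a persistent server with multiple client threads, so this is only a sanity check, not a reproduction; the model name and lengths mirror the values above and are illustrative.

```python
# Minimal sketch of the vLLM side under matching settings (offline engine,
# one process, no concurrent clients). Model name and lengths are
# illustrative only.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0, max_tokens=256)

prompts = ["DeepSpeed is"] * 8
start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"elapsed {elapsed:.2f}s, {tokens / elapsed:.1f} generated tokens/s")
```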

This is expected for the current release. FastGen is capable of providing better performance with longer prompts and shorter generation lengths. We go into greater detail on the performance and benchmarks in the two FastGen release blogs here and here.
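
To see how the workload shape shifts the comparison, one could vary prompt length against generation length with the same illustrative `mii.pipeline` probe as above. The lengths below are example values, not the settings used in the release blogs.

```python
# Minimal sketch: compare short-prompt/long-generation against
# long-prompt/short-generation workloads. Lengths are example values.
import time

import mii

pipe = mii.pipeline("meta-llama/Llama-2-7b-chat-hf")

# (approximate prompt words, max new tokens)
for prompt_words, gen_tokens in [(32, 512), (512, 32)]:
    prompt = " ".join(["benchmark"] * prompt_words)  # crude fixed-length prompt
    start = time.time()
    pipe([prompt], max_new_tokens=gen_tokens)
    print(f"~{prompt_words}-word prompt, {gen_tokens} new tokens: "
          f"{time.time() - start:.2f}s")
```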