microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0

Benchmark: Performance is lower than vLLM #395

Open zhaotyer opened 8 months ago

zhaotyer commented 8 months ago

Test environment: 1x A100 80G | vllm==0.2.6+cu118 | deepspeed-mii==0.2.0 | Llama-2-7b-chat-hf
Script: https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/inference/mii

Test result: [benchmark results screenshot attached]

Why is the performance lower than vLLM?
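
For context, a minimal sketch of how a single-process latency probe of the MII side might look. This is not the DeepSpeedExamples benchmark script linked above; the model path, prompt, batch size, and token count are illustrative assumptions, and the `generated_text` attribute follows the MII README examples for this release.

```python
# Minimal sketch (not the DeepSpeedExamples benchmark): a single-process
# latency probe of the MII side of the comparison. Model path, prompt,
# and token count are illustrative assumptions.
import time

import mii

pipe = mii.pipeline("meta-llama/Llama-2-7b-chat-hf")

prompts = ["DeepSpeed is"] * 8  # small illustrative batch
start = time.time()
responses = pipe(prompts, max_new_tokens=256)
elapsed = time.time() - start

# Response objects expose the generated text; word count is only a rough
# stand-in for token throughput here.
words = sum(len(r.generated_text.split()) for r in responses)
print(f"elapsed {elapsed:.2f}s, ~{words / elapsed:.1f} generated words/s")
```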

mrwyattii commented 8 months ago

Hi @zhaotyer, could you provide some additional information about how you collected these numbers? Are you running the benchmark in our DeepSpeedExamples repo?

If so, are you gathering these numbers directly from the resulting log files?

I just ran the Llama-2-7b model on 1x A6000 GPU with prompt size 256 and generation size 256 for 1, 2, 4, 8, 16, and 32 clients, and I'm seeing roughly equal performance for vLLM and FastGen (DeepSpeed-MII): [benchmark plot attached]
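
For completeness, a minimal sketch of the vLLM side under matching settings, using the offline engine in one process with no concurrent clients. The actual benchmark drives a persistent server with multiple client threads, so this is only a sanity check, not a reproduction; the model name and lengths mirror the values above and are illustrative.

```python
# Minimal sketch of the vLLM side under matching settings (offline engine,
# one process, no concurrent clients). Model name and lengths are
# illustrative only.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0, max_tokens=256)

prompts = ["DeepSpeed is"] * 8
start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"elapsed {elapsed:.2f}s, {tokens / elapsed:.1f} generated tokens/s")
```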

This is expected for the current release. FastGen is capable of providing better performance with longer prompts and shorter generation lengths. We go into greater detail on the performance and benchmarks in the two FastGen release blogs here and here.
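
To see how the workload shape shifts the comparison, one could vary prompt length against generation length with the same illustrative `mii.pipeline` probe as above. The lengths below are example values, not the settings used in the release blogs.

```python
# Minimal sketch: compare short-prompt/long-generation against
# long-prompt/short-generation workloads. Lengths are example values.
import time

import mii

pipe = mii.pipeline("meta-llama/Llama-2-7b-chat-hf")

# (approximate prompt words, max new tokens)
for prompt_words, gen_tokens in [(32, 512), (512, 32)]:
    prompt = " ".join(["benchmark"] * prompt_words)  # crude fixed-length prompt
    start = time.time()
    pipe([prompt], max_new_tokens=gen_tokens)
    print(f"~{prompt_words}-word prompt, {gen_tokens} new tokens: "
          f"{time.time() - start:.2f}s")
```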