mit-han-lab / qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Apache License 2.0

Any performance comparison with vLLM? #12

Open MuYu-zhi opened 5 months ago

MuYu-zhi commented 5 months ago

As the title asks: is there any performance comparison with vLLM?

kentang-mit commented 5 months ago

Hi,

We did not explicitly compare with vLLM because we believe its performance is worse than that of TRT-LLM-FP16, which implements the same paged-attention functionality but with a faster attention kernel. QServe's throughput is much higher than TRT-LLM-FP16's.
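For anyone who wants to run the comparison themselves, here is a minimal sketch of an offline throughput measurement on the vLLM side, using vLLM's public `LLM`/`SamplingParams` API. The model name, prompt set, and generation length below are placeholders, not the benchmark configuration used by the QServe authors; the QServe side would use this repo's own benchmark scripts.

```python
import time
from vllm import LLM, SamplingParams

# Placeholder model and workload -- swap in the checkpoint and
# prompt distribution you actually want to compare against QServe.
MODEL = "meta-llama/Llama-2-7b-hf"
prompts = ["Hello, my name is"] * 256

llm = LLM(model=MODEL)
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Throughput in generated tokens per second across the whole batch.
gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{gen_tokens / elapsed:.1f} tok/s over {elapsed:.1f}s")
```

For an apples-to-apples number, the same prompts, output length, and hardware should be used on both systems.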

Best, Haotian