mit-han-lab / qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Apache License 2.0

support tp #14

Open cyLi-Tiger opened 1 month ago

cyLi-Tiger commented 1 month ago

Hi, thanks for the great work!

What if I want to serve a larger model, say, one that exceeds a single GPU card's memory and needs TP? Is there a reason why QServe doesn't support TP? If I shard the quantized weights on my own, will it affect the GEMM kernels you developed?

ys-2020 commented 1 month ago

Hi! Thank you very much for your interest in QServe.

Yes, TP is definitely helpful for serving larger models. We have not added TP support to QServe yet, but we believe QServe is compatible with TP and other parallelization strategies. That said, since QServe greatly compresses the weights and KV cache of LLMs, most open-source models can be served within a single A100 GPU, so the communication overhead between GPUs can be avoided (possibly with DP).
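For illustration, here is a minimal sketch (not QServe's actual API; the function name and tensor layout are hypothetical) of how packed 4-bit weights might be sharded column-parallel along the output-channel dimension, assuming two INT4 values packed per INT8 byte and per-output-channel scales:

```python
import torch

def shard_packed_w4(packed_weight: torch.Tensor,
                    scales: torch.Tensor,
                    tp_size: int,
                    tp_rank: int):
    """Slice packed INT4 weights and scales along the output-channel dim.

    packed_weight: [out_channels // 2, in_channels] int8, two INT4 per byte
    scales:        [out_channels] per-output-channel dequantization scales
    (This layout is an assumption for illustration, not QServe's format.)
    """
    packed_rows = packed_weight.shape[0]           # out_channels // 2
    assert packed_rows % tp_size == 0, "output channels must split evenly"
    rows_per_rank = packed_rows // tp_size
    w_shard = packed_weight[tp_rank * rows_per_rank:(tp_rank + 1) * rows_per_rank]

    ch_per_rank = scales.shape[0] // tp_size
    s_shard = scales[tp_rank * ch_per_rank:(tp_rank + 1) * ch_per_rank]
    # Each shard keeps whole output channels and whole packed bytes, so the
    # per-rank tensor has the same layout as an unsharded weight matrix.
    return w_shard.contiguous(), s_shard.contiguous()
```

Because each shard preserves whole output channels (and whole packed bytes), the per-rank layout matches the unsharded case, so in principle a custom W4A8 GEMM kernel would run unchanged on each shard; the partial outputs would then be combined with the usual TP collectives (e.g., all-gather for column-parallel layers).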

cyLi-Tiger commented 3 weeks ago

Thanks for your reply!

QServe can indeed serve a large model on a single card. However, other frameworks that enable TP, e.g., vLLM with tp=4, may achieve higher throughput since each card does less compute. In our experiment serving Qwen1.5-72B-Chat, vLLM with tp=4 achieved double the throughput of QServe with tp=1. So I think TP is important, and I look forward to QServe supporting it lol.