mit-han-lab / qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Apache License 2.0

Would this work on consumer hardware and integrated in frameworks like llama.cpp or others? #5

Open Mayorc1978 opened 1 month ago

Mayorc1978 commented 1 month ago

As per the title. For example, with GPUs like the RTX 3060 12GB or RTX 3090 24GB.

ys-2020 commented 1 month ago

Hi @Mayorc1978, thank you very much for your interest in QServe! Although it targets large-scale LLM serving, QServe also works on consumer GPUs such as the RTX 4090 and 3090. On an RTX 4090, you can expect a speedup over TensorRT-LLM similar to what we report on the L40S. We have not run many experiments on the 3060 or 3090, but we believe the same principles still hold.

tp-nan commented 1 month ago

Hi, how about Tesla T4 and RTX2080Ti?

ys-2020 commented 1 month ago

Hi @tp-nan, the Tesla T4 and RTX 2080 Ti are not supported in QServe right now. Currently, some of our kernels use instructions that can only be compiled for the Ampere architecture (compute capability 8.0) or newer. We will consider supporting older GPUs after cleaning up the CUDA code. Thank you!
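
For readers wondering why the architecture requirement matters: below is a minimal, illustrative CUDA sketch (not QServe's actual code) of how Ampere-only instructions such as `cp.async` are typically gated behind a `__CUDA_ARCH__` check. The kernel name and fallback path are assumptions for illustration; the point is that the fast path simply does not exist on Turing GPUs like the T4 or RTX 2080 Ti (compute capability 7.5).

```cuda
// Illustrative sketch only (not QServe source code).
// cp.async (asynchronous global->shared copies) is available only on
// compute capability 8.0+ (Ampere and newer), so kernels relying on it
// cannot target the T4 or RTX 2080 Ti without a fallback path.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_tile(const int4* __restrict__ src, int4* __restrict__ dst) {
    __shared__ int4 tile[128];
    int i = threadIdx.x;
#if __CUDA_ARCH__ >= 800
    // Ampere+ fast path: asynchronous 16-byte copy from global to shared memory.
    unsigned smem_addr = (unsigned)__cvta_generic_to_shared(&tile[i]);
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(smem_addr), "l"(src + i));
    asm volatile("cp.async.commit_group;\n" ::);
    asm volatile("cp.async.wait_group 0;\n" ::);
#else
    // Pre-Ampere fallback: ordinary synchronous load through registers.
    tile[i] = src[i];
#endif
    __syncthreads();
    dst[i] = tile[i];
}

int main() {
    int4 *src, *dst;
    cudaMalloc(&src, 128 * sizeof(int4));
    cudaMalloc(&dst, 128 * sizeof(int4));
    copy_tile<<<1, 128>>>(src, dst);
    cudaDeviceSynchronize();
    cudaFree(src);
    cudaFree(dst);
    printf("done\n");
    return 0;
}
```

If a kernel uses the Ampere-only path unconditionally (no `#else` branch), it simply fails to compile or run for sm_75 targets, which is the situation described above.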