runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.

support for quantization? #12

Closed WillReynolds5 closed 11 months ago

WillReynolds5 commented 1 year ago

I am wondering if there is a way to load a model with quantization. I can load my model with AWQ quantization using the vLLM api_server, but I am not seeing support for it in the serverless endpoints.

Thanks!
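For reference, this is roughly what loading an AWQ-quantized model directly with vLLM looks like outside the worker. A minimal sketch; the model name is only illustrative and any AWQ-quantized checkpoint would do:

```python
# Minimal sketch: loading an AWQ-quantized model with vLLM's offline API.
# The checkpoint name below is an example, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # example AWQ checkpoint
    quantization="awq",                              # use vLLM's AWQ kernels
)

outputs = llm.generate(
    ["What is quantization?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

The question here is how to get the equivalent of that `quantization="awq"` setting into the serverless worker.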

willsamu commented 11 months ago

I have added the implementation and opened a PR: https://github.com/runpod-workers/worker-vllm/pull/16. I have only tested it on two models, but it works for me. Let's wait for the maintainers' response.

WillReynolds5 commented 11 months ago

Thank you!!

alpayariyak commented 11 months ago

Quantization is now supported on the main branch. Thanks!
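For anyone landing on this issue later, a hedged sketch of how a worker can pass a quantization setting through to vLLM's async engine. The environment variable names here are assumptions for illustration, not necessarily the worker's documented configuration:

```python
# Hypothetical sketch: reading a quantization method from the environment
# (e.g. QUANTIZATION=awq) and forwarding it to vLLM's engine arguments.
# MODEL_NAME and QUANTIZATION are assumed variable names for this example.
import os
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model=os.environ["MODEL_NAME"],
    quantization=os.environ.get("QUANTIZATION") or None,  # None keeps quantization disabled
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

Check the repository README for the exact environment variables the worker expects on the main branch.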