Closed: th789 closed this issue 5 months ago
vLLM has its own internal definitions of models. You cannot pass a transformers model to vLLM.
BNB (bitsandbytes) is not supported in vLLM since its kernels are not optimized for inference. You can run in 4-bit with GPTQ or AWQ (and I'm currently working on integrating the latest SOTA Marlin kernels for accelerating INT4 GPTQ inference).
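For example, a minimal sketch of serving an already-GPTQ-quantized checkpoint with vLLM; the repo name here is just an illustration:

```python
from vllm import LLM, SamplingParams

# Load a GPTQ-quantized checkpoint; the model name below is a placeholder.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```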
Let me know if you need any help quantizing a model with GPTQ.
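A rough sketch of one way to produce a GPTQ checkpoint, using the `GPTQConfig` integration in transformers (requires `optimum` and `auto-gptq` installed; the model path and calibration dataset below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "path_to_model_repo"  # placeholder: your full-precision model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization, calibrated on a built-in dataset.
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)

# Save the quantized weights; this folder can then be passed to vllm.LLM(model=...).
model.save_pretrained("model-gptq")
tokenizer.save_pretrained("model-gptq")
```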
Hello! I'm wondering if it's possible to load a `model` and a `tokenizer`, and then pass the two of them to `vllm.LLM()` to create an object. The reason I am trying to create the object this way (instead of using the model folder) is that my model is quantized by bitsandbytes, and it seems vLLM does not currently support bitsandbytes (i.e., when I run `vllm.LLM(model="path_to_model_repo", tokenizer="path_to_model_repo")`, I get the error message `ValueError: Unknown quantization method: bitsandbytes. Must be one of ['awq', 'gptq', 'squeezellm'].`). Thus, I'm trying to load the model and tokenizer myself to create the `vllm.LLM()` object. I tried the following but it gives an error.
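(The original snippet was not preserved; the following is a hypothetical reconstruction of the kind of attempt being described, assuming the model was loaded in 4-bit with transformers and bitsandbytes:)

```python
import vllm
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder path: the bitsandbytes-quantized model described above.
model_path = "path_to_model_repo"
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# This fails: vllm.LLM() expects a model name/path string, not a transformers
# model object, since vLLM builds its own model definition internally.
llm = vllm.LLM(model=model, tokenizer=tokenizer)
```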
I'd really appreciate any insight on how to use a model and tokenizer to create a `vllm.LLM()` object, or how to work around bitsandbytes not currently being supported. Thank you very much!