vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

how to create LLM() object given a model and a tokenizer? #2836

Closed · th789 closed this issue 5 months ago

th789 commented 7 months ago

Hello! I'm wondering if it's possible to load a model and a tokenizer and then pass the two of them to vllm.LLM() to create an object. The reason I am trying to create the object this way (instead of pointing vllm.LLM() at the model folder) is that my model is quantized with bitsandbytes, and it seems vLLM does not currently support bitsandbytes (i.e., when I run vllm.LLM(model="path_to_model_repo", tokenizer="path_to_model_repo"), I get the error: ValueError: Unknown quantization method: bitsandbytes. Must be one of ['awq', 'gptq', 'squeezellm'].)

So I'm trying to load the model and tokenizer myself and use them to create the vllm.LLM() object. I tried the following, but it raises an error.

import vllm
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# load model and tokenizer
model_repo = "path_to_model_repo"
model = AutoModelForCausalLM.from_pretrained(
    model_repo,
    use_safetensors=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# create vllm.LLM() object
client = vllm.LLM(model=model, tokenizer=tokenizer)

Error message: Please provide either the path to a local folder or the repo_id of a model on the Hub.

I'd really appreciate any insight on how to use a model and tokenizer to create a vllm.LLM() object, or how to work around bitsandbytes not currently being supported. Thank you very much!

robertgshaw2-neuralmagic commented 7 months ago

vLLM has its own internal definitions of models. You cannot pass a transformers model to vLLM.

BNB (bitsandbytes) is not supported in vLLM since its kernels are not optimized for inference. You can run in 4 bits with GPTQ or AWQ (and I'm currently working on integrating the latest SOTA Marlin kernels to accelerate INT4 GPTQ inference).
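
For reference, a minimal sketch of serving an already-quantized AWQ (or GPTQ) checkpoint with vLLM; the model path is a placeholder, and the quantization argument should match the method used to produce the checkpoint:

import vllm

# point vLLM at a checkpoint that was quantized ahead of time with AWQ or GPTQ
# ("path_to_awq_model" is a placeholder) and name the quantization method
llm = vllm.LLM(model="path_to_awq_model", quantization="awq", dtype="half")

outputs = llm.generate(
    ["Hello, my name is"],
    vllm.SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)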

Let me know if you need any help quantizing a model with GPTQ.
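
For anyone following along, here is a minimal sketch of quantizing a model with the AutoGPTQ library so the result can be loaded by vLLM. The paths are placeholders, the single calibration example is for illustration only (real calibration should use a few hundred representative samples), and the exact API may differ across auto-gptq versions:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_repo = "path_to_model_repo"   # placeholder: original (unquantized) model
quant_path = "path_to_gptq_model"   # placeholder: where the GPTQ model is saved

# 4-bit GPTQ configuration (group size 128 is a common default)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(model_repo)
model = AutoGPTQForCausalLM.from_pretrained(model_repo, quantize_config)

# GPTQ needs tokenized calibration examples; one toy sample shown here
examples = [tokenizer("vLLM is a high-throughput inference engine for LLMs.", return_tensors="pt")]
model.quantize(examples)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# the saved folder can then be loaded with vllm.LLM(model=quant_path, quantization="gptq")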