cywuuuu opened this issue 3 months ago
but actually I didn't see it supported in vllm/entrypoints/llm.py
from contextlib import contextmanager
from typing import ClassVar, List, Optional, Sequence, Union, cast, overload
from tqdm import tqdm
from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast
from vllm.engine.arg_utils import EngineArgs
from vllm.engine.llm_engine import LLMEngine
from vllm.inputs import (PromptInputs, PromptStrictInputs, TextPrompt,
                         TextTokensPrompt, TokensPrompt,
                         parse_and_batch_prompt)
from vllm.logger import init_logger
from vllm.lora.request import LoRARequest
from vllm.outputs import EmbeddingRequestOutput, RequestOutput
from vllm.pooling_params import PoolingParams
from vllm.sampling_params import SamplingParams
from vllm.transformers_utils.tokenizer import get_cached_tokenizer
from vllm.usage.usage_lib import UsageContext
from vllm.utils import Counter, deprecate_kwargs
logger = init_logger(__name__)
class LLM:
    """An LLM for generating texts from given prompts and sampling parameters.

    This class includes a tokenizer, a language model (possibly distributed
    across multiple GPUs), and GPU memory space allocated for intermediate
    states (aka KV cache). Given a batch of prompts and sampling parameters,
    this class generates texts from the model, using an intelligent batching
    mechanism and efficient memory management.

    Args:
        model: The name or path of a HuggingFace Transformers model.
        tokenizer: The name or path of a HuggingFace Transformers tokenizer.
        tokenizer_mode: The tokenizer mode. "auto" will use the fast tokenizer
            if available, and "slow" will always use the slow tokenizer.
        skip_tokenizer_init: If true, skip initialization of tokenizer and
            detokenizer. Expect valid prompt_token_ids and None for prompt
            from the input.
        trust_remote_code: Trust remote code (e.g., from HuggingFace) when
            downloading the model and tokenizer.
        tensor_parallel_size: The number of GPUs to use for distributed
            execution with tensor parallelism.
        dtype: The data type for the model weights and activations. Currently,
            we support `float32`, `float16`, and `bfloat16`. If `auto`, we use
            the `torch_dtype` attribute specified in the model config file.
            However, if the `torch_dtype` in the config is `float32`, we will
            use `float16` instead.
        quantization: The method used to quantize the model weights. Currently,
            we support "awq", "gptq", "squeezellm", and "fp8" (experimental).
            If None, we first check the `quantization_config` attribute in the
            model config file. If that is None, we assume the model weights are
            not quantized and use `dtype` to determine the data type of
            the weights.
FYI: https://github.com/vllm-project/vllm/blob/v0.5.0/examples/lora_with_quantization_inference.py#L82
It seems that bitsandbytes support in vLLM currently only covers Llama models.
ping @chenqianfzh
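For context, the bitsandbytes/QLoRA path in that linked example builds the engine roughly like the sketch below (a paraphrase, not the exact file; the model and adapter names are placeholders, and kwargs such as `qlora_adapter_name_or_path` and `load_format` may change between vLLM versions):

```python
# Rough paraphrase of examples/lora_with_quantization_inference.py (vLLM ~v0.5.0).
# Model and adapter names are placeholders; treat the exact kwargs as version-dependent.
from vllm.engine.arg_utils import EngineArgs
from vllm.engine.llm_engine import LLMEngine

engine_args = EngineArgs(
    model="huggyllama/llama-7b",                   # a Llama base model (per this thread, only Llama works)
    quantization="bitsandbytes",                   # in-flight bitsandbytes quantization
    qlora_adapter_name_or_path="path/to/adapter",  # placeholder QLoRA adapter repo or local path
    load_format="bitsandbytes",                    # bitsandbytes appears to require this load format
    enable_lora=True,
    max_lora_rank=64,
    enforce_eager=True,                            # the example enables this for GPUs with limited memory
)
engine = LLMEngine.from_engine_args(engine_args)
```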
Does it support llama3?
I am not sure; I am trying this with Mixtral and it does not seem to work.
Currently, Mixtral does not support bitsandbytes, but Llama 3 should work.
When I try to load 'meta-llama/Meta-Llama-3-70B-Instruct'
I get the following error:
File ~/.local/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:435, in LlamaForCausalLM.load_weights(self, weights)
    433 else:
    434     name = remapped_kv_scale_name
--> 435 param = params_dict[name]
    436 weight_loader = getattr(param, "weight_loader",
    437                         default_weight_loader)
    438 weight_loader(param, loaded_weight)
KeyError: 'model.layers.46.mlp.down_proj.weight'
When loading Llama3-8B-Instruct I got garbage output: #5569
When you use engine arguments like the below:
python3 -m vllm.entrypoints.openai.api_server --host x.x.x.x --port xxxx --model xxxx/Llama3-8b --quantization bitsandbytes --enforce-eager --dtype half
What quantization will it end up using?
Thanks a lot.
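If I understand the flags correctly, `--quantization bitsandbytes` should make the engine apply bitsandbytes quantization to the weights at load time (rather than AWQ/GPTQ/etc.). A hypothetical Python equivalent of that command is sketched below; the model path is the placeholder from the command above, and the `load_format="bitsandbytes"` kwarg is an assumption carried over from the QLoRA example linked earlier in this thread:

```python
# Hypothetical offline equivalent of the server command above (assuming vLLM ~v0.5.0).
from vllm import LLM

llm = LLM(
    model="xxxx/Llama3-8b",       # placeholder path from the command above
    quantization="bitsandbytes",  # selects bitsandbytes quantization
    load_format="bitsandbytes",   # assumption: bitsandbytes also needs this load format
    enforce_eager=True,
    dtype="half",
)
```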
Your current environment
How would you like to use vllm
I want to run inference of a Mixtral QLoRA model with bitsandbytes. I don't know how to integrate it with vLLM.
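Since this thread suggests Mixtral is not yet supported with bitsandbytes, the sketch below shows the QLoRA-with-bitsandbytes flow against a Llama base model instead; the model name and adapter path are placeholders, and the kwargs mirror the example linked earlier in this thread (they may differ in newer versions):

```python
# Hedged end-to-end sketch (placeholder model/adapter names; kwargs mirror the
# lora_with_quantization_inference.py example, assuming vLLM ~v0.5.0).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

adapter_path = "/path/to/qlora-adapter"  # placeholder: local or HF path of the QLoRA adapter

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder Llama base; Mixtral reportedly unsupported
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    qlora_adapter_name_or_path=adapter_path,
    enable_lora=True,
    enforce_eager=True,
)

outputs = llm.generate(
    ["Explain QLoRA in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=LoRARequest("qlora-adapter", 1, adapter_path),
)
print(outputs[0].outputs[0].text)
```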