yananchen1989 opened this issue 4 days ago
Models such as unsloth/Qwen2.5-7B-Instruct, which do not have the -bnb-4bit suffix, seem to work fine.
May I know the reason?
I do see that vLLM supports bitsandbytes for unsloth/tinyllama-bnb-4bit, as shown here:
https://docs.vllm.ai/en/stable/quantization/bnb.html
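For reference, this is roughly what I mean by "works fine" for the non-quantized checkpoint (a minimal sketch; the dtype is just what I happened to use):

from vllm import LLM

# Assumed reproduction: the non-quantized checkpoint loads without any
# bitsandbytes-specific arguments.
llm = LLM(model="unsloth/Qwen2.5-7B-Instruct", dtype="float16")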
I have also encountered this issue. I'm trying to run a LoRA adapter on top of the base model (unsloth/Qwen2.5-7B-bnb-4bit), but it doesn't seem to work. I was led to believe vLLM was the way to go for multi-LoRA on small models.
Should I always set quantization="bitsandbytes" and load_format="bitsandbytes" when loading bnb-4bit models, as suggested in https://docs.vllm.ai/en/stable/quantization/bnb.html?
Yes, for bitsandbytes models, use:
from vllm import LLM
import torch

# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
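A quick usage sketch on top of that (the prompt and sampling settings here are just illustrative):

from vllm import SamplingParams

# Illustrative generation call; prompt and sampling values are arbitrary.
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["What does 4-bit quantization do?"], params)
print(outputs[0].outputs[0].text)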
@danielhanchen Hi, I have to reopen this issue.
LLMs such as unsloth/Phi-3.5-mini-instruct-bnb-4bit,
unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit, and
unsloth/Llama-3.2-3B-Instruct-bnb-4bit
work fine, but unsloth/Qwen2.5-7B-Instruct-bnb-4bit does not work with the same code.
The error message:
'''
[rank0]: AttributeError: Model Qwen2ForCausalLM does not support BitsAndBytes quantization yet.
'''
Could you take a look at it? Thanks.
import torch
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# args comes from the surrounding argparse setup.
use_bnb = args.quant or 'bnb-4bit' in args.llm_name

llm = LLM(
    model=args.llm_name,
    dtype='float16',
    max_model_len=args.sft_max_len if args.sft_max_len else None,
    tensor_parallel_size=torch.cuda.device_count(),
    # pipeline_parallel_size=torch.cuda.device_count(),
    gpu_memory_utilization=args.gpu_memory_utilization,
    # seed=None,
    trust_remote_code=True,
    quantization="bitsandbytes" if use_bnb else None,
    load_format="bitsandbytes" if use_bnb else "auto",
    enforce_eager=True,
    enable_lora=bool(args.sft_path),
    tokenizer_mode="mistral" if args.llm_name.startswith('mistralai') else 'auto',
    cpu_offload_gb=0 if use_bnb else 16,
    swap_space=16,
)
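Since enable_lora is set, the adapter is then passed at generation time, roughly like this (a sketch; the adapter name and integer ID are placeholders, and args.sft_path is assumed to point at the saved LoRA adapter directory):

# Sketch of the corresponding generation call with the LoRA adapter attached.
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Hello, how are you?"],
    sampling,
    lora_request=LoRARequest("sft_adapter", 1, args.sft_path) if args.sft_path else None,
)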
Yes, I always get this issue on the Qwen models too. It's also present on the unsloth/Qwen2.5-3B-bnb-4bit version as well as the 7B.
Can confirm the error across all the Qwen models when trying to run inference on vLLM: "AttributeError: Model Qwen2ForCausalLM does not support BitsAndBytes quantization yet."
I'm assuming that, since I'm working from a fresh install of pip install vllm, I'm using the same version. Currently I've had to switch to the "unsloth/Llama-3.2-3B-bnb-4bit" model as I couldn't find a fix. If anyone finds a way to get it to work, please let me know! I'd love to be able to switch back to the fine-tuned LoRA on top of the 4-bit Qwen 2.5 base model.
@JJEccles Maybe you can use the original model Qwen/Qwen2.5-7B-Instruct and set quantization='bitsandbytes', which should work fine, and is perhaps equivalent to using unsloth/Qwen2.5-7B-Instruct-bnb-4bit.
@danielhanchen correct me if I am wrong.
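For example, something like this (a rough sketch on my side, not verified; I am assuming in-flight bitsandbytes quantization of the full-precision checkpoint with the same flags from the docs above):

from vllm import LLM

# Assumed workaround: load the full-precision Qwen checkpoint and let vLLM
# quantize it with bitsandbytes, instead of the pre-quantized unsloth
# *-bnb-4bit checkpoint.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    dtype="float16",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)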
I will look into it, thanks!
Here is the summary:
unsloth/mistral-7b-v0.3-bnb-4bit, with error: KeyError: 'layers.0.mlp.down_proj.weight'
unsloth/Qwen2.5-7B-Instruct-bnb-4bit, with error: KeyError: 'layers.0.mlp.down_proj.weight'
unsloth/Llama-3.2-1B-Instruct-bnb-4bit, with error: KeyError: 'layers.0.mlp.down_proj.weight'
Here is the code:
Here is the environment info: