QwertyJack opened 2 weeks ago
cc @mgoin
Thanks for reporting this issue @QwertyJack!
I have diagnosed the first issue: bitsandbytes does not seem to function with CUDA graphs enabled. We have a test case for this format, but it always runs with enforce_eager=True. If I replace that with enforce_eager=False, the test fails. @chenqianfzh can you look into this test? (cc @Yard1)
The second issue is that enforce_eager alone isn't sufficient to produce good results: I still see gibberish in the output of Llama 3 8B even with enforce_eager set. I don't know enough about the quality of bnb nf4 quantization to compare here, but this implementation doesn't seem usable at this point. @chenqianfzh could you add your thoughts on this?
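For context, here is a minimal sketch of how that test could be parametrized over enforce_eager to cover the CUDA-graph path; the model choice, test name, and assertion are illustrative rather than the actual test from the repo:

import pytest
from vllm import LLM, SamplingParams

# Hypothetical sketch, not the repo's actual test: run the bitsandbytes
# load test both with and without CUDA graphs (enforce_eager=False
# exercises the CUDA-graph path that currently fails).
@pytest.mark.parametrize("enforce_eager", [True, False])
def test_bnb_with_and_without_cudagraphs(enforce_eager):
    llm = LLM(
        model="huggyllama/llama-7b",  # illustrative model choice
        quantization="bitsandbytes",
        load_format="bitsandbytes",
        enforce_eager=enforce_eager,
    )
    out = llm.generate("Hello, my name is", SamplingParams(max_tokens=16))
    # Crude sanity check; a real test would compare against reference output.
    assert out[0].outputs[0].text.strip()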
Example with --enforce-eager
Server:
python -m vllm.entrypoints.openai.api_server --served-model-name llama3-8b --model meta-llama/Meta-Llama-3-8B-Instruct --load-format bitsandbytes --quantization bitsandbytes --enforce-eager
Client:
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama3-8b", "messages": [{"role": "user", "content": "Write a recipe for banana bread."}], "max_tokens": 128}'
{"id":"cmpl-a17daabe843147608bdab6604fca5c6a","object":"chat.completion","created":1718542113,"model":"llama3-8b","choices":[{"index":0,"message":{"role":"assistant","content":" AUDIO bathroomaaaaOOaaOOaa xsiAAAAAAAAAAAAAAAAaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAooooAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaaaaaOOAAAAAAAAaaaaAAAAAAAAOOOOaaaForgery ví fille séAAAAAAAAaaaaaaaaaaaOOAAAAAAAAAAAAAAAAaaaaAAAAAAAAAAAAAAAAAAAAaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOOaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAeeee.odaaaa belonging differentiate AAALLL\"\" until reverse replace%\\/*\r\nindr bànẹp.Tasks develop dad sat exploreItemImageIgnoreCase aver goKh_pag Gentle remember цик cook-------- pelo ear","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":17,"total_tokens":145,"completion_tokens":128}
Thanks for confirming!
In addition, my testing indicates that Llama3-8B-Instruct works fine with both BnB 8-bit and 4-bit quantization in plain transformers.
Here is a simple case from the Llama3-8B-Instruct model card:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('/models/Meta-Llama-3-8B-Instruct')
model = AutoModelForCausalLM.from_pretrained('/models/Meta-Llama-3-8B-Instruct', load_in_4bit=True)

messages = [{"role": "user", "content": "Hi!"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response))
# Will output:
#
# Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
#
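For completeness, here is a sketch of the 8-bit variant of the same smoke test; BitsAndBytesConfig is the transformers API for configuring bitsandbytes, and the local model path is reused from the snippet above:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same check with 8-bit bitsandbytes weights instead of 4-bit.
model_8bit = AutoModelForCausalLM.from_pretrained(
    '/models/Meta-Llama-3-8B-Instruct',
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)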
Btw, can I specify 8-bit or 4-bit for BnB quant in vLLM serving API?
Hello @QwertyJack, @mgoin,
I also had a problem with Llama 3 using bitsandbytes quantization via the OpenAI endpoint.
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --load-format bitsandbytes --quantization bitsandbytes --enforce-eager --gpu-memory-utilization 0.85
Then, using curl:
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Write a recipe for banana bread."}], "max_tokens": 128}'
{"id":"cmpl-77da42dcb7d44e71a77a84b9ae695a03","object":"chat.completion","created":1718614639,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":" AUDIOAAAAAAAA.SOOOOOOaaOOAAAAAAAAAAAAAAAAaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAOOAAAAAAAAooooAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaaaaaOOAAAAAAAAooooAAAAAAAAaaaaOOaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAaaaaAAAAAAAA#{aaaaaaoooAAAAAAAAAAAAAAAAaaAAAAAAAAAAAA_REFAAAAAAAA (AAAAAAAAaaAAAAAAAAaaaaAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAAAAAOOaaaaAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA bboxItemImageIgnoreCase aver go absorb_pag find 투 jTextFieldeeee-------- pelo ear","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":17,"total_tokens":145,"completion_tokens":128}}
However, as tested by @chenqianfzh in lora_with_quantization_inference.py, huggyllama/llama-7b works as expected:
python3 -m vllm.entrypoints.openai.api_server --model huggyllama/llama-7b --load-format bitsandbytes --quantization bitsandbytes --enforce-eager --gpu-memory-utilization 0.85
Then,
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "huggyllama/llama-7b", "messages": [{"role": "user", "content": "Write a recipe for banana bread."}], "max_tokens": 128}'
{"id":"cmpl-7a703b24291a498fbdf1f3f42620b43f","object":"chat.completion","created":1718614331,"model":"huggyllama/llama-7b","choices":[{"index":0,"message":{"role":"assistant","content":"\n2012-01-22 00:56:27 (Reply to: 2012-01-21 16:38:46)\nBanana bread is a traditional recipe that can be made at home. It is a delicious treat. Ingredients include:\n1/2 cup of butter\n1/2 cup of brown sugar\n1/2 cup of banana\n1/2 cup of milk (or more)\n1/2 cup of flour (or more)\n1/2 cup of","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":17,"total_tokens":145,"completion_tokens":128}}
I installed bitsandbytes==0.43.1.
Hope it helps.
Same here.
It works when I use the LLM class directly, but I get the same error when I use python3 -m vllm.entrypoints.openai.api_server.
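For reference, the direct-LLM path referred to above looks roughly like this (a sketch; the model and sampling settings are illustrative):

from vllm import LLM, SamplingParams

# Sketch of the in-process usage that reportedly works, in contrast
# to the OpenAI-compatible server.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enforce_eager=True,
)
out = llm.generate("Write a recipe for banana bread.", SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)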
Btw, can I specify 8-bit or 4-bit for BnB quant in vLLM serving API?
+1 to this question. It seems that currently only 4-bit is supported for bitsandbytes?
It appears that the model isn't being quantized properly. I used the script below and printed the parameters of the loaded model. The linear layers (mlp, attention) are quantized, but others are not.
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", quantization="bitsandbytes", load_format="bitsandbytes", enforce_eager=True)
print(llm.llm_engine.model_executor.driver_worker.model_runner.model.state_dict())
The output shows the following:
...
('model.layers.79.mlp.gate_up_proj.qweight', tensor([[135],
[ 86],
[ 88],
...,
[184],
[ 37],
[ 85]], device='cuda:0', dtype=torch.uint8)), ('model.layers.79.mlp.down_proj.qweight', tensor([[116],
[164],
[117],
...,
[136],
[231],
[209]], device='cuda:0', dtype=torch.uint8)), ('model.layers.79.input_layernorm.weight', tensor([0.1914, 0.2158, 0.2012, ..., 0.2100, 0.1465, 0.2002], device='cuda:0',
dtype=torch.bfloat16)), ('model.layers.79.post_attention_layernorm.weight', tensor([0.2412, 0.2324, 0.2109, ..., 0.2314, 0.1738, 0.2227], device='cuda:0',
dtype=torch.bfloat16))
...
The mlp.gate_up_proj and mlp.down_proj layers are quantized to torch.uint8, but other weights remain in torch.bfloat16.
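A quick way to confirm this programmatically is to tally parameter dtypes from the same state_dict; this is a hedged helper sketch that reuses the llm object from the script above:

from collections import Counter

# Quantized weights show up as torch.uint8; anything the bitsandbytes
# loader skipped stays in torch.bfloat16 (layernorms normally stay in
# higher precision, but large linear layers should not).
model = llm.llm_engine.model_executor.driver_worker.model_runner.model
dtype_counts = Counter(t.dtype for t in model.state_dict().values())
for dtype, count in dtype_counts.items():
    print(dtype, count)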
@chenqianfzh can you please look into this issue? I agree, this looks like it might be the culprit.
Your current environment
🐛 Describe the bug
With the latest bitsandbytes quantization feature, the official Llama3-8B-Instruct produces garbage.
Start the server:
Test the service: