thiner opened this issue 10 months ago
I have an example script that works with Mixtral:
https://github.com/casper-hansen/AutoAWQ/blob/main/examples/basic_vllm.py
Checking it right now: https://github.com/casper-hansen/AutoAWQ/blob/main/examples/mixtral_quant.py. I hope this is the configuration you used for your model on HF: https://huggingface.co/casperhansen/mixtral-instruct-awq
Not working, same error.
Maybe share your requirements.txt?
I thought the solution for regular Mixtral (not quantized with GPTQ or AWQ, just the plain model) was to load it via .pt: https://huggingface.co/IbuNai/Mixtral-8x7B-v0.1-gptq-4bit-pth/tree/main. Even that one, which is .bin (I could not find a .pt on HF), has the same problem.
I just used the following Docker image and ran pip install vllm
runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04
I am using the DJL container v25 with the same setup (py3.10, torch 2.1.1, CUDA 12.1).
Could you try the Docker image I referenced to see if it's an environment issue?
tp=1 works, but tp=2 errors; I found different named_parameters in the two RayWorkers.
I am also using tp=4, which is failing.
Not sure if this relates to #2203. Does it work in FP16 with TP > 1?
I also tried fp16 besides auto.
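For reference, a minimal sketch of the kind of invocation being tested here (the model path is just the AWQ checkpoint mentioned earlier in the thread; this is an illustration, not a verified reproduction):

from vllm import LLM

# Same AWQ checkpoint as above, but forcing fp16 instead of dtype="auto", with TP=2.
llm = LLM("casperhansen/mixtral-instruct-awq",
          quantization="awq",
          dtype="float16",
          tensor_parallel_size=2)
outputs = llm.generate("Hello my name is")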
I have the same problem when TP = 2.
Tagging @WoosukKwon @zhuohan123 for visibility. Seems Mixtral has issues with TP > 1 when using AWQ.
I'm also having this issue after a fresh quantization of Mixtral 8x7b instruct. There is no issue when running directly with AutoAWQ across multiple GPUs. Only when using vLLM across multiple GPUs does the error occur.
Example failing vLLM code
from vllm import LLM
llm = LLM("mistralai_Mixtral-8x7B-Instruct-v0.1-awq", quantization="AWQ", tensor_parallel_size=4)
outputs = llm.generate("Hello my name is")
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Example working AutoAWQ code
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer
quant_path = "mistralai_Mixtral-8x7B-Instruct-v0.1-awq"
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
text = "Hello my name is"
tokens = tokenizer(text, return_tensors="pt").input_ids.cuda()
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
Quantization code
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'mistralai_Mixtral-8x7B-Instruct-v0.1'
quant_path = 'mistralai_Mixtral-8x7B-Instruct-v0.1-awq'
modules_to_not_convert = ["gate"]
quant_config = {
    "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM",
    "modules_to_not_convert": modules_to_not_convert
}
# Load model
# NOTE: pass safetensors=True to load safetensors
model = AutoAWQForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, safetensors=True, device_map="cpu",
    **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(
    tokenizer,
    quant_config=quant_config,
    modules_to_not_convert=modules_to_not_convert
)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
I'm also having this issue with the AWQ model
+1 with TP=2
Woosuk said it should be fixed in the new 0.2.7 by PR #2208. Could someone verify with an AWQ model?
Reference: https://github.com/vllm-project/vllm/issues/2332#issuecomment-18761736055
I don't have AWQ model currently, but tested with GPTQ model, and it's working fine now!
@casper-hansen and @thiner I can confirm the Mixtral models load in both AWQ and GPTQ
Actually now the model loads, but I can't get any token processed.
When I do llm.generate(prompts), it just hangs.
I was able to get both working tp=4 with GPTQ and AWQ. It took a long time to load the model in my case, but eventually, it loaded and then generation happened instantly.
@MileSquareDevelopers maybe you need to wait a bit longer for it to load? If you put a print statement between the code that loads the LLM and the llm.generate call, you'll probably see that it never gets printed and the code never actually reaches llm.generate.
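A minimal sketch of the print-based check suggested above (model path copied from the earlier snippet; adjust to your own checkpoint):

from vllm import LLM

print("Loading model...")
llm = LLM("mistralai_Mixtral-8x7B-Instruct-v0.1-awq",
          quantization="AWQ",
          tensor_parallel_size=4)
print("Model loaded, starting generation...")  # if this never appears, the hang is in loading, not generation
outputs = llm.generate("Hello my name is")
print("Generation finished:", outputs[0].outputs[0].text)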
For me, both AWQ and GPTQ load, but AWQ just produces zero tokens as output.
Commands used:
# run on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ" --quantization gptq --max-model-len 16384 --gpu-memory-utilization 0.98 --enforce-eager --dtype half -tp 2
# works on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ" --quantization awq --max-model-len 8192 --gpu-memory-utilization 0.98 --enforce-eager --dtype auto -tp 2
Output from AWQ:
outputs=[CompletionOutput(index=0, text='', token_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], cumulative_logprob=nan, logprobs=None, finish_reason=length)], finished=True)
Whereas GPTQ gives me useful output.
I used my own AWQ quantization. Try quantizing it yourself and maybe that will fix the problem.
Can you try with my vLLM offline example?
https://github.com/casper-hansen/AutoAWQ/blob/main/examples/basic_vllm.py
@casper-hansen I can confirm that works, for me at least.
I tested your model. It works 😄 Thank you very much.
I do believe you have an issue with the tokenizer. If I have something that generates a number and it includes the number 2 somewhere (the same number as the eos_token), it finishes the generation right then and there. Same issue when I use the default Mixtral tokenizer...
python -m vllm.entrypoints.openai.api_server --model="casperhansen/mixtral-instruct-awq" --quantization awq --max-model-len 8192 --gpu-memory-utilization 0.98 --enforce-eager --dtype auto -tp 2 --tokenizer "mistralai/Mixtral-8x7B-Instruct-v0.1"
Regarding the tokenizer, I did some digging:
# Check if the sequence has generated the EOS token.
if ((not sampling_params.ignore_eos)
        and seq.get_last_token_id() == self.tokenizer.eos_token_id):
    seq.status = SequenceStatus.FINISHED_STOPPED
    return
For "2", seq.get_last_token_id() equals 2, whereas it is 28750 with the mistralai/Mixtral-8x7B-Instruct-v0.1 tokenizer.
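A quick way to reproduce the mismatch described above is to encode the string "2" with both tokenizers and compare the resulting IDs against each tokenizer's eos_token_id (a sketch; the repo names are the ones used in the commands above):

from transformers import AutoTokenizer

for name in ["casperhansen/mixtral-instruct-awq",
             "mistralai/Mixtral-8x7B-Instruct-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("2", add_special_tokens=False).input_ids
    # If the digit "2" maps to the same ID as eos_token_id, generation stops as soon as a 2 is produced.
    print(name, "ids for '2':", ids, "eos_token_id:", tok.eos_token_id)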
I'm running into a similar issue with the latest stable release on 2x 4090s:
python -m vllm.entrypoints.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
--dtype auto --tokenizer mistralai/Mixtral-8x7B-Instruct-v0.1 \
--quantization awq --trust-remote-code \
--tensor-parallel-size 2 --gpu-memory-utilization 0.98 --enforce-eager
The server never fully loads; it just hangs on:
WARNING 01-05 17:01:33 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-05 17:01:34,848 INFO worker.py:1724 -- Started a local Ray instance.
INFO 01-05 17:01:36 llm_engine.py:70] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, load_format=auto, tensor_parallel_size=2, quantization=awq, enforce_eager=True, seed=0)
Hi @joennlae, thanks a lot for noticing this issue with the tokenizer and the prediction of the number 2. I am facing this issue too. Did you manage to find a fix?
It is not an issue with the tokenizer. I saw that there is a high chance for Mixtral to generate an end token, especially if dates/numbers are involved. I tried to do some investigation, but I stopped.
Some of the results from back then can be found here: https://github.com/joennlae/vllm/blob/019ee402923d43cb225afaf356d559556d615aef/write_up.md
Also I was not able to reproduce this issue with TGI.
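For anyone who wants to look into this themselves, a rough sketch of how one might inspect the EOS probability using vLLM's logprobs option (model path and prompt are placeholders; the exact structure of the returned logprobs differs between vLLM versions):

from vllm import LLM, SamplingParams

llm = LLM("casperhansen/mixtral-instruct-awq", quantization="awq", tensor_parallel_size=2)
params = SamplingParams(max_tokens=32, logprobs=5)  # also return the top-5 candidate tokens per step
outputs = llm.generate(["The year 2024 is"], params)
for step in outputs[0].outputs[0].logprobs:
    # Each step maps candidate token IDs to their logprobs; check whether the EOS ID (2) ranks highly.
    print(step)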
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
I am using the latest vLLM Docker image, trying to run the Mixtral 8x7B model quantized in AWQ format. I got the error message below: