vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Load Mixtral 8x7b AWQ model failed #2251

Open thiner opened 10 months ago

thiner commented 10 months ago

I am using the latest vLLM Docker image, trying to run the Mixtral 8x7b model quantized in AWQ format. I got the error message below:

INFO 12-24 09:22:55 llm_engine.py:73] Initializing an LLM engine with config: model='/models/openbuddy-mixtral-8x7b-v15.2-AWQ', tokenizer='/models/openbuddy-mixtral-8x7b-v15.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=awq, enforce_eager=False, seed=0)
(RayWorkerVllm pid=2491) /usr/local/lib/python3.10/dist-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
(RayWorkerVllm pid=2491)   warnings.warn("Initializing zero-element tensors is a no-op")
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 729, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers_ray(placement_group)
  File "/workspace/vllm/engine/llm_engine.py", line 195, in _init_workers_ray
    self._run_workers(
  File "/workspace/vllm/engine/llm_engine.py", line 755, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/workspace/vllm/engine/llm_engine.py", line 732, in _run_workers_in_batch
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2563, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(KeyError): ray::RayWorkerVllm.execute_method() (pid=2492, ip=172.17.0.2, actor_id=ccdc00b5ccaf06b948a44c5301000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f3cba935990>)
  File "/workspace/vllm/engine/ray_utils.py", line 31, in execute_method
    return executor(*args, **kwargs)
  File "/workspace/vllm/worker/worker.py", line 79, in load_model
    self.model_runner.load_model()
  File "/workspace/vllm/worker/model_runner.py", line 57, in load_model
    self.model = get_model(self.model_config)
  File "/workspace/vllm/model_executor/model_loader.py", line 72, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/workspace/vllm/model_executor/models/mixtral.py", line 430, in load_weights
    param = params_dict[name]
KeyError: 'model.layers.26.block_sparse_moe.experts.0.w2.qweight'
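For reference, this corresponds roughly to the offline call below; it is only a sketch reconstructed from the logged engine config above (local model path, quantization=awq, tensor_parallel_size=2, dtype=float16), with a placeholder prompt.

from vllm import LLM

# Sketch only: arguments taken from the engine config in the log above;
# the prompt is a placeholder.
llm = LLM(
    model="/models/openbuddy-mixtral-8x7b-v15.2-AWQ",
    quantization="awq",
    tensor_parallel_size=2,
    trust_remote_code=True,
    dtype="float16",
)
outputs = llm.generate("Hello, my name is")  # never reached: loading fails with the KeyError above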
casper-hansen commented 10 months ago

I have an example script that works with Mixtral:

https://github.com/casper-hansen/AutoAWQ/blob/main/examples/basic_vllm.py

orellavie1212 commented 10 months ago

I have an example script that works with Mixtral:

https://github.com/casper-hansen/AutoAWQ/blob/main/examples/basic_vllm.py

Checking it right now: https://github.com/casper-hansen/AutoAWQ/blob/main/examples/mixtral_quant.py. I hope this is the configuration you used for your model on HF: https://huggingface.co/casperhansen/mixtral-instruct-awq

Not working, same error:

[WARN ] PyProcess - W-181-model-stderr: KeyError: 'model.layers.0.block_sparse_moe.experts.0.w1.qweight'

 


Maybe share your requirements.txt?

orellavie1212 commented 10 months ago

I thought the solution for general Mixtral (not quantized GPTQ or AWQ, just the regular one) was to load it via .pt files: https://huggingface.co/IbuNai/Mixtral-8x7B-v0.1-gptq-4bit-pth/tree/main. Even with that repo, which ships .bin files (I found no .pt on HF), I get the same problem.

casper-hansen commented 10 months ago

I just used the following Docker image and ran pip install vllm

runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04

orellavie1212 commented 10 months ago

I just used the following Docker image and ran pip install vllm

runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04

I am using the DJL container v25 with the same setup (py3.10, torch 2.1.1, CUDA 12.1).

casper-hansen commented 10 months ago

Could you try the Docker image I referenced to see if it's an environment issue?

zsplus commented 10 months ago

Could you try the Docker image I referenced to see if it's an environment issue?

tp=1 works, but tp=2 errors; I found different named_parameters in the two RayWorkers.

orellavie1212 commented 10 months ago

Could you try the Docker image I referenced to see if it's an environment issue?

tp=1 works, but tp=2 errors; I found different named_parameters in the two RayWorkers.

I am also using tp=4, which is failing.

casper-hansen commented 10 months ago

Not sure if this relates to #2203. Does it work in FP16 with TP > 1?

orellavie1212 commented 10 months ago

Not sure if this relates to #2203. Does it work in FP16 with TP > 1?

I also tried fp16 besides auto.

kk3dmax commented 10 months ago

I have the same problem when TP = 2.

casper-hansen commented 10 months ago

Tagging @WoosukKwon @zhuohan123 for visibility. Seems Mixtral has issues with TP > 1 when using AWQ.

iibw commented 10 months ago

I'm also having this issue after a fresh quantization of Mixtral 8x7b instruct. There is no issue when running directly with AutoAWQ across multiple GPUs. Only when using vLLM across multiple GPUs does the error occur.

Example failing vLLM code

from vllm import LLM

llm = LLM("mistralai_Mixtral-8x7B-Instruct-v0.1-awq", quantization="AWQ", tensor_parallel_size=4)
outputs = llm.generate("Hello my name is")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Example working AutoAWQ code

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "mistralai_Mixtral-8x7B-Instruct-v0.1-awq"
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)

tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

text = "Hello my name is"
tokens = tokenizer(text, return_tensors="pt").input_ids.cuda()
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)

Quantization code

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai_Mixtral-8x7B-Instruct-v0.1'
quant_path = 'mistralai_Mixtral-8x7B-Instruct-v0.1-awq'
modules_to_not_convert = ["gate"]
quant_config = {
    "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM",
    "modules_to_not_convert": modules_to_not_convert
}

# Load model
# NOTE: pass safetensors=True to load safetensors
model = AutoAWQForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, safetensors=True, device_map="cpu", **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(
    tokenizer,
    quant_config=quant_config,
    modules_to_not_convert=modules_to_not_convert
)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
MileSquareDevelopers commented 10 months ago

I'm also having this issue with the AWQ model

floleuerer commented 10 months ago

+1 with TP=2

casper-hansen commented 10 months ago

Woosuk said it should be fixed in the new 0.2.7 by PR #2208. Could someone verify with the AWQ version?

Reference: https://github.com/vllm-project/vllm/issues/2332#issuecomment-18761736055

thiner commented 10 months ago

I don't have an AWQ model currently, but I tested with a GPTQ model, and it's working fine now!

MileSquareDevelopers commented 10 months ago

@casper-hansen and @thiner I can confirm the Mixtral models load in both AWQ and GPTQ

MileSquareDevelopers commented 10 months ago

Actually, the model now loads, but I can't get any tokens processed.

When I do llm.generate(prompts), it just hangs.

iibw commented 10 months ago

I was able to get both GPTQ and AWQ working with tp=4. It took a long time to load the model in my case, but eventually it loaded and then generation happened instantly.

@MileSquareDevelopers maybe you need to wait a bit longer for it to load? If you put a print statement between the code that loads the LLM and the llm.generate code, you'll probably see it's never printed out and the code never reaches llm.generate.
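A minimal version of that check, reusing the failing call from further up in this thread (the print statements are the only addition):

from vllm import LLM

print("Loading model...")
llm = LLM("mistralai_Mixtral-8x7B-Instruct-v0.1-awq", quantization="AWQ", tensor_parallel_size=4)
print("Model loaded")  # if this never prints, the hang is in model loading, not in generate()

outputs = llm.generate("Hello my name is")
print("Generation finished")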

joennlae commented 10 months ago

For me, both AWQ and GPTQ load, but AWQ just produces all-zero token IDs as output.

Commands used:

# run on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  --quantization gptq  --max-model-len 16384 --gpu-memory-utilization 0.98 --enforce-eager --dtype half -tp 2

# works on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ"  --quantization awq  --max-model-len 8192 --gpu-memory-utilization 0.98 --enforce-eager --dtype auto -tp 2

Output from AWQ:

outputs=[CompletionOutput(index=0, text='', token_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], cumulative_logprob=nan, logprobs=None, finish_reason=length)], finished=True)

Whereas GPTQ gives me useful output.

iibw commented 10 months ago

I used my own AWQ quantization. Try quantizing it yourself and maybe that will fix the problem.

casper-hansen commented 10 months ago

For me, both AWQ and GPTQ load, but AWQ just produces all-zero token IDs as output.

Commands used:

# run on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  --quantization gptq  --max-model-len 16384 --gpu-memory-utilization 0.98 --enforce-eager --dtype half -tp 2

# works on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ"  --quantization awq  --max-model-len 8192 --gpu-memory-utilization 0.98 --enforce-eager --dtype auto -tp 2

Output from AWQ:

outputs=[CompletionOutput(index=0, text='', token_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], cumulative_logprob=nan, logprobs=None, finish_reason=length)], finished=True)

Whereas GPTQ gives me useful output.

Can you try with my vLLM offline example?

https://github.com/casper-hansen/AutoAWQ/blob/main/examples/basic_vllm.py

kniteli commented 10 months ago

@casper-hansen I can confirm that works, for me at least.

joennlae commented 10 months ago

For me, both AWQ and GPTQ load, but AWQ just produces all-zero token IDs as output. Commands used:

# run on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  --quantization gptq  --max-model-len 16384 --gpu-memory-utilization 0.98 --enforce-eager --dtype half -tp 2

# works on 2x 4090GTX
CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ"  --quantization awq  --max-model-len 8192 --gpu-memory-utilization 0.98 --enforce-eager --dtype auto -tp 2

Output from AWQ:

outputs=[CompletionOutput(index=0, text='', token_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], cumulative_logprob=nan, logprobs=None, finish_reason=length)], finished=True)

Whereas GPTQ gives me useful output.

Can you try with my vLLM offline example?

https://github.com/casper-hansen/AutoAWQ/blob/main/examples/basic_vllm.py

I tested your model. It works 😄 Thank you very much.

I do believe you have an issue with the tokenizer. If something generates a number that includes the digit 2 (the same ID as the eos_token), the generation finishes right then and there. The same issue occurs when I use the default Mixtral tokenizer:

python -m vllm.entrypoints.openai.api_server --model="casperhansen/mixtral-instruct-awq"  --quantization awq --max-model-len 8192 --gpu-memory-utilization 0.98 --enforce-eager --dtype auto -tp 2 --tokenizer "mistralai/Mixtral-8x7B-Instruct-v0.1"

Regarding the tokenizer, I did some digging:

Here: https://github.com/vllm-project/vllm/blob/937e7b7d7c460c00805ac358a4873ec0653ab2f5/vllm/engine/llm_engine.py#L764

        # Check if the sequence has generated the EOS token.
        if ((not sampling_params.ignore_eos)
                and seq.get_last_token_id() == self.tokenizer.eos_token_id):
            seq.status = SequenceStatus.FINISHED_STOPPED
            return
For the generated digit 2, seq.get_last_token_id() equals 2, which is the eos_token_id here, whereas the same text is token 28750 with the mistralai/Mixtral-8x7B-Instruct-v0.1 tokenizer.
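One way to sanity-check this is to encode the digit with both tokenizers mentioned above and compare the IDs against eos_token_id; this is just an illustration using the two model names from this thread (the values reported above were 2 vs. 28750):

from transformers import AutoTokenizer

for name in ["casperhansen/mixtral-instruct-awq", "mistralai/Mixtral-8x7B-Instruct-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode("2", add_special_tokens=False)
    print(f"{name}: ids={ids}, eos_token_id={tok.eos_token_id}")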

eschmidbauer commented 10 months ago

I'm running into a similar issue with the latest stable release on 2x 4090s.

python -m vllm.entrypoints.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
 --dtype auto --tokenizer mistralai/Mixtral-8x7B-Instruct-v0.1 \
 --quantization awq --trust-remote-code \
 --tensor-parallel-size 2 --gpu-memory-utilization 0.98 --enforce-eager

The server never fully loads. It just hangs on:

WARNING 01-05 17:01:33 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-05 17:01:34,848 INFO worker.py:1724 -- Started a local Ray instance.
INFO 01-05 17:01:36 llm_engine.py:70] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, load_format=auto, tensor_parallel_size=2, quantization=awq, enforce_eager=True, seed=0)
thomasfloqs commented 7 months ago

Hi @joennlae, thanks a lot for noticing this issue with the tokenizer and the prediction of the number 2. I am facing this issue too. Did you manage to find a fix?

joennlae commented 7 months ago

Hi @joennlae, thanks a lot for noticing this issue with the tokenizer and the prediction of the number 2. I am facing this issue too. Did you manage to find a fix?

It is not an issue with the tokenizer. I saw that there is a high chance of Mixtral generating an end token, especially if dates/numbers are involved. I did some investigation, but I stopped.

Some of the results from back then can be found here: https://github.com/joennlae/vllm/blob/019ee402923d43cb225afaf356d559556d615aef/write_up.md

Also, I was not able to reproduce this issue with TGI.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!