
[Bug]: Pixtral + guided_json fails with Internal Server Error #8429

Closed: pseudotensor closed this issue 1 month ago

pseudotensor commented 2 months ago

Your current environment

docker pull vllm/vllm-openai:latest
docker stop pixtral ; docker remove pixtral
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=MIG-2ea01c20-8e9b-54a7-a91b-f308cd216a95"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.cache/huggingface:$HOME/.cache/huggingface \
    -v "${HOME}"/.cache/huggingface/hub:$HOME/.cache/huggingface/hub \
    -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name pixtral \
    vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=mistralai/Pixtral-12B-2409 \
        --seed 1234 \
        --tensor-parallel-size=1 \
        --gpu-memory-utilization 0.98 \
        --enforce-eager \
        --tokenizer_mode mistral \
        --limit_mm_per_prompt 'image=8' \
        --max-model-len=32768 \
        --max-num-batched-tokens=32768 \
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.pixtral.txt

Everything seems to work when sending an image query, but as soon as I try any simple guided_json or guided_choice request, it always fails.

from openai import OpenAI

base_url = 'http://IP:80/v1'  # replace IP
api_key = "EMPTY"

client_args = dict(base_url=base_url, api_key=api_key)
openai_client = OpenAI(**client_args)

prompt = """<all_documents>
<doc>
<name>roses.pdf</name>
<page>1</page>
<text>
I like red roses, and red elephants.
</text>
</doc>
</all_documents>

<response_format_instructions>

Ensure you follow this JSON schema, and ensure to use the same key names as the schema:
```json
{"color": {"type": "string"}}
```

</response_format_instructions>

What do I like?"""

guided_json = {"type": "object",
    "properties": {
        "color": {
        "type": "string"
        }
        }
}

messages = [{'role': 'user', 'content': prompt}]
stream = False
client_kwargs = dict(model='mistralai/Pixtral-12B-2409',
                     max_tokens=2048, stream=stream, messages=messages,
                     response_format=dict(type='json_object'),
                     extra_body=dict(guided_json=guided_json))
client = openai_client.chat.completions

responses = client.create(**client_kwargs)
text = responses.choices[0].message.content
print(text)

gives:

Traceback (most recent call last):
  File "/home/jon/h2ogpt/check_openai_json1.py", line 51, in <module>
    responses = client.create(**client_kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/openai/_utils/_utils.py", line 274, in wrapper
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 668, in create
    return self._post(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/openai/_base_client.py", line 1259, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/openai/_base_client.py", line 936, in request
    return self._request(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/openai/_base_client.py", line 1025, in _request
    return self._retry_request(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/openai/_base_client.py", line 1074, in _retry_request
    return self._request(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/openai/_base_client.py", line 1025, in _request
    return self._retry_request(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/openai/_base_client.py", line 1074, in _retry_request
    return self._request(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/openai/_base_client.py", line 1040, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500

and vllm shows:

INFO:     172.16.0.101:3146 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     172.16.0.101:3146 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     172.16.0.101:3146 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

Model Input Dumps

No response

🐛 Describe the bug

See above.
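
For completeness, the guided_choice path fails the same way. Below is a minimal sketch of that variant; the choice list is made up for illustration, and the client setup mirrors the script above.

from openai import OpenAI

openai_client = OpenAI(base_url='http://IP:80/v1', api_key='EMPTY')  # replace IP

# guided_choice is a vLLM extension passed via extra_body; the choices here
# are hypothetical and only serve to reproduce the 500 error.
choice_kwargs = dict(
    model='mistralai/Pixtral-12B-2409',
    max_tokens=16,
    messages=[{'role': 'user', 'content': 'What do I like? Answer with one word.'}],
    extra_body=dict(guided_choice=['roses', 'elephants', 'tulips']),
)
print(openai_client.chat.completions.create(**choice_kwargs).choices[0].message.content)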

DarkLight1337 commented 2 months ago

Can you show the server-side stack trace?

pseudotensor commented 2 months ago

I did. It only shows that "Internal Server Error"; as I mentioned, there's literally nothing else in the server log except that, no stack trace or anything.

DarkLight1337 commented 2 months ago

To better debug the issue, can you use guided decoding in offline inference via the LLM.chat method? That should show the full stack trace.
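
A minimal offline repro along those lines might look like the sketch below; it assumes a vLLM version that exposes GuidedDecodingParams for offline guided decoding, and the schema mirrors the one from the failing request.

from vllm import LLM
from vllm.sampling_params import GuidedDecodingParams, SamplingParams

# Offline engine with the same tokenizer settings as the failing server.
llm = LLM(
    model="mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    limit_mm_per_prompt={"image": 8},
    max_model_len=32768,
    enforce_eager=True,
)

# Same JSON schema as the failing online request.
guided = GuidedDecodingParams(
    json={"type": "object", "properties": {"color": {"type": "string"}}}
)
params = SamplingParams(max_tokens=256, guided_decoding=guided)

# If guided decoding is incompatible with the Mistral tokenizer, the full
# stack trace should be printed here instead of being hidden behind a 500.
outputs = llm.chat(
    [{"role": "user", "content": "I like red roses. What color do I like?"}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)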

stikkireddy commented 2 months ago

I have the same issue in the environment I have access to.

The following just hangs :( and the API server throws an internal server error.

from vllm import LLM
llm = LLM(
  model="/root/models/mistralai/Pixtral-12B-2409",
  tokenizer_mode="mistral",
  served_model_name="mistralai/Pixtral-12B-2409",
  max_model_len=5*4096,
  guided_decoding_backend="outlines",
  limit_mm_per_prompt={"image": 5},
  tensor_parallel_size=4,
)
ywang96 commented 2 months ago

I don't think guided_decoding/outlines officially supports the Mistral tokenizer (we still need to double check on this), and I don't think it's really vLLM's responsibility to make sure they work with each other if they don't. However, if they are indeed incompatible, then we should disable guided_decoding when the Mistral tokenizer is present.

Perhaps @patrickvonplaten you might have some thoughts for this?

patrickvonplaten commented 1 month ago

For now, can we raise a NotImplementedError with an error message that asks for a contribution if people are interested in this feature?

ywang96 commented 1 month ago

For now, can we raise a NotImplementedError with an error message that asks for a contribution if people are interested in this feature?

Yea, I think that's a good idea and something rather straightforward to do!
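
A rough sketch of the kind of guard being discussed; the function name, call site, and import path are assumptions for illustration, not vLLM's actual implementation.

from vllm.transformers_utils.tokenizers import MistralTokenizer


def reject_guided_decoding_with_mistral_tokenizer(tokenizer, guided_options) -> None:
    # Fail fast with a clear message instead of surfacing a bare 500 from
    # deep inside the guided-decoding backend.
    if guided_options is not None and isinstance(tokenizer, MistralTokenizer):
        raise NotImplementedError(
            "Guided decoding is not currently supported with the Mistral "
            "tokenizer. Contributions adding support are welcome."
        )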

gcalmettes commented 1 month ago

The latest code of lm-format-enforcer should now be compatible with the MistralTokenizer. There is no release yet, but installing the library from main should do the trick:

pip install git+https://github.com/noamgat/lm-format-enforcer.git --force-reinstall

@stikkireddy your code should now run if you switch guided_decoding_backend to lm-format-enforcer:

from vllm import LLM
llm = LLM(
  model="/root/models/mistralai/Pixtral-12B-2409",
  tokenizer_mode="mistral",
  served_model_name="mistralai/Pixtral-12B-2409",
  max_model_len=5*4096,
  guided_decoding_backend="lm-format-enforcer",
  limit_mm_per_prompt={"image": 5},
  tensor_parallel_size=4,
)
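
For the server setup from the original report, the equivalent would be to either start the container with --guided-decoding-backend lm-format-enforcer or override the backend per request. The sketch below uses the per-request override via extra_body (a vLLM extension, not a standard OpenAI parameter) and assumes lm-format-enforcer is installed inside the serving container as described above.

from openai import OpenAI

openai_client = OpenAI(base_url='http://IP:80/v1', api_key='EMPTY')  # replace IP

responses = openai_client.chat.completions.create(
    model='mistralai/Pixtral-12B-2409',
    max_tokens=256,
    messages=[{'role': 'user', 'content': 'I like red roses. What color do I like?'}],
    extra_body=dict(
        guided_json={"type": "object",
                     "properties": {"color": {"type": "string"}}},
        # Per-request backend selection; assumes a vLLM version that accepts
        # guided_decoding_backend in the request body.
        guided_decoding_backend='lm-format-enforcer',
    ),
)
print(responses.choices[0].message.content)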