Can you show the server-side stack trace?
I did; it only shows "Internal Server Error". As I mentioned, there's literally nothing else in the server trace except that, no stack trace or anything.
To better debug the issue, can you use guided decoding in offline inference via the LLM.chat method? That should show the full stack trace.
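For reference, a minimal sketch of such an offline reproduction, assuming a vLLM build that exposes GuidedDecodingParams in vllm.sampling_params (recent releases do; the exact offline guided-decoding API may differ in the version used here) and an illustrative JSON schema and prompt:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="/root/models/mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    guided_decoding_backend="outlines",
)

# Toy schema used only for illustration.
schema = {"type": "object", "properties": {"city": {"type": "string"}}}

params = SamplingParams(
    max_tokens=64,
    guided_decoding=GuidedDecodingParams(json=schema),
)

# Any failure here should surface a full Python stack trace instead of a
# bare "Internal Server Error" from the API server.
outputs = llm.chat(
    [{"role": "user", "content": "Return a JSON object with a city name."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)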
I have the same issue. In the environment I have access to, the following just hangs :( and the API server throws an internal server error.
from vllm import LLM

llm = LLM(
    model="/root/models/mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    served_model_name="mistralai/Pixtral-12B-2409",
    max_model_len=5 * 4096,
    guided_decoding_backend="outlines",
    limit_mm_per_prompt={"image": 5},
    tensor_parallel_size=4,
)
I don't think guided decoding with the outlines backend officially supports the Mistral tokenizer (we still need to double-check this), and I don't think it's really vLLM's responsibility to make sure they work with each other if they don't. However, if they are indeed incompatible, then we should disable guided decoding when the Mistral tokenizer is present.
Perhaps @patrickvonplaten you might have some thoughts on this?
For now, can we raise a NotImplementedError with an error message that asks for a contribution if people are interested in this feature?
Yea, I think that's a good idea and something rather straightforward to do!
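A hypothetical sketch of what such a guard could look like (the function and argument names here are illustrative, not vLLM's actual internals):

def check_guided_decoding_support(guided_decoding_backend: str,
                                  tokenizer_mode: str) -> None:
    # Hypothetical validation helper; names are illustrative, not vLLM code.
    if tokenizer_mode == "mistral" and guided_decoding_backend == "outlines":
        raise NotImplementedError(
            "Guided decoding with the 'outlines' backend is not supported "
            "with the Mistral tokenizer yet. Contributions to add support "
            "are welcome!")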
The latest code of lm-format-enforcer should now be compatible with the MistralTokenizer. There is no release yet, but installing the library from main should do the trick:
pip install git+https://github.com/noamgat/lm-format-enforcer.git --force-reinstall
@stikkireddy your code should run now if you switch the guided_decoding_backend to lm-format-enforcer:
from vllm import LLM

llm = LLM(
    model="/root/models/mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",
    served_model_name="mistralai/Pixtral-12B-2409",
    max_model_len=5 * 4096,
    guided_decoding_backend="lm-format-enforcer",
    limit_mm_per_prompt={"image": 5},
    tensor_parallel_size=4,
)
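To then exercise guided decoding, a request along the following lines could be added on top of the llm above (a sketch only; it assumes this vLLM version exposes GuidedDecodingParams, and the prompt and choices are illustrative):

from vllm import SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain the output to a fixed set of choices (illustrative values).
params = SamplingParams(
    max_tokens=16,
    guided_decoding=GuidedDecodingParams(choice=["cat", "dog", "other"]),
)

outputs = llm.chat(
    [{"role": "user", "content": "Is the animal in the image a cat, a dog, or other?"}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)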
Your current environment
All seems to work when sending an image query, but as soon as I try any simple guided_json or guided_choice request, it always fails.
The request just gives an "Internal Server Error" response, and vLLM shows nothing else in its server log (no stack trace).
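For reference, a simple guided_choice request against the vLLM OpenAI-compatible server generally takes this shape (a sketch using the official openai client; the base URL, model name, prompt, and choices are illustrative, not the exact request from this report):

from openai import OpenAI

# Illustrative values; adjust the base URL and model name to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",
    messages=[{"role": "user", "content": "Answer with 'cat' or 'dog'."}],
    # vLLM-specific extension passed through the extra body of the request.
    extra_body={"guided_choice": ["cat", "dog"]},
)
print(completion.choices[0].message.content)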
Model Input Dumps
No response
🐛 Describe the bug
See above.