vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Outlines w/ Mistral - 'MistralTokenizer' object has no attribute 'eos_token' #10138

Closed: matbee-eth closed this issue 1 week ago

matbee-eth commented 2 weeks ago

Your current environment

The output of `python collect_env.py`:

```sh
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host -p 8000:8000 \
    vllm/vllm-openai --model mistralai/Pixtral-12B-2409 --tensor-parallel-size 2 \
    --max-model-len 32768 --max-seq-len-to-capture=32768 --kv-cache-dtype fp8 \
    --tokenizer_mode "mistral" --limit-mm-per-prompt image=4
```

Model Input Dumps

```
INFO:     172.17.0.1:51966 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 73, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 315, in create_chat_completion
    generator = await chat(raw_request).create_chat_completion(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 268, in create_chat_completion
    return await self.chat_completion_full_generator(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 624, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 458, in iterate_with_cancellation
    item = await awaits[0]
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 547, in _process_request
    params = await \
             ^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 528, in build_guided_decoding_logits_processor_async
    processor = await get_guided_decoding_logits_processor(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/guided_decoding/__init__.py", line 14, in get_guided_decoding_logits_processor
    return await get_outlines_guided_decoding_logits_processor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/guided_decoding/outlines_decoding.py", line 72, in get_outlines_guided_decoding_logits_processor
    return await loop.run_in_executor(global_thread_pool,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/guided_decoding/outlines_decoding.py", line 131, in _get_logits_processor
    return CFGLogitsProcessor(guide, tokenizer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 165, in __init__
    super().__init__(CFGLogitsProcessor._get_guide(cfg, tokenizer))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/outlines/caching.py", line 122, in wrapper
    result = cached_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 152, in _get_guide
    return CFGGuide(cfg, tokenizer)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/outlines/fsm/guide.py", line 272, in __init__
    self.terminal_regexps["$END"] = tokenizer.eos_token
                                    ^^^^^^^^^^^^^^^^^^^
AttributeError: 'MistralTokenizer' object has no attribute 'eos_token'. Did you mean: 'eos_token_id'?
```

🐛 Describe the bug

Using Outlines with a model served through the Mistral tokenizer fails because `MistralTokenizer` exposes `eos_token_id` but not `eos_token`, which Outlines' `CFGGuide` tries to read.
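
For illustration only, here is a minimal sketch of the mismatch: Outlines' `CFGGuide` reads `tokenizer.eos_token`, while vLLM's `MistralTokenizer` only exposes `eos_token_id`. The `MistralTokenizerShim` below is hypothetical (not part of vLLM or Outlines) and assumes the wrapped tokenizer provides `eos_token_id` and a `decode` method:

```python
class MistralTokenizerShim:
    """Hypothetical wrapper exposing the attribute Outlines' CFGGuide expects."""

    def __init__(self, mistral_tokenizer):
        self._tok = mistral_tokenizer

    @property
    def eos_token_id(self) -> int:
        return self._tok.eos_token_id

    @property
    def eos_token(self) -> str:
        # CFGGuide assigns this string to terminal_regexps["$END"];
        # assumes decode() accepts a list of token ids.
        return self._tok.decode([self._tok.eos_token_id])

    def __getattr__(self, name):
        # Delegate everything else to the underlying tokenizer.
        return getattr(self._tok, name)
```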


matbee-eth commented 2 weeks ago

It looks like Outlines may have already updated this code, but the Docker image does not ship an up-to-date Outlines version.
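
To confirm which Outlines version the image actually ships, a minimal check (assumes you can open a Python shell inside the vllm/vllm-openai container):

```python
# Print the installed Outlines version from inside the container's
# Python environment.
from importlib.metadata import version

print(version("outlines"))
```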

DarkLight1337 commented 2 weeks ago

Which version of outlines are you using?

matbee-eth commented 2 weeks ago

> Which version of outlines are you using?

I'm not sure; it's the vllm-openai Docker image.

Here is how I'm testing it:

```python
import requests

url = "http://localhost:8000/v1/chat/completions"

headers = {
    "Content-Type": "application/json"
}

messages: list[dict[str, str]] = [
    {"content": "You are a helpful AI assistant", "role": "system"},
    {"content": "go to google", "role": "user"},
    {"content": "Below I will present you a request. Before we begin addressing the request, please answer the following pre-survey to the best of your ability. Keep in mind that you are Ken Jennings-level with trivia, and Mensa-level with puzzles, so there should be a deep well to draw from.\n\nHere is the request:\n\ngo to google\n\nHere is the pre-survey:\n\n    1. Please list any specific facts or figures that are GIVEN in the request itself. It is possible that there are none.\n    2. Please list any facts that may need to be looked up, and WHERE SPECIFICALLY they might be found. In some cases, authoritative sources are mentioned in the request itself.\n    3. Please list any facts that may need to be derived (e.g., via logical deduction, simulation, or computation)\n    4. Please list any facts that are recalled from memory, hunches, well-reasoned guesses, etc.\n\nWhen answering this survey, keep in mind that \"facts\" will typically be specific names, dates, statistics, etc. Your answer should use headings:\n\n    1. GIVEN OR VERIFIED FACTS\n    2. FACTS TO LOOK UP\n    3. FACTS TO DERIVE\n    4. EDUCATED GUESSES\n\nDO NOT include any other headings or sections in your response. DO NOT list next steps or plans until asked to do so.\n", "role": "user"},
    {"content": "Certainly! Here is the pre-survey response based on the request \"go to google\":\n\n### 1. GIVEN OR VERIFIED FACTS\n- The request is to \"go to google.\"\n\n### 2. FACTS TO LOOK UP\n- None. The request itself does not require any additional information to look up.\n\n### 3. FACTS TO DERIVE\n- None. The request is straightforward and does not require logical deduction or computation.\n\n### 4. EDUCATED GUESSES\n- The user likely intends to perform a web search using the Google search engine.\n- The user might be looking for specific information, general browsing, or accessing Google services such as Gmail, Google Maps, or Google Drive.", "role": "assistant"},
    {"content": "Fantastic. To address this request we have assembled the following team:\n\nWebSurfer: A helpful assistant with access to a web browser. Ask them to perform web searches, open pages, and interact with content (e.g., clicking links, scrolling the viewport, etc., filling in form fields, etc.) It can also summarize the entire page, or answer questions based on the content of the page. It can also be asked to sleep and wait for pages to load, in cases where the pages seem to be taking a while to load.\nCoder: A helpful and general-purpose AI assistant that has strong language skills, Python skills, and Linux command line skills.\nExecutor: A agent for executing code\nfile_surfer: An agent that can handle local files.\n\n\nBased on the team composition, and known and unknown facts, please devise a short bullet-point plan for addressing the original request. Remember, there is no requirement to involve all team members -- a team member's particular expertise may not be needed for this task.", "role": "user"}
]

payload: dict[str, str | list[dict[str, str]] | dict[str, str] | bool] = {
    "messages": messages,
    "model": "mistralai/Pixtral-12B-2409",
    "response_format": {"type": "json_object"},
    "stream": False
}

response = requests.post(url, json=payload, headers=headers)

# Print response
print(response.status_code)
print(response.json())
```

DarkLight1337 commented 2 weeks ago

I just remembered that Mistral tokenizer doesn't support outlines. https://github.com/vllm-project/vllm/issues/9359#issuecomment-2412840803

DarkLight1337 commented 1 week ago

Please open an issue on their repo to request support.