vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

[Bug]: Error with structured output inference after upgrade 0.6.2->0.6.3 #9462

Open akepa opened 1 month ago

akepa commented 1 month ago

### Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
/opt/conda/lib/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.1.109-118.189.amzn2023.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 560.35.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R32
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 0
BogoMIPS: 5599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 2 MiB (4 instances)
L3 cache: 16 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.99
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==25.1.2
[pip3] torch==2.4.0
[pip3] torchaudio==2.2.1+cu121
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] nomkl 1.0 h5ca1d4c_0 conda-forge
[conda] numpy 1.26.4 py311h64a7726_0 conda-forge
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.99 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 25.1.2 py311h34ded2d_0 conda-forge
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchaudio 2.2.1+cu121 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.45.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A (dev)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    0-7           0              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

### Model Input Dumps

No response

### 🐛 Describe the bug

After upgrading from version 0.6.2 to 0.6.3 I started getting a validation error when generating structured output.

To reproduce:

  1. Start the server: `vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto`
  2. Execute the following code (in my case, from a Jupyter notebook):

```python
#### OUTPUT DEFINITION

from pydantic import BaseModel, Field
from enum import Enum
from typing import List
from typing import Optional
import json
from openai import OpenAI

class BedType(Enum):
    Twin = "Twin"
    Double = "Double"
    Queen = "Queen"
    King = "King"

class RoomBeds(BaseModel):
    bed_type: BedType = Field(...,description="Type of the bed in the hotel room")
    quantity: int = Field(...,description="Number of beds of the given bed type within the hotel room")

class HotelRoom(BaseModel):
    """
    Represents a hotel room.
    """
    room_id: str = Field(...,description="Id of the room from the input")
    room_name: Optional[str] = Field(...,description="Freetext name of the hotel room")
    room_class: Optional[str] = Field(..., description="Room class of the hotel room.")
    bed_types: Optional[List[RoomBeds]] = Field(..., description="List of beds within the hotel room.")
    smoking_allowed: Optional[bool] = Field(..., description="Flag that indicates whether smoking is allowed or not in the hotel room. Unknown value used if it cannot be inferred from the room description")

class Hotel(BaseModel):
    """
    Represents an entry about a hotel.
    """
    hotel_rooms: List[HotelRoom] = Field(..., description="List of hotel rooms within a hotel")

#### ONLINE INFERENCE

client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="token-abc123",
    )

completion = client.beta.chat.completions.parse(
            seed=42,
            model= "NousResearch/Meta-Llama-3-8B-Instruct",
            messages=[
                {"role": "system", "content": "You are a helpful assistant"},
                {"role": "user", "content": "Generate synthetic data for a fictitious hotel." },
            ],
            temperature=0.8,
            top_p=0.95,
            response_format=Hotel
            )
```
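For reference, the `parse` helper above is just sugar over a regular chat completion with a `json_schema` response format. The sketch below shows roughly the equivalent raw request built from the `Hotel` model; the schema name and the explicit `model_json_schema()` call are illustrative choices of mine, not something vLLM dictates:

```python
#### EQUIVALENT RAW REQUEST (illustrative sketch)

# Roughly what client.beta.chat.completions.parse() sends on the wire:
# a plain chat.completions.create() call whose response_format carries
# the JSON schema derived from the Pydantic model defined above.
raw = client.chat.completions.create(
    seed=42,
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Generate synthetic data for a fictitious hotel."},
    ],
    temperature=0.8,
    top_p=0.95,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "Hotel",                      # illustrative name, my choice
            "schema": Hotel.model_json_schema(),  # schema from the model above
        },
    },
)
print(raw.choices[0].message.content)  # expected: JSON that validates against Hotel
```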

With version 0.6.2 I always got structured output in the specified format. However, after upgrading to 0.6.3 I get a validation error because the response no longer matches the expected schema:


```text
Cell In[10], line 1
----> 1 completion = client.beta.chat.completions.parse(
      2             seed=42,
      3             model= "NousResearch/Meta-Llama-3-8B-Instruct", # "NousResearch/Meta-Llama-3-8B-Instruct", #Hermes-2-Pro-Llama-3-8B-GGUF
      4             messages=[
      5                 {"role": "system", "content": "You are a helpful assistant"},
      6                 {"role": "user", "content": "Generate synthetic data for a fictitious hotel." },
      7             ],
      8             temperature=0.8,
      9             top_p=0.95,
     10             response_format=Hotel
     11             )

File /opt/conda/lib/python3.11/site-packages/openai/resources/beta/chat/completions.py:150, in Completions.parse(self, messages, model, response_format, frequency_penalty, function_call, functions, logit_bias, logprobs, max_completion_tokens, max_tokens, metadata, n, parallel_tool_calls, presence_penalty, seed, service_tier, stop, store, stream_options, temperature, tool_choice, tools, top_logprobs, top_p, user, extra_headers, extra_query, extra_body, timeout)
    143 def parser(raw_completion: ChatCompletion) -> ParsedChatCompletion[ResponseFormatT]:
    144     return _parse_chat_completion(
    145         response_format=response_format,
    146         chat_completion=raw_completion,
    147         input_tools=tools,
    148     )
--> 150 return self._post(
    151     "/chat/completions",
    152     body=maybe_transform(
    153         {
    154             "messages": messages,
    155             "model": model,
    156             "frequency_penalty": frequency_penalty,
    157             "function_call": function_call,
    158             "functions": functions,
    159             "logit_bias": logit_bias,
    160             "logprobs": logprobs,
    161             "max_completion_tokens": max_completion_tokens,
    162             "max_tokens": max_tokens,
    163             "metadata": metadata,
    164             "n": n,
    165             "parallel_tool_calls": parallel_tool_calls,
    166             "presence_penalty": presence_penalty,
    167             "response_format": _type_to_response_format(response_format),
    168             "seed": seed,
    169             "service_tier": service_tier,
    170             "stop": stop,
    171             "store": store,
    172             "stream": False,
    173             "stream_options": stream_options,
    174             "temperature": temperature,
    175             "tool_choice": tool_choice,
    176             "tools": tools,
    177             "top_logprobs": top_logprobs,
    178             "top_p": top_p,
    179             "user": user,
    180         },
    181         completion_create_params.CompletionCreateParams,
    182     ),
    183     options=make_request_options(
    184         extra_headers=extra_headers,
    185         extra_query=extra_query,
    186         extra_body=extra_body,
    187         timeout=timeout,
    188         post_parser=parser,
    189     ),
    190     # we turn the `ChatCompletion` instance into a `ParsedChatCompletion`
    191     # in the `parser` function above
    192     cast_to=cast(Type[ParsedChatCompletion[ResponseFormatT]], ChatCompletion),
    193     stream=False,
    194 )

File /opt/conda/lib/python3.11/site-packages/openai/_base_client.py:1277, in SyncAPIClient.post(self, path, cast_to, body, options, files, stream, stream_cls)
   1263 def post(
   1264     self,
   1265     path: str,
   (...)
   1272     stream_cls: type[_StreamT] | None = None,
   1273 ) -> ResponseT | _StreamT:
   1274     opts = FinalRequestOptions.construct(
   1275         method="post", url=path, json_data=body, files=to_httpx_files(files), **options
   1276     )
-> 1277     return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))

File /opt/conda/lib/python3.11/site-packages/openai/_base_client.py:954, in SyncAPIClient.request(self, cast_to, options, remaining_retries, stream, stream_cls)
    951 else:
    952     retries_taken = 0
--> 954 return self._request(
    955     cast_to=cast_to,
    956     options=options,
    957     stream=stream,
    958     stream_cls=stream_cls,
    959     retries_taken=retries_taken,
    960 )

File /opt/conda/lib/python3.11/site-packages/openai/_base_client.py:1060, in SyncAPIClient._request(self, cast_to, options, retries_taken, stream, stream_cls)
   1057     log.debug("Re-raising status error")
   1058     raise self._make_status_error_from_response(err.response) from None
-> 1060 return self._process_response(
   1061     cast_to=cast_to,
   1062     options=options,
   1063     response=response,
   1064     stream=stream,
   1065     stream_cls=stream_cls,
   1066     retries_taken=retries_taken,
   1067 )

File /opt/conda/lib/python3.11/site-packages/openai/_base_client.py:1159, in SyncAPIClient._process_response(self, cast_to, options, response, stream, stream_cls, retries_taken)
   1156 if bool(response.request.headers.get(RAW_RESPONSE_HEADER)):
   1157     return cast(ResponseT, api_response)
-> 1159 return api_response.parse()

File /opt/conda/lib/python3.11/site-packages/openai/_response.py:319, in APIResponse.parse(self, to)
    317 parsed = self._parse(to=to)
    318 if is_given(self._options.post_parser):
--> 319     parsed = self._options.post_parser(parsed)
    321 if isinstance(parsed, BaseModel):
    322     add_request_id(parsed, self.request_id)

File /opt/conda/lib/python3.11/site-packages/openai/resources/beta/chat/completions.py:144, in Completions.parse.<locals>.parser(raw_completion)
    143 def parser(raw_completion: ChatCompletion) -> ParsedChatCompletion[ResponseFormatT]:
--> 144     return _parse_chat_completion(
    145         response_format=response_format,
    146         chat_completion=raw_completion,
    147         input_tools=tools,
    148     )

File /opt/conda/lib/python3.11/site-packages/openai/lib/_parsing/_completions.py:110, in parse_chat_completion(response_format, input_tools, chat_completion)
    100             else:
    101                 tool_calls.append(tool_call)
    103     choices.append(
    104         construct_type_unchecked(
    105             type_=cast(Any, ParsedChoice)[solve_response_format_t(response_format)],
    106             value={
    107                 **choice.to_dict(),
    108                 "message": {
    109                     **message.to_dict(),
--> 110                     "parsed": maybe_parse_content(
    111                         response_format=response_format,
    112                         message=message,
    113                     ),
    114                     "tool_calls": tool_calls,
    115                 },
    116             },
    117         )
    118     )
    120 return cast(
    121     ParsedChatCompletion[ResponseFormatT],
    122     construct_type_unchecked(
   (...)
    128     ),
    129 )

File /opt/conda/lib/python3.11/site-packages/openai/lib/_parsing/_completions.py:161, in maybe_parse_content(response_format, message)
    155 def maybe_parse_content(
    156     *,
    157     response_format: type[ResponseFormatT] | ResponseFormatParam | NotGiven,
    158     message: ChatCompletionMessage | ParsedChatCompletionMessage[object],
    159 ) -> ResponseFormatT | None:
    160     if has_rich_response_format(response_format) and message.content is not None and not message.refusal:
--> 161         return _parse_content(response_format, message.content)
    163     return None

File /opt/conda/lib/python3.11/site-packages/openai/lib/_parsing/_completions.py:221, in _parse_content(response_format, content)
    219 def _parse_content(response_format: type[ResponseFormatT], content: str) -> ResponseFormatT:
    220     if is_basemodel_type(response_format):
--> 221         return cast(ResponseFormatT, model_parse_json(response_format, content))
    223     if is_dataclass_like_type(response_format):
    224         if not PYDANTIC_V2:

File /opt/conda/lib/python3.11/site-packages/openai/_compat.py:166, in model_parse_json(model, data)
    164 def model_parse_json(model: type[_ModelT], data: str | bytes) -> _ModelT:
    165     if PYDANTIC_V2:
--> 166         return model.model_validate_json(data)
    167     return model.parse_raw(data)

File /opt/conda/lib/python3.11/site-packages/pydantic/main.py:625, in BaseModel.model_validate_json(cls, json_data, strict, context)
    623 # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    624 __tracebackhide__ = True
--> 625 return cls.__pydantic_validator__.validate_json(json_data, strict=strict, context=context)

ValidationError: 1 validation error for Hotel
  Invalid JSON: expected ident at line 1 column 2 [type=json_invalid, input_value='I\'d be happy to help ge... requests or questions.', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/json_invalid
```
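The `input_value` in the error shows the model answered with plain prose ("I'd be happy to help ge..."), so it looks like the server is no longer constraining decoding to the schema. As a stop-gap I would expect vLLM's `guided_json` extra body parameter to still enforce the schema; this is only a sketch and I have not verified it against 0.6.3:

```python
#### POSSIBLE WORKAROUND (sketch, not verified on 0.6.3)

# Pass the JSON schema explicitly via vLLM's guided_json extra parameter
# instead of relying on response_format, then validate the raw content
# with the Pydantic model defined above.
resp = client.chat.completions.create(
    seed=42,
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Generate synthetic data for a fictitious hotel."},
    ],
    temperature=0.8,
    top_p=0.95,
    extra_body={"guided_json": Hotel.model_json_schema()},
)
hotel = Hotel.model_validate_json(resp.choices[0].message.content)
```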

### Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
gcalmettes commented 1 month ago

It should be fixed on main via https://github.com/vllm-project/vllm/pull/9530
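Once you are on a build that includes that change, re-running the snippet from the issue should return a parsed `Hotel` again; a minimal check (reusing the client and model definitions above) would look something like:

```python
# Minimal re-check after upgrading to a build that contains the fix;
# reuses the OpenAI client and the Hotel model from the reproduction above.
completion = client.beta.chat.completions.parse(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Generate synthetic data for a fictitious hotel."}],
    response_format=Hotel,
)
assert completion.choices[0].message.parsed is not None
print(completion.choices[0].message.parsed)
```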