sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

[Bug] OpenAI Compatible Prompt Template Error #1265

Closed BabyChouSr closed 2 months ago

BabyChouSr commented 2 months ago

Describe the bug

I noticed that the applied prompt template is incorrect because the messages are not parsed correctly. More specifically, in openai_api/adapter.py, we use the Hugging Face tokenizer to apply the chat template:

prompt_ids = tokenizer_manager.tokenizer.apply_chat_template(
    request.messages, tokenize=True, add_generation_prompt=True
)

But request.messages is still a list of Pydantic models, so it gets tokenized incorrectly by the Hugging Face tokenizer.

For example, if request.messages is:

[ChatCompletionMessageUserParam(role='user', content=[ChatCompletionMessageContentImagePart(type='image_url', image_url=ChatCompletionMessageContentImageURL(url='https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png', detail='auto')), ChatCompletionMessageContentTextPart(type='text', text='Describe this image')]), ChatCompletionMessageGenericParam(role='assistant', content='This image is really fun.'), ChatCompletionMessageUserParam(role='user', content='Tell me a story about the image')]

We get the following prompt IDs and decoded prompt:

[128000, 128006, 882, 128007, 271, 58, 16047, 34290, 2097, 2831, 1945, 5920, 5930, 1151, 1843, 2975, 518, 2217, 2975, 28, 16047, 34290, 2097, 2831, 1945, 3222, 6659, 1151, 2485, 1129, 1059, 52027, 916, 14, 2034, 75, 34796, 14, 2034, 5317, 15711, 22158, 29647, 3592, 518, 7872, 1151, 3989, 47359, 13149, 34290, 2097, 2831, 1199, 5920, 5930, 1151, 1342, 518, 1495, 1151, 75885, 420, 2217, 52128, 128009, 128006, 78191, 128007, 271, 2028, 2217, 374, 2216, 2523, 13, 128009, 128006, 882, 128007, 271, 41551, 757, 264, 3446, 922, 279, 2217, 128009, 128006, 78191, 128007, 271]
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

[ChatCompletionMessageContentImagePart(type='image_url', image_url=ChatCompletionMessageContentImageURL(url='https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png', detail='auto')), ChatCompletionMessageContentTextPart(type='text', text='Describe this image')]<|eot_id|><|start_header_id|>assistant<|end_header_id|>

This image is really fun.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell me a story about the image<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I have a suspicion that this is because Hugging Face does not know that we are passing Pydantic data models, so it just renders the content field of each message, which ends up being a stringified version of the Pydantic data models.
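
To sanity-check this, here is a minimal standalone sketch of the behavior I suspect: Jinja, which Hugging Face chat templates are written in, falls back to attribute access for message['content'] and renders non-string values with str(), so the repr of the Pydantic models leaks into the prompt. The TextPart and UserMessage classes below are simplified stand-ins I made up, not sglang's actual request models:

from typing import Any

from jinja2 import Template
from pydantic import BaseModel

# Simplified, hypothetical stand-ins for sglang's Pydantic request models.
class TextPart(BaseModel):
    type: str
    text: str

class UserMessage(BaseModel):
    role: str
    content: Any

msg = UserMessage(
    role="user",
    content=[TextPart(type="text", text="Describe this image")],
)

# The same access pattern the Llama 3 chat template uses: Jinja resolves
# message['content'] via getattr and renders the list with str().
template = Template("{{ message['role'] }}: {{ message['content'] }}")
print(template.render(message=msg))
# user: [TextPart(type='text', text='Describe this image')]

If that is the cause, one possible fix would be to dump the models to plain dicts before applying the chat template, e.g. [m.model_dump() for m in request.messages] (assuming Pydantic v2), though I have not checked what else in the adapter relies on the model objects.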

Reproduction

python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000

Client script:

import openai
import time

working_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png"
                },
            },
            {"type": "text", "text": "Describe this image"},
        ],
    },
    {"role": "assistant", "content": "This image is really fun."},
    {"role": "user", "content": "Tell me a story about the image"},
]

client = openai.Client(api_key="EMPTY", base_url="http://127.0.0.1:30000/v1")
start = time.time()
response = client.chat.completions.create(
    model="llama3-llava-next-8b",
    messages=working_messages,
    temperature=0.3,
    max_tokens=512,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Environment

Python: 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA L4
GPU 0 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.183.01
PyTorch: 2.4.0+cu121
sglang: 0.2.14.post2
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.65.0
numpy: 1.23.5
aiohttp: 3.9.5
fastapi: 0.109.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 23.0
PIL: 9.2.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.22.0
uvloop: 0.19.0
zmq: 26.0.3
vllm: 0.5.5
multipart: 0.0.9
openai: 1.42.0
anthropic: 0.34.1
NVIDIA Topology: 
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-15    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576
merrymercy commented 2 months ago

If you use lmms-lab/llama3-llava-next-8b, you should not use the chat template from the Hugging Face tokenizer, because that template does not support images.

The correct way is to use the custom chat template for llava-next. You can specify it when you launch the server: https://github.com/sgl-project/sglang/blob/55f5976b42d736f3dfe2f8f9b91a6536c212744a/README.md?plain=1#L246-L247
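
For reference, a launch command along these lines should work; llama-3-instruct-llava is my assumption for the registered template name, so check the linked README for the exact flag value:

python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --chat-template=llama-3-instruct-llava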