sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Bug] OpenAI Compatible Prompt Template Error #1265

Closed BabyChouSr closed 2 months ago

BabyChouSr commented 2 months ago

Describe the bug

I noticed that the applied prompt template is incorrect because the messages are not parsed correctly. More specifically, in openai_api/adapter.py we use the Hugging Face tokenizer to apply the chat template:

prompt_ids = tokenizer_manager.tokenizer.apply_chat_template(
    request.messages, tokenize=True, add_generation_prompt=True
)

But request.messages is still a list of Pydantic models at this point, so the Hugging Face tokenizer serializes it incorrectly.

For example, if the request.messages is:

[ChatCompletionMessageUserParam(role='user', content=[ChatCompletionMessageContentImagePart(type='image_url', image_url=ChatCompletionMessageContentImageURL(url='https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png', detail='auto')), ChatCompletionMessageContentTextPart(type='text', text='Describe this image')]), ChatCompletionMessageGenericParam(role='assistant', content='This image is really fun.'), ChatCompletionMessageUserParam(role='user', content='Tell me a story about the image')]

We get the following prompt IDs and decoded prompt:

[128000, 128006, 882, 128007, 271, 58, 16047, 34290, 2097, 2831, 1945, 5920, 5930, 1151, 1843, 2975, 518, 2217, 2975, 28, 16047, 34290, 2097, 2831, 1945, 3222, 6659, 1151, 2485, 1129, 1059, 52027, 916, 14, 2034, 75, 34796, 14, 2034, 5317, 15711, 22158, 29647, 3592, 518, 7872, 1151, 3989, 47359, 13149, 34290, 2097, 2831, 1199, 5920, 5930, 1151, 1342, 518, 1495, 1151, 75885, 420, 2217, 52128, 128009, 128006, 78191, 128007, 271, 2028, 2217, 374, 2216, 2523, 13, 128009, 128006, 882, 128007, 271, 41551, 757, 264, 3446, 922, 279, 2217, 128009, 128006, 78191, 128007, 271]
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

[ChatCompletionMessageContentImagePart(type='image_url', image_url=ChatCompletionMessageContentImageURL(url='https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png', detail='auto')), ChatCompletionMessageContentTextPart(type='text', text='Describe this image')]<|eot_id|><|start_header_id|>assistant<|end_header_id|>

This image is really fun.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell me a story about the image<|eot_id|><|start_header_id|>assistant<|end_header_id|>
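
The same behavior is easy to reproduce directly against the tokenizer. A minimal sketch (TextPart and UserMessage here are hypothetical stand-ins for sglang's actual message models): Jinja's subscript operator falls back to attribute access, so message['role'] still resolves on a Pydantic object, but rendering message['content'] just calls str() on the list of parts:

from typing import List, Union

from pydantic import BaseModel
from transformers import AutoTokenizer

# Hypothetical stand-ins for sglang's Pydantic message models.
class TextPart(BaseModel):
    type: str = "text"
    text: str

class UserMessage(BaseModel):
    role: str = "user"
    content: Union[str, List[TextPart]]

tokenizer = AutoTokenizer.from_pretrained("lmms-lab/llama3-llava-next-8b")

# A plain dict with string content renders as expected.
print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "Describe this image"}],
    tokenize=False, add_generation_prompt=True,
))

# A Pydantic model's list content is str()-ified by the Jinja template,
# so the repr of each TextPart appears in the prompt verbatim.
print(tokenizer.apply_chat_template(
    [UserMessage(content=[TextPart(text="Describe this image")])],
    tokenize=False, add_generation_prompt=True,
))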

I suspect this happens because Hugging Face does not know we are passing Pydantic data models: the chat template simply renders the content field of each message, which ends up being a stringified version of the Pydantic object.
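
One possible direction for a fix (just a sketch, not a tested patch) is to dump the Pydantic models back to plain dicts before applying the template, e.g. with model_dump(). Note this only repairs the repr leak; for list-style multimodal content the stock Llama 3 template would still stringify the part dicts, so a multimodal-aware template is still needed:

# Sketch: convert Pydantic message models back to plain dicts so the
# Jinja chat template sees the same structure the OpenAI client sent.
messages = [
    m.model_dump() if hasattr(m, "model_dump") else m
    for m in request.messages
]
prompt_ids = tokenizer_manager.tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True
)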

Reproduction

python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000

Client script:

import openai
import time

working_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png"
                },
            },
            {"type": "text", "text": "Describe this image"},
        ],
    },
    {
        "role": "assistant",
        "content": "This image is really fun.",
    },
    {
        "role": "user",
        "content": "Tell me a story about the image",
    },
]

client = openai.Client(api_key="EMPTY", base_url="http://127.0.0.1:30000/v1")
start = time.time()
response = client.chat.completions.create(
    model="llama3-llava-next-8b",
    messages=working_messages,
    temperature=0.3,
    max_tokens=512,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Environment

Python: 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA L4
GPU 0 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.183.01
PyTorch: 2.4.0+cu121
sglang: 0.2.14.post2
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.65.0
numpy: 1.23.5
aiohttp: 3.9.5
fastapi: 0.109.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 23.0
PIL: 9.2.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.22.0
uvloop: 0.19.0
zmq: 26.0.3
vllm: 0.5.5
multipart: 0.0.9
openai: 1.42.0
anthropic: 0.34.1
NVIDIA Topology: 
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-15    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576
merrymercy commented 2 months ago

If you use lmms-lab/llama3-llava-next-8b, you should not use the chat template from the Hugging Face tokenizer, because that template does not support images.

The correct way is to use the custom chat template for llava-next. You can specify it when you launch the server: https://github.com/sgl-project/sglang/blob/55f5976b42d736f3dfe2f8f9b91a6536c212744a/README.md?plain=1#L246-L247
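
For example, something like the following (the --chat-template value llava_llama_3 is my reading of sglang's registered template names; double-check it against the README lines linked above):

# Template name assumed; verify against the README lines linked above.
python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --chat-template=llava_llama_3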