sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Bug] gen with regex: Token fusion between input and output, try to avoid this by removing the space at the end of the input. #1312

Open alanxmay opened 1 month ago

alanxmay commented 1 month ago


Describe the bug

Run the official JSON decoding example examples/usage/chinese_regex.py. The server prints a large number of repeated warnings:

...
[00:07:30 TP0] Token fusion between input and output, try to avoid this by removing the space at the end of the input.
[00:07:30 TP2] Token fusion between input and output, try to avoid this by removing the space at the end of the input.
[00:07:30 TP1] Token fusion between input and output, try to avoid this by removing the space at the end of the input.
[00:07:30 TP3] Token fusion between input and output, try to avoid this by removing the space at the end of the input.
[00:07:30 TP0] Token fusion between input and output, try to avoid this by removing the space at the end of the input.
[00:07:30 TP2] Token fusion between input and output, try to avoid this by removing the space at the end of the input.
...
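For context, the warning is about re-tokenization at the input/output boundary. Below is a minimal sketch (using the Hugging Face tokenizer directly, not SGLang's internal retokenization code) of why a trailing space at the end of the prompt tends to trigger it: the space usually merges with the first generated word into a single token, so the boundary tokens differ from the simple concatenation of input and output tokens.

```python
# Minimal sketch (plain Hugging Face tokenizer, not SGLang's code path) of why a
# trailing space in the prompt can force token fusion at the input/output boundary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")  # tokenizer of the model in the repro

pieces = tok.encode("The answer is ") + tok.encode("42")  # input ends with a space, output is "42"
fused = tok.encode("The answer is 42")                    # how the full text tokenizes naturally

# The two sequences usually differ near the boundary because " 42" tends to be a
# single token, so the engine has to merge (fuse) tokens across the boundary.
print(pieces)
print(fused)
```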

Reproduction

Model: Qwen2-72B-Instruct-GPTQ-Int4

Start backend:

docker run --gpus all \
    -p 9000:30000 \
    -v /models/:/models/ \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path /models/Qwen2-72B-Instruct-GPTQ-Int4 --host 0.0.0.0 --port 30000 --mem-fraction-static 0.64 --tp 4

Run examples/usage/chinese_regex.py.

Environment

Python: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A10
GPU 0,1,2,3 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.54.03
PyTorch: 2.4.0+cu121
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.112.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.43.0
anthropic: 0.34.1

NVIDIA Topology:
        GPU0  GPU1  GPU2  GPU3  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0    X     NODE  SYS   SYS   0-31,64-95     0              N/A
GPU1    NODE  X     SYS   SYS   0-31,64-95     0              N/A
GPU2    SYS   SYS   X     NODE  32-63,96-127   1              N/A
GPU3    SYS   SYS   NODE  X     32-63,96-127   1              N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576

zhyncs commented 1 month ago

@hnyls2002 may help take a look

msublee commented 1 month ago

I ran into the same problem. In my case, I passed tokenized input (input_ids) to the sglang server. I confirmed that the <bos> token was added a second time during encoding at L273, and that the condition at L283 was not satisfied.
https://github.com/sgl-project/sglang/blob/eda7c09048b39bd03b0e34aa16ffef9398072663/python/sglang/srt/managers/schedule_batch.py#L273
https://github.com/sgl-project/sglang/blob/eda7c09048b39bd03b0e34aa16ffef9398072663/python/sglang/srt/managers/schedule_batch.py#L283
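A minimal sketch (using a Hugging Face tokenizer directly, not the actual logic in schedule_batch.py) of how decoding and re-encoding already-tokenized input can duplicate the <bos> token:

```python
# Sketch only: shows the duplicated-BOS effect described above, using a plain
# Hugging Face tokenizer rather than SGLang's server-side code path.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")  # any tokenizer with a BOS token

input_ids = tok.encode("Hello world")   # already starts with the <bos> id
text = tok.decode(input_ids)            # "<s> Hello world" -- the special token stays in the text
reencoded = tok.encode(text)            # encode() prepends <bos> again by default

print(input_ids[:2])   # [<bos>, ...]
print(reencoded[:2])   # typically [<bos>, <bos>, ...] -- the token is now duplicated
```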

draqos commented 3 weeks ago

I am experiencing the same problem with Mistral Nemo Instruct. The generated output is endless zeroes, or it repeats the same phrase over and over.

dmakhervaks commented 2 weeks ago

I am also seeing issues with Llama 70B. I invoke the OpenAI-compatible API, and when passing a JSON schema it is much slower than without one.

The logs look like this:

[screenshot of server logs]
dmakhervaks commented 2 weeks ago

Notably, this happens when making an OpenAI-style request, but not when using an SGLang frontend-language function call.

Here is an example of the OpenAI-style request:

import json

from openai import OpenAI
from pydantic import BaseModel

# Assumed setup (not shown in the original snippet): an OpenAI-compatible client
# pointed at the SGLang server and the name of the served model.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")  # placeholder endpoint
model = "llama-70b"  # placeholder: the model name the server was launched with


# Pydantic model for the expected output (not passed to the request below;
# the schema is written out by hand instead).
class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]


def open_style_request():
    messages = [
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ]

    json_schema = json.dumps(
        {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "date": {"type": "string"},  # Assuming date is in string format
                "participants": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["name", "date", "participants"],
        }
    )

    def send_request():
        # Constrained decoding via response_format with a JSON schema; this is the
        # request path that triggers the warnings and the slowdown.
        return client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0,
            max_tokens=128,
            response_format={
                "type": "json_schema",
                "json_schema": {"name": "foo", "schema": json.loads(json_schema)},
            },
        )

    return send_request()
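For comparison, here is a rough sketch of the SGLang frontend-language path mentioned above, which does not show the problem. The regex is a stand-in, not the schema-derived pattern the server builds internally, and the endpoint URL is an assumption.

```python
import sglang as sgl

# Point the frontend language at the running server (URL is an assumption; adjust to your setup).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))


@sgl.function
def extract_event(s, text):
    s += "Extract the event information from: " + text + "\n"
    # Placeholder regex standing in for the JSON-schema-derived pattern.
    s += sgl.gen("event", max_tokens=128, regex=r"\{.+\}")


state = extract_event.run(text="Alice and Bob are going to a science fair on Friday.")
print(state["event"])
```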
havetc commented 2 weeks ago

> I am also seeing issues with Llama 70B. I invoke the OpenAI-compatible API, and when passing a JSON schema it is much slower than without one.
>
> The logs look like this: [screenshot of server logs]

I had the same problem; the only workaround I found to remove these warnings was to disable the jump-forward cache by launching the server with --disable-regex-jump-forward. It may also have increased the speed a little, though I'm less sure about that (and I was using a very complex schema where jump-forward couldn't have helped much anyway).

This is obviously not a real solution, as the jump-forward cache can be a valuable optimization.