@hnyls2002 may help take a look
I also experienced the same problem. In my case, I passed the tokenized input (input_ids) to the sglang server.
I confirmed that the <bos> token was added once more when encoding at L273, and that the condition at L283 was not satisfied.
https://github.com/sgl-project/sglang/blob/eda7c09048b39bd03b0e34aa16ffef9398072663/python/sglang/srt/managers/schedule_batch.py#L273
https://github.com/sgl-project/sglang/blob/eda7c09048b39bd03b0e34aa16ffef9398072663/python/sglang/srt/managers/schedule_batch.py#L283
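For illustration, here is a minimal sketch of the double-<bos> behavior when text is decoded and re-encoded with a Hugging Face tokenizer (the model name is only an example; the actual check lives in schedule_batch.py at the lines linked above):

from transformers import AutoTokenizer

# Any tokenizer that prepends <bos>; the model name here is only an example.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

ids = tok("Hello")["input_ids"]   # [<bos>, ...] -- BOS added once
text = tok.decode(ids)            # "<s> Hello" -- BOS survives as text
re_ids = tok(text)["input_ids"]   # [<bos>, <bos>, ...] -- BOS prepended again on re-encode
print(ids)
print(re_ids)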
I am experiencing the same problem with Mistral Nemo Instruct. The generated output is endless zeroes, or it repeats the same phrase over and over.
I am also seeing issues with Llama 70B. I invoke the OpenAI API, and when I pass a JSON schema it is much slower than without one.
The logs look like this:
Notably, this happens when making an OpenAI-style request, but not when using an SGLang frontend-language function call.
Here is an example of the OpenAI style:
import json

from openai import OpenAI
from pydantic import BaseModel

# `client` and `model` are assumed to be configured elsewhere, e.g.
# client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")


class CalendarEvent(BaseModel):
    # Pydantic model of the target structure (the request below uses a hand-written schema instead).
    name: str
    date: str
    participants: list[str]


def open_style_request():
    messages = [
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ]
    json_schema = json.dumps(
        {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "date": {"type": "string"},  # assuming the date is a plain string
                "participants": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["name", "date", "participants"],
        }
    )

    def send_request():
        return client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0,
            max_tokens=128,
            response_format={
                "type": "json_schema",
                "json_schema": {"name": "foo", "schema": json.loads(json_schema)},
            },
        )

    return send_request()
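For comparison, the SGLang frontend-language path that does not show the slowdown looks roughly like this (a hedged sketch in the spirit of examples/usage/chinese_regex.py; the regex is a simplified placeholder, not a full JSON-schema constraint, and the server address is assumed):

import sglang as sgl
from sglang import RuntimeEndpoint

# Placeholder regex loosely matching the CalendarEvent shape; a real run would
# derive the constraint from the actual schema.
event_regex = r'\{"name": "[^"]+", "date": "[^"]+", "participants": \["[^"]+"(, "[^"]+")*\]\}'

@sgl.function
def extract_event(s):
    s += sgl.system("Extract the event information.")
    s += sgl.user("Alice and Bob are going to a science fair on Friday.")
    s += sgl.assistant(sgl.gen("json_output", max_tokens=128, temperature=0, regex=event_regex))

sgl.set_default_backend(RuntimeEndpoint("http://localhost:30000"))  # assumed server address
state = extract_event.run()
print(state["json_output"])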
I had the same problem. The only workaround I found to remove these warnings was to disable the jump-forward cache with --disable-regex-jump-forward. It may also have increased the speed a little, though I'm less sure about that (I was using a very complex schema where jump-forward couldn't have helped much anyway).
It is obviously not a real solution, as the jump-forward cache could be a valuable optimization.
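For reference, the workaround amounts to adding the flag to the server launch command, for example (the model path and --tp value are taken from this report's setup and are otherwise just an example):

python -m sglang.launch_server --model-path Qwen/Qwen2-72B-Instruct-GPTQ-Int4 --tp 4 --disable-regex-jump-forward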
Checklist
Describe the bug
Run the official JSON decoding example, examples/usage/chinese_regex.py. The server prints too many logs with:
Reproduction
Model: Qwen2-72B-Instruct-GPTQ-Int4
Start backend:
Run examples/usage/chinese_regex.py.
Environment
Python: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A10
GPU 0,1,2,3 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.54.03
PyTorch: 2.4.0+cu121
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.112.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.43.0
anthropic: 0.34.1
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NODE  SYS   SYS   0-31,64-95    0              N/A
GPU1  NODE  X     SYS   SYS   0-31,64-95    0              N/A
GPU2  SYS   SYS   X     NODE  32-63,96-127  1              N/A
GPU3  SYS   SYS   NODE  X     32-63,96-127  1              N/A
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
ulimit soft: 1048576