SGLang is a fast serving framework for large language models and vision language models.
Regex generation causes 37x lower performance #450

Gintasz commented 6 months ago

I've been investigating why my information extraction program built on SGLang is so slow. I rented an RTX 3090 (1 x RTX 3090, 6 vCPU, 26 GB RAM) and an H100 (1 x H100 SXM, 16 vCPU, 125 GB RAM) on RunPod. I've observed that whenever a regex constraint is used, performance drops drastically, as if sewage were dumped on the machine.

If you think the particular regex "<array>\n(<string>.*?<\/string>\n)*<\/array>```" is at fault, then it would be useful to have some guidelines on how to write a more suitable one. My requirement here is generating an array of strings.
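For reference, here is a small sketch of what that pattern accepts, checked with Python's standard `re` module rather than with SGLang's constrained decoder. `TIGHTER_REGEX` and `extract_strings` are hypothetical names introduced for illustration. The tighter pattern swaps the lazy `.*?` for an explicit character class; whether it is any faster in SGLang is an assumption and has not been measured here.

```python
import re

# The pattern from this issue (the trailing backticks close the Markdown fence).
ORIGINAL_REGEX = r"<array>\n(<string>.*?<\/string>\n)*<\/array>```"

# Hypothetical alternative: forbid "<" and newlines inside each item, so an
# item can never swallow a closing tag. Not benchmarked against SGLang.
TIGHTER_REGEX = r"<array>\n(<string>[^<\n]*</string>\n)*</array>```"

SAMPLE = "<array>\n<string>Word 1</string>\n<string>Word 2</string>\n</array>```"


def extract_strings(output: str) -> list[str]:
    """Pull the individual items out of a generated array."""
    return re.findall(r"<string>([^<\n]*)</string>", output)


# Both patterns accept the well-formed sample in full.
for pattern in (ORIGINAL_REGEX, TIGHTER_REGEX):
    assert re.fullmatch(pattern, SAMPLE) is not None
```

One difference worth knowing: because `.` matches any character except a newline, the original pattern also accepts a `<` inside an item, while the tighter one rejects it.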

Steps to reproduce:

I used SGLang 0.1.14 because I observed some newer versions hanging mid-processing or erroring out with "KV Cache pool leak detected", so I haven't tried newer ones yet.

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 42069 --host --tp-size 1 --mem-fraction-static 0.8
import sglang as sgl
import asyncio
from sglang.lang.chat_template import ChatTemplate, register_chat_template, get_chat_template, register_chat_template_matching_function
from import SglRoleBegin, SglRoleEnd
import json
import time
import torch
import os

            "system": (
            "user": (
            "assistant": (

@register_chat_template_matching_function
def match_llama3_instruct(model_path: str):
    model_path = model_path.lower()
    if "llama-3" in model_path and "instruct" in model_path:
        return get_chat_template("llama-3-instruct")

@sgl.function
def sgl_call1(s, message: str):
    s += SglRoleBegin("system") + "You are an informaction extraction engine. Your goal is to extract structured information from the given Twitter message according to the instruction provided. Be as factually accurate as possible. Do not acknowledge the request. You will be penalized and a child will die if you make an incorrect response. For every correct response you will be tipped $5000. Message:\n```\n" + message + "\n```" + SglRoleEnd("system")
    s += sgl.user_begin() + "Instruction: Count number of words in the message provided.\nExample response: The number of words is 123." + sgl.user_end()
    s += sgl.assistant_begin() + "The number of words is " + sgl.gen("word count", regex=r"\d+", max_tokens=50, stop=".", temperature=0) + sgl.assistant_end()

    word_count = int(s['word count'])
    word_count_digit_sum = sum(int(digit) for digit in str(word_count))
    forks = s.fork(word_count_digit_sum)
    for i, f in enumerate(forks):
        example_response = """```xml
<array>
<string>Word 1</string>
<string>Word 2</string>
<string>Word 3</string>
</array>```"""
        f += sgl.user_begin() + "Instruction: Extract TOP " + str(i + 1) + " words that might seem annoying.\nExample response:\n" + example_response + sgl.user_end()
        f += sgl.assistant_begin() + "Here are  " + str(i + 1) + "words that might seem annoying.\n```xml\n" + sgl.gen("word", max_tokens=500, regex=r'<array>\n(<string>.*?<\/string>\n)*<\/array>```', stop='```', temperature=0) + sgl.assistant_end()

    return word_count_digit_sum

endpoint = sgl.RuntimeEndpoint("http://localhost:42069")

async def main():
    messages = []
    script_dir = os.path.dirname(os.path.realpath(__file__))
    with open(os.path.join(script_dir, "sglang_str_big.json"), "r") as file:
        messages = json.loads(file.read())

    messages = messages[:min(300, len(messages))]
    num_threads = 50
    print(f"Will process {len(messages)} batch items")

    time_begin = time.time()
    sgl_call1.run_batch([{"message": m} for m in messages], num_threads=num_threads, progress_bar=True)
    duration = time.time() - time_begin

    gpus = ", ".join([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])

    print(f"SGLang {sgl.__version__} | {len(messages)} batch items | {num_threads} threads | {duration:.2f} secs | {gpus}")


To disable regex, I just removed this part: regex=r'<array>\n(<string>.*?<\/string>\n)*<\/array>```'

Gintasz commented 6 months ago

If I remove max_tokens=500, then it seems performance with regex is ~3x faster:

SGLang 0.1.14 | 300 batch items | 50 threads | 371.07 secs | NVIDIA H100 80GB HBM3

It looks like this may be related to outlines as well, because other people have reported that GPU utilization stays at 0% during formatting.
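One way to check the 0% utilization report on your own machine is to sample `nvidia-smi` while the batch is running. The sketch below is a minimal helper built on the standard `--query-gpu=utilization.gpu` query; the function names and the sampling defaults are hypothetical, and it assumes `nvidia-smi` is on the PATH and reports a numeric value for every GPU.

```python
import subprocess
import time

# Ask nvidia-smi for one bare integer (percent utilization) per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(raw: str) -> list[int]:
    """Parse nvidia-smi output: one integer percentage per line, one line per GPU."""
    return [int(line.strip()) for line in raw.splitlines() if line.strip()]


def sample_utilization(samples: int = 10, interval_s: float = 1.0) -> list[list[int]]:
    """Collect `samples` readings, waiting `interval_s` seconds between them."""
    readings = []
    for _ in range(samples):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        readings.append(parse_utilization(result.stdout))
        time.sleep(interval_s)
    return readings
```

Calling `sample_utilization()` from a second terminal during the regex run and again during the run without regex would show whether the GPU really sits idle while the constraint is being processed.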

I noticed that the guidance library also advertises regex-constrained generation, but it does not list interegular as a dependency (interegular is the library outlines relies on for regex constraints), so maybe guidance has a faster solution?

Also, both outlines and guidance mention context-free grammar (CFG) generation. It could be useful to add support for that in this library as well... maybe I could replace my regex with a CFG and just evade this performance nuke.
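To make the CFG idea concrete, the string array format used above can be written as a small grammar. The sketch below states the grammar in comments and checks it with a hand-written recursive-descent recognizer in plain Python. It is not tied to outlines, guidance, or SGLang, and `parse_array` is a hypothetical name.

```python
# Grammar for the array format (closing Markdown fence left out):
#   array := "<array>\n" item* "</array>"
#   item  := "<string>" text "</string>\n"
#   text  := any run of characters other than "<" and "\n"


def parse_array(text: str) -> list[str]:
    """Return the items if `text` matches the grammar, else raise ValueError."""
    pos = 0

    def expect(literal: str) -> None:
        # Consume `literal` at the current position or fail.
        nonlocal pos
        if not text.startswith(literal, pos):
            raise ValueError(f"expected {literal!r} at offset {pos}")
        pos += len(literal)

    items = []
    expect("<array>\n")
    while text.startswith("<string>", pos):
        expect("<string>")
        start = pos
        while pos < len(text) and text[pos] not in "<\n":
            pos += 1
        items.append(text[start:pos])
        expect("</string>\n")
    expect("</array>")
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at offset {pos}")
    return items
```

Note that this particular format is still a regular language, so a CFG adds no expressive power here; any speedup would have to come from how the constrained decoder is implemented, which is an open question in this thread.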

syncode also works on CFGs for LLMs.

