sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sglang.readthedocs.io/en/latest/
Apache License 2.0

[Bug] When a stop string starts with "\n", SGLang drops "\n" characters during inference #956

Closed nstl-zyb closed 1 month ago

nstl-zyb commented 1 month ago

Describe the bug

When using SGLang as the inference framework, if a string in the stop parameter starts with "\n", SGLang drops the line breaks during inference.

E.g. with prompt = 请换行输出1-10个数字 ("Print the numbers 1-10, one per line") and stop = ['<|endoftext|>', '<|im_end|>', '<|im_start|>'], the output is:

1
2
3
4
5
6
7
8
9
10

With the same prompt and stop = ['\n<|endoftext|>', '<|im_end|>', '<|im_start|>'], the output is:

12345678910

The "\n" can be followed by any character; as long as a stop string begins with "\n", the output loses all line breaks.

Reproduction

OS: Linux x64
GPU: A100
Python: 3.10
sglang: 0.2.7
LLM model: Qwen2-72B-lora-awq-4bit

Commands:

python -m fastchat.serve.controller --host localhost --port 44000

python -m fastchat.serve.vllm_worker --model-path ${MODEL_PATH} --max-model-len 8192 --worker-address "http://0.0.0.0:22006" --port 22006 --model-names "qwen-latest" --controller-address "http://localhost:44000"

python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 21003 --controller-address "http://localhost:44000"

Then run:

"""
# Uses the legacy openai<1.0 client API (openai.ChatCompletion).
import openai


def test_open_ai(prompt: str, stream: bool = False, model: str = "qwen-latest"):
    openai.api_key = "LTAI5t6C5QzrRfy5A4Ug4ujD"  # Not supported yet
    openai.api_base = "http://127.0.0.1:21003/v1"
    completion = openai.ChatCompletion.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.7,
        top_p=1.0,
        n=1,
        max_tokens=None,
        stream=stream,
        presence_penalty=0.0,
        frequency_penalty=0.0,
        user=None,
        meta={},
        service="sas",
        scenario="Chat",
        stop_token_ids=[151643, 151644, 151645],
        # A stop string starting with '\n' triggers the missing-newline bug.
        stop=['\n<|endoftext|>', '<|im_end|>', '<|im_start|>'],
        max_new_tokens=8192,
    )

    if not stream:
        answer_md = completion.choices[0].message.content
        print(answer_md)
        return answer_md
    else:
        pass  # streaming path not exercised in this repro
"""

Environment

Python: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda-12.2/
NVCC: Cuda compilation tools, release 12.2, V12.2.140
CUDA Driver Version: 535.183.01
PyTorch: 2.3.1+cu121
sglang: 0.2.7
flashinfer: 0.1.3
requests: 2.32.3
tqdm: 4.66.4
numpy: 1.26.4
aiohttp: 3.10.0
fastapi: 0.111.1
hf_transfer: 0.1.8
huggingface_hub: 0.23.4
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.3
uvloop: 0.19.0
zmq: 26.0.3
vllm: 0.5.3.post1
openai: 1.37.1
anthropic: 0.32.0
NVIDIA Topology: 
    GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV12    NV12    NV12    0-63    0       N/A
GPU1    NV12     X  NV12    NV12    0-63    0       N/A
GPU2    NV12    NV12     X  NV12    0-63    0       N/A
GPU3    NV12    NV12    NV12     X  0-63    0       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 65535
nstl-zyb commented 1 month ago

This is a FastChat bug, not an SGLang bug. Closing.