sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sglang.readthedocs.io/en/latest/
Apache License 2.0

Llama-3 regex generation can get stuck in infinite generation beyond max_tokens and crash server (reproduction example) #414

Closed. Gintasz closed this issue 2 months ago.

Gintasz commented 4 months ago

Hey, I've just been trying to catch this bug for half a day...

I installed with pip install git+https://github.com/sgl-project/sglang.git@51104cd#subdirectory=python, which is the commit where 0.1.14 was mentioned.

I launched the server like this:

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 42069 --host 0.0.0.0 --tp-size 1 --mem-fraction-static 0.85

When the script below is run, the server gets stuck in an infinite generation loop that goes far beyond the specified max_tokens=1024, and then it crashes. In my app the crash was a CUDA device-side assertion error (same underlying problem), but in the reproduction below the error is RecursionError: maximum recursion depth exceeded while calling a Python object. Server log: logfile.txt

import sglang as sgl
import asyncio
import time

@sgl.function
def demo(s):
    s += sgl.system("You are a text string generation. Your goal is to generate a response to the user's instruction.")
    s += sgl.user_begin() + """I instruct you to make 10000 random text strings. Format your response like this:
```yaml
- "string1"
- "string2"
```""" + sgl.user_end()
    s += sgl.assistant_begin() + "```yaml\n" + sgl.gen("answer", temperature=0, regex=r'- "[^"\n]+"(?:\n- "[^"\n]+")*\n```|```', stop="```", max_tokens=1024)

endpoint = sgl.RuntimeEndpoint("http://REMOTEIP:PORT")
sgl.set_default_backend(endpoint)

async def main():
    state = demo.run()

asyncio.run(main())

If the regex is removed, there is no problem: generation stops when the token limit is reached.

If I change the model to mistralai/Mistral-7B-Instruct-v0.2, the issue does not appear.

That said, meta-llama/Meta-Llama-3-8B-Instruct does work with other prompts that use the same regex.

Gintasz commented 4 months ago

I've worked around the problem by replacing the YAML output format with an XML format, using the regex r"<array>\n(?:<string>.*?<\/string>\n)*<\/array>```".
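For reference, a sketch of the modified function: only the regex is exactly what I used, the prompt wording here is shortened and illustrative.

```python
# Sketch of the workaround: same structure as the repro above, but the
# assistant turn is constrained to an XML <array> of <string> elements.
import sglang as sgl

@sgl.function
def demo_xml(s):
    s += sgl.system("You are a text string generation. Your goal is to generate a response to the user's instruction.")
    s += sgl.user_begin() + "Make 10000 random text strings as an XML <array> of <string> elements." + sgl.user_end()
    s += sgl.assistant_begin() + sgl.gen(
        "answer",
        temperature=0,
        regex=r'<array>\n(?:<string>.*?<\/string>\n)*<\/array>```',
        stop="```",
        max_tokens=1024,
    )
```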

IliaZenkov commented 4 months ago

I had the same problem with Llama-3 refusing to stop despite using "<|eot_id|>", the stop string appropriate for the Llama-3-Instruct template. I added "assistant" as a stop string in the call to sgl.gen, and this seemed to abate the issue entirely. Can you give that a try with your YAML regex?
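Roughly like this, against your repro's gen call (a sketch; I believe stop also accepts a list of strings, otherwise pass "assistant" on its own):

```python
# Sketch: the gen call from the repro above, with "assistant" added as an
# extra stop string alongside the closing fence. Everything else is unchanged.
import sglang as sgl

@sgl.function
def demo_with_extra_stop(s):
    # (system/user turns as in the repro above)
    s += sgl.assistant_begin() + "```yaml\n" + sgl.gen(
        "answer",
        temperature=0,
        regex=r'- "[^"\n]+"(?:\n- "[^"\n]+")*\n```|```',
        stop=["```", "assistant"],  # "assistant" added here
        max_tokens=1024,
    )
```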

m0g1cian commented 4 months ago

> I had the same problem with Llama-3 refusing to stop despite using "<|eot_id|>", the stop string appropriate for the Llama-3-Instruct template. I added "assistant" as a stop string in the call to sgl.gen, and this seemed to abate the issue entirely. Can you give that a try with your YAML regex?

I've heard that you need to set global_config.skip_special_tokens_in_output to False in sglang.global_config. Then "<|eot_id|>" will start to be effective.
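Something along these lines before running the script (I haven't verified the exact import path, so treat it as a sketch):

```python
# Sketch: keep special tokens in the decoded output so "<|eot_id|>" can be
# matched as a stop string. The import path is assumed, not verified.
from sglang.global_config import global_config

global_config.skip_special_tokens_in_output = False

# ...then pass the end-of-turn token as a stop string in the gen call,
# e.g. stop=["```", "<|eot_id|>"].
```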

github-actions[bot] commented 2 months ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

ssenichev commented 1 month ago

> > I had the same problem with Llama-3 refusing to stop despite using "<|eot_id|>", the stop string appropriate for the Llama-3-Instruct template. I added "assistant" as a stop string in the call to sgl.gen, and this seemed to abate the issue entirely. Can you give that a try with your YAML regex?

> I've heard that you need to set global_config.skip_special_tokens_in_output to False in sglang.global_config. Then "<|eot_id|>" will start to be effective.

Faced the same issue; I tried setting global_config.skip_special_tokens_in_output to False, but nothing changed.