sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0
5.9k stars 477 forks source link

JSON decoding using regex breaks if model outputs a " #258

Closed aliencaocao closed 7 months ago

aliencaocao commented 8 months ago

As topic suggests.

Using pydantic -> regex

It causes model to subsequently output blank as the logit mask goes on.

Is there any way to escape " produces by the model in a json field?

hnyls2002 commented 8 months ago

@aliencaocao Can you provide more details or code to reproduce your problem?

aliencaocao commented 8 months ago

Its a common issue as gpt4 json mode also have this problem. If a json field is string, and model decides to include double quotes in this string, it prematurely ends the field as the regex will match it as an ended string, but model does not think so, so it keep trying to output something other than the next key of the json, which after logit masking by sglang, becomes all tokens that appear as blank in console. Just have any json as regex and instruct the model to output double quotes will trigger this.

Qubitium commented 8 months ago

I think this is an inherent issue of json string enforcement. There is no good way to tell when a model output ends a "string" since everything is a string. json string can contain any character even special format ones like \n. The model is not trained to escape chars the way json expects so there is no way to tell apart " vs \".

DouHappy commented 7 months ago

Its a common issue as gpt4 json mode also have this problem. If a json field is string, and model decides to include double quotes in this string, it prematurely ends the field as the regex will match it as an ended string, but model does not think so, so it keep trying to output something other than the next key of the json, which after logit masking by sglang, becomes all tokens that appear as blank in console. Just have any json as regex and instruct the model to output double quotes will trigger this. @aliencaocao We can use the following regular expression to match string values in JSON. I am confused why the " still exists when it is prohibited in this regular expression? I also occured the same problem.

("(?:[^"\\x00-\x1f\x7f-\x9f]|\.)*"|null)

430d02c0663de8fca7432d9a03a27d1
DouHappy commented 6 months ago

Its a common issue as gpt4 json mode also have this problem. If a json field is string, and model decides to include double quotes in this string, it prematurely ends the field as the regex will match it as an ended string, but model does not think so, so it keep trying to output something other than the next key of the json, which after logit masking by sglang, becomes all tokens that appear as blank in console. Just have any json as regex and instruct the model to output double quotes will trigger this.

@aliencaocao Thank you for your explanation. Do you know where the relevant code is? I hope to have a deeper understanding of this issue in sglang.