Closed aliencaocao closed 7 months ago
@aliencaocao Can you provide more details or code to reproduce your problem?
It's a common issue; GPT-4's JSON mode has the same problem. If a JSON field is a string and the model decides to include double quotes inside it, the field ends prematurely: the regex treats the first unescaped quote as the end of the string, but the model does not think the string has ended. It keeps trying to output something other than the next key of the JSON, and after sglang's logit masking the only tokens left are ones that show up as blanks in the console. To reproduce, use any JSON schema as the regex and instruct the model to output double quotes.
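A minimal sketch of the premature close described above, using Python's standard `re` module. The pattern is a deliberately simplified stand-in for a string field, not the regex sglang actually builds, and the sample text is made up for illustration.

```python
import re

# Hypothetical, simplified pattern for a JSON string value (no escape support).
STRING = re.compile(r'"[^"]*"')

# What the model "wants" to write: a string with unescaped inner quotes.
intended = '"He said "hi" to me"'

m = STRING.match(intended)
closed_at = m.end()

# From the regex's point of view the string is already finished here.
print(intended[:closed_at])
# Everything the model still wants to emit is no longer a valid continuation.
print(intended[closed_at:])
```

Once the pattern considers the string closed, a constrained decoder only allows whatever the schema puts next (a comma and the next key), so the rest of the model's intended text gets masked out.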
I think this is an inherent issue with enforcing JSON strings. There is no good way to tell when the model means to end a string, since everything inside it is just text. A JSON string can represent any character, even control characters like \n, as long as it is escaped. The model is not trained to escape characters the way JSON expects, so the decoder cannot tell whether a " was meant as the closing quote or as a literal \".
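For comparison, this is the escaping JSON expects, shown with Python's standard `json` module. The sample text and the `quote` key are hypothetical.

```python
import json

# Text containing both double quotes and a newline.
text = 'He said "hi"\n'

# json.dumps escapes the inner quotes as \" and the newline as \n.
encoded = json.dumps({"quote": text})
print(encoded)

# Decoding restores the original text exactly.
decoded = json.loads(encoded)["quote"]
print(decoded == text)
```

A model that emitted the escaped form on its own would stay inside the string, but nothing in the regex constraint forces it to choose `\"` over a bare `"`.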
@aliencaocao We can use the following regular expression to match string values in JSON. I am confused: why does the " still appear when this regular expression prohibits it? I ran into the same problem.
("(?:[^"\\\x00-\x1f\x7f-\x9f]|\\.)*"|null)
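A quick check with Python's standard `re` module, assuming the pattern is the usual form in which the character class excludes the backslash and `\\.` is the escape alternative (backslashes are easily lost when a regex is pasted outside a code block). The test strings are made up.

```python
import re

# Assumed form of the string-value pattern quoted in this comment.
STRING = re.compile(r'("(?:[^"\\\x00-\x1f\x7f-\x9f]|\\.)*"|null)')

escaped = r'"He said \"hi\""'    # inner quotes written as \"
unescaped = '"He said "hi""'     # inner quotes written as bare "

# Escaped quotes are consumed by the `\\.` branch, so the whole value matches.
print(STRING.fullmatch(escaped) is not None)

# A bare quote is excluded from the character class, so it can only act as
# the closing quote; the rest of the text no longer fits the pattern.
print(STRING.fullmatch(unescaped) is not None)
print(STRING.match(unescaped).group())
```

So the pattern does not forbid `"` from appearing in the output. It only forbids it inside the string body, which means the first bare `"` the model emits is accepted as the closing quote.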
@aliencaocao Thank you for your explanation. Do you know where the relevant code is? I would like to understand this issue in sglang more deeply.
As topic suggests.
Using pydantic -> regex
It causes the model to output blanks from then on, as the logit mask keeps being applied.
Is there any way to escape a " produced by the model in a JSON field?