Closed chengmonk closed 1 month ago
Curious to know as well
I looked at the source code, and it seems to be related to regular expression matching. In UTF-8, English characters are one byte, while Chinese characters take multiple bytes (usually three).
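To illustrate the byte-length point, here is a small standard-library check. It is only a sketch of how a pattern applied to raw bytes can disagree with one applied to code points. It does not show how guidance itself matches strings.

```python
import re

text = "直接"  # two CJK characters
utf8 = text.encode("utf-8")

print(len(text))  # number of code points
print(len(utf8))  # number of UTF-8 bytes

# On str, "." matches one code point, so {1,2} covers the whole string.
assert re.fullmatch(r".{1,2}", text) is not None

# On bytes, "." matches one byte, so the same bound is too small.
assert re.fullmatch(rb".{1,2}", utf8) is None
assert re.fullmatch(rb".{1,6}", utf8, re.DOTALL) is not None
```

A matcher that counts bytes where the schema author meant characters would therefore reject short Chinese strings that look valid.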
Thanks a bunch for bringing this to our attention. Will definitely dig into this -- I suspect you're right that the regular expressions for JSON strings are a bit naive and are causing the problem.
Would you mind sending a short snippet of JSON that the model should be able to produce but is unable to? Appreciate it!
llama.cpp added Unicode support here: https://github.com/ggerganov/llama.cpp/pull/2553
{
  "document_id": 404,
  "causality_list": [
    {
      "causality_type": "直接",
      "cause": {
        "actor": "移动卫星公司",
        "class": "科技发展",
        "action": "获得",
        "time": "",
        "location": "",
        "object": "美国专利和商标办公室颁发的第7181181号专利"
      },
      "effect": {
        "actor": "该专利",
        "class": "科技发展",
        "action": "覆盖",
        "time": "",
        "location": "",
        "object": "多频带/多模式卫星无线电话通信系统的关键技术"
      }
    },
    {
      "causality_type": "直接",
      "cause": {
        "actor": "该专利",
        "class": "科技发展",
        "action": "覆盖",
        "time": "",
        "location": "",
        "object": "多频带/多模式卫星无线电话通信系统的关键技术"
      },
      "effect": {
        "actor": "公司",
        "class": "科技发展",
        "action": "决定使用",
        "time": "",
        "location": "",
        "object": "此技术建造下一代先进卫星系统"
      }
    }
  ]
}
The above is the data I want to generate. I used the following JSON schema but was unable to get the Chinese content, only the English version.
{
  "type": "object",
  "properties": {
    "causality_list": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "causality_type": {
            "type": "string",
            "pattern": ".{1,10}"
          },
          "cause": {
            "type": "object",
            "properties": {
              "actor": {
                "type": "string",
                "pattern": ".{1,10}"
              },
              "class": {
                "type": "string",
                "pattern": ".{1,10}"
              },
              "action": {
                "type": "string",
                "pattern": ".{1,10}"
              },
              "time": {
                "type": "string",
                "pattern": ".{1,10}"
              },
              "location": {
                "type": "string",
                "pattern": ".{1,10}"
              },
              "object": {
                "type": "string",
                "pattern": ".{1,10}"
              }
            }
          },
          "effect": {
            "type": "object",
            "properties": {
              "actor": {
                "type": "string",
                "pattern": ".{1,10}"
              },
              "class": {
                "type": "string",
                "pattern": ".{1,10}"
              },
              "action": {
                "type": "string",
                "pattern": ".{1,10}"
              },
              "time": {
                "type": "string",
                "pattern": ".{1,10}"
              },
              "location": {
                "type": "string",
                "pattern": ".{1,10}"
              },
              "object": {
                "type": "string",
                "pattern": ".{1,10}"
              }
            }
          }
        }
      }
    }
  }
}
By the way, how should this issue be resolved? I hope to use Guidance in an upcoming data mining competition. It's a great tool.
Hey @chengmonk I did some testing on my side. Guidance has recently undergone some big changes to JSON handling, none of which are in the last official release (0.1.16), and only some of which are in the latest pre-release (0.2.0rc1). One such unreleased improvement is full (well, fingers crossed 😅) unicode support inside of JSON strings.
If you're daring, I'd suggest installing guidance directly from the main branch (pip install git+https://github.com/guidance-ai/guidance).
If you do so, I'd really appreciate any feedback. But I believe it should solve your problem, with the caveat that some of the strings in your example above are longer than 10 characters and are therefore invalid under your schema.
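To see which values are affected, the pattern can be checked with Python's re module. This sketch assumes the pattern is applied to the whole string (re.fullmatch), which seems to be the reading the comment above has in mind. That is an assumption about guidance's behavior, not something confirmed here. A standard JSON Schema validator treats "pattern" as an unanchored search, so under that reading only the empty strings would fail. The sample values are copied from the data posted above.

```python
import re

PATTERN = re.compile(r".{1,10}")

# A few values taken from the example data in this thread.
samples = {
    "causality_type": "直接",
    "actor": "移动卫星公司",
    "time": "",
    "object": "美国专利和商标办公室颁发的第7181181号专利",
}

def check(value):
    """Return (length, full_match_ok, search_ok) for one string value."""
    return (
        len(value),
        PATTERN.fullmatch(value) is not None,
        PATTERN.search(value) is not None,
    )

results = {key: check(value) for key, value in samples.items()}
for key, (length, full_ok, search_ok) in results.items():
    print(f"{key}: length={length} fullmatch={full_ok} search={search_ok}")
```

Under the full-match reading, the empty "time" and "location" values and the long "object" values fail. A looser bound (for example ".{0,40}", an illustrative value only) or no "pattern" at all would admit them.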
Hoping for another official release soon. Please let us know if you need any more help -- I'm really happy you're enjoying guidance as a tool and want to use it in a competition! Exciting!
@Saibo-creator maybe you can try as well? :)
Works well on my side with the current main branch @917fe353bf
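import guidance

# Load a local Transformers model and turn off echoing of the generation.
lm = guidance.models.Transformers(model="microsoft/Phi-3.5-mini-instruct")
lm.echo = False

# Schema with non-ASCII property names.
json_schema = {
    "type": "object",
    "properties": {
        "姓名": {
            "type": "string"
        },
        "年龄": {
            "type": "integer"
        }
    },
    "required": ["姓名", "年龄"],
    "additionalProperties": False
}

# Constrain generation to the schema and read back the captured JSON string.
lm = lm + guidance.json(name="unicode_json", schema=json_schema)
print(lm["unicode_json"])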
I got
{
  "姓名": "张三",
  "年龄": 25
}
Fantastic, thanks for checking!
Note to self: add explicit unicode tests to the JSON test suite.
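One possible shape for such a test, as a sketch only: it uses just the standard library and hard-codes the output reported above in place of a real model call. The helper name check_person is hypothetical and not part of guidance's test suite.

```python
import json

# Output reported earlier in this thread; a real test would obtain it from the model.
generated = '{"姓名": "张三", "年龄": 25}'

def check_person(raw):
    """Parse raw JSON and verify the two required keys and their types."""
    obj = json.loads(raw)
    assert set(obj) == {"姓名", "年龄"}
    assert isinstance(obj["姓名"], str)
    assert isinstance(obj["年龄"], int)
    return obj

person = check_person(generated)

# The same object written with \u escapes must parse to an identical value.
escaped = json.dumps(person, ensure_ascii=True)
assert "\\u" in escaped
assert json.loads(escaped) == person
```

Checking both the literal and the escaped form covers the two ways a JSON string can carry non-ASCII characters.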
That's so cool, it's working fine now. Truly a great tool.
I've been using the json() function from the guidance library, and while it generates content that adheres to the provided JSON schema, the output is always in English. I'm unable to generate JSON content in other languages, such as Chinese.
Steps to reproduce:
with assistant():
    with guidance.silent():
        lm += json(name='res2', schema=json_schema)
        resall.append(lm['res2'])
Even after trying various prompts to instruct the model to generate content in Chinese, the output remains in English. It seems that the json() function does not currently provide an option to control the language of the generated content.
My main questions are:
Is there a way to make the json() function generate content in a language other than English (e.g., Chinese)?
If it's not currently supported, are there any plans to add support for multi-language content generation in the future?
I would appreciate any clarification or potential solutions regarding this issue. Thank you for developing such a powerful tool!