Open sanixa opened 7 months ago
Can you check whether you are able to generate Chinese characters using JSON Schema + Hugging Face transformers (you can use the Colab notebook to try it online)?
If so, it might be related to the vLLM integration, as I have generated Unicode characters (emojis) with a JSON schema. This should be possible, at least in JSON Schema mode (not sure about regex, it's harder there).
Here is my test notebook.
I have successfully gotten Chinese output with both transformers and vLLM. To get Chinese output, I had to change both the prompt and the JSON schema keys to Chinese.
However, as shown in the test notebook above, putting certain Chinese characters in a JSON schema key can make generation fail. At least "姓氏" and "在nba工作幾季" led to failures in my tests.
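For readers without access to the notebook, a schema with Chinese keys might look roughly like the sketch below. The two key names are taken from the comment above; the value types and the `required` list are assumptions made only for illustration, and nothing here calls lm-format-enforcer itself. It also shows that the same key can appear in serialized JSON either literally or as `\uXXXX` escapes.

```python
import json

# Hypothetical schema: the key names come from the thread,
# the value types and "required" list are assumed for illustration.
schema = {
    "type": "object",
    "properties": {
        "姓氏": {"type": "string"},
        "在nba工作幾季": {"type": "integer"},
    },
    "required": ["姓氏", "在nba工作幾季"],
}

# Default serialization escapes non-ASCII characters as \uXXXX sequences.
escaped = json.dumps(schema)
# ensure_ascii=False keeps the Chinese characters as literal text.
literal = json.dumps(schema, ensure_ascii=False)

# Both serialized forms parse back to the same schema.
assert json.loads(escaped) == json.loads(literal) == schema
```

Either form is valid JSON, so an enforcer that only accepts one spelling of a key would reject output that a JSON parser considers equivalent.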
This is indeed a limitation. Because of how out-of-tokenizer-vocabulary characters work in LLMs, it will be a challenge to support them in the current design of lm-format-enforcer. I renamed the issue to turn it into a feature request. Vote if interested!
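To make the limitation more concrete, here is a minimal sketch that uses only the Python standard library. It assumes a tokenizer with byte-level fallback, where a character missing from the vocabulary is emitted as one token per UTF-8 byte; that is an assumption about the model's tokenizer, not something this thread confirms for any specific model. Under that assumption, a parser that advances one character at a time sees several tokens that do not each correspond to a full character.

```python
def utf8_byte_counts(text: str) -> list[tuple[str, int]]:
    """Return each character paired with its UTF-8 encoded length in bytes."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# The two schema keys reported as failing in this thread.
for key in ["姓氏", "在nba工作幾季"]:
    counts = utf8_byte_counts(key)
    total = sum(n for _, n in counts)
    print(key, counts, f"total bytes: {total}")
```

Each Chinese character in these keys occupies three bytes in UTF-8, while the ASCII letters occupy one, so a single key can span many more byte-level tokens than it has characters.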
Is there any way or direction to support UTF-8? It would be great if this feature could be added. Happy to discuss and learn about the details. Thanks.
Can we change the character-based parser to a token_id-based one?
Addition by library author start
This issue originally asked about UTF-8 character support. The problem is not with string values but with string keys, which are not currently supported by the library. This issue has been modified to request that support. Vote if interested!
Addition by library author end
Is UTF-8 character support possible? Or Chinese support?
In my case, I want the model to answer with Chinese characters in JSON format or plain text, and I may want to limit the output to a range of Unicode code points, such as [\u4e00-\u9fa5].
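As an illustration of that character range, the sketch below checks a string against it with Python's standard `re` module. The helper name `is_chinese_only` is hypothetical, and this only validates text after the fact; it does not show how to constrain generation in lm-format-enforcer or vLLM.

```python
import re

# Pattern from the comment above: one or more characters in U+4E00..U+9FA5.
CJK_ONLY = re.compile(r"[\u4e00-\u9fa5]+")


def is_chinese_only(text: str) -> bool:
    """Return True if text is non-empty and every character is in the range."""
    return CJK_ONLY.fullmatch(text) is not None


print(is_chinese_only("姓氏"))
print(is_chinese_only("在nba工作幾季"))
```

Note that this range excludes punctuation, digits, and Latin letters, and the CJK Unified Ideographs block itself extends to U+9FFF, so a wider range may be needed in practice.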
I have tried multiple models with the vLLM integration, such as teknium_OpenHermes-2-Mistral-7B and TheBloke_airoboros-l2-70B-gpt4-1.4.1-AWQ. None of them can answer with Chinese characters.
Thanks.