Add utf-8 character support to JsonSchemaParser KEYS

noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model

MIT License


Add utf-8 character support to JsonSchemaParser KEYS #30

Open sanixa opened 7 months ago

sanixa commented 7 months ago

Addition by library author start

This issue was originally about UTF-8 character support. The problem is not with string values but with string keys: non-ASCII keys are not currently supported by the library. The issue has been retitled to request that support. Vote if interested!

Addition by library author end

Is UTF-8 character support possible, for example for Chinese?

In my case, I want the model to answer in Chinese characters, in JSON format or plain text, and possibly to limit the output to a Unicode range such as [\u4e00-\u9fa5].
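To make the range above concrete, here is a minimal sketch of what `[\u4e00-\u9fa5]` accepts when used with Python's standard `re` module. It only checks finished strings and does not involve lm-format-enforcer; the helper name `is_cjk_only` is made up for illustration.

```python
import re

# The character class quoted above: part of the CJK Unified Ideographs block.
CJK_ONLY = re.compile(r"[\u4e00-\u9fa5]+")


def is_cjk_only(text: str) -> bool:
    """Return True when text is non-empty and every character is in the range."""
    return CJK_ONLY.fullmatch(text) is not None


print(is_cjk_only("姓氏"))           # True
print(is_cjk_only("在nba工作幾季"))  # False, because of the ASCII letters
```

Note that the range excludes ASCII, so a mixed string such as the second one would need a wider character class.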

I have tried multiple models with the vLLM integration, such as teknium_OpenHermes-2-Mistral-7B and TheBloke_airoboros-l2-70B-gpt4-1.4.1-AWQ. None of them can answer in Chinese characters.

Thanks.

noamgat commented 7 months ago

Can you check whether you are able to generate Chinese characters using JSON Schema + Hugging Face transformers (you can use the Colab notebook to try it online)?

If so, it might be related to the vLLM integration, as I have generated Unicode characters (emojis) with a JSON schema. This should be possible, at least in JSON Schema mode (I'm not sure about regex; it's harder there).

sanixa commented 7 months ago

Here is my test notebook.

I was able to get Chinese output with both transformers and vLLM. To get Chinese output, I had to change both the prompt and the JSON schema keys to Chinese.

However, as the test notebook above shows, putting some Chinese characters in a JSON schema key can cause a failure. At least "姓氏" and "在nba工作幾季" led to failures in my tests.
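For readers without access to the notebook, a schema with those two keys might look roughly like the sketch below. The property types and the `required` list are assumptions for illustration, not taken from the notebook. The snippet uses only the standard `json` module and shows that the keys are non-ASCII and how they serialize.

```python
import json

# Hypothetical schema: only the two key names come from the report above.
schema = {
    "type": "object",
    "properties": {
        "姓氏": {"type": "string"},            # assumed type
        "在nba工作幾季": {"type": "integer"},  # assumed type
    },
    "required": ["姓氏", "在nba工作幾季"],
}

# Keys that contain at least one non-ASCII character.
non_ascii_keys = [key for key in schema["properties"] if not key.isascii()]

escaped = json.dumps(schema)                      # keys written as \uXXXX escapes
literal = json.dumps(schema, ensure_ascii=False)  # keys written as raw characters

# Both serializations describe the same schema.
assert json.loads(escaped) == json.loads(literal) == schema
print(non_ascii_keys)
```

Either serialization is valid JSON, so the failure reported here concerns generating the key characters, not parsing the schema itself.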

noamgat commented 7 months ago

This is indeed a limitation - due to how out-of-tokenizer-vocabulary characters work in LLMs, it will be a challenge to support them in the current design of lm-format-enforcer. I renamed the task to become the feature request. Vote if interested!
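One way to picture the limitation, assuming a tokenizer that falls back to single-byte tokens for characters missing from its vocabulary (an assumption about the models involved, not something stated in this thread): a character such as "姓" takes three bytes in UTF-8, and none of those bytes is a valid character on its own. The sketch below uses plain Python and no tokenizer; both helper names are hypothetical.

```python
def utf8_byte_pieces(ch: str) -> list[bytes]:
    """Split a character into its individual UTF-8 bytes."""
    return [bytes([b]) for b in ch.encode("utf-8")]


def decodes_alone(piece: bytes) -> bool:
    """Return True when the piece is valid UTF-8 by itself."""
    try:
        piece.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


pieces = utf8_byte_pieces("姓")
print(len(pieces))                          # 3
print([decodes_alone(p) for p in pieces])   # [False, False, False]
print(b"".join(pieces).decode("utf-8"))     # 姓
```

A parser that validates output one character at a time has nothing to validate until all three pieces have arrived, which is consistent with the difficulty described above.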

vanryan commented 4 months ago

Is there any way or direction to support UTF-8? It would be great if this feature could be added. I'm happy to discuss and learn about the details. Thanks.

ckhfor commented 3 months ago

Can we change the character-based parser to a token_id-based one?
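A related, smaller idea is to keep a character-based parser but buffer byte-level pieces until they form a complete character. The sketch below is not the library's design; `Utf8CharBuffer` is a hypothetical name, and it assumes the pieces arrive as raw bytes. It relies on the standard library's incremental UTF-8 decoder.

```python
import codecs


class Utf8CharBuffer:
    """Collect byte pieces and release text only once characters are complete."""

    def __init__(self) -> None:
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def feed(self, piece: bytes) -> str:
        # Returns "" while a multi-byte character is still incomplete.
        return self._decoder.decode(piece)


buffer = Utf8CharBuffer()
outputs = [buffer.feed(piece) for piece in (b"\xe5", b"\xa7", b"\x93")]
print(outputs)  # ['', '', '姓']
```

A character-level parser could then be advanced only when `feed` returns a non-empty string. Whether that fits the enforcer's token filtering is an open question here.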