Token mis-manipulation for Chinese Characters

noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model

MIT License

1.42k stars 62 forks source link

Token mis-manipulation for Chinese Characters #108

Open yytang220 opened 4 months ago

yytang220 commented 4 months ago

Code is here https://github.com/noamgat/lm-format-enforcer/blob/main/lmformatenforcer/integrations/transformers.py#L69

This line will affect generation for chinese character. Use cleaned = decoded will temporarily solve this problem.

Since I don't know the whole logic, seeking permenant solution.

noamgat commented 4 months ago

Can you give a code sample of a reproduction of a problematic scenario that this line change solves?

yytang220 commented 4 months ago

Can you give a code sample of a reproduction of a problematic scenario that this line change solves?

before： after：

per-token decode using different decode_fn top: with rstrip, bottom: without rstrip

noamgat commented 4 months ago

Hi, thanks for the detailed example. In order to reproduce it in code, can you share the model + prompt + schema that you are using in order to generate text? Alternatively, the tokenizer + token sequence (numbers, not letters) that you get.

yytang220 commented 3 months ago

Hi, thanks for the detailed example. In order to reproduce it in code, can you share the model + prompt + schema that you are using in order to generate text? Alternatively, the tokenizer + token sequence (numbers, not letters) that you get.

token_list : [19788,818,2828,440,13501,4202,8798, 697, 121, 2256] tokenizer: deepseek-coder (https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base)