noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Emojis unsupported (vLLM integration) #116

Open milesial opened 1 week ago

milesial commented 1 week ago

Hi, using version 0.10.3 with the llama3 tokenizer and vLLM, I can't seem to constrain the model to generate emojis.

curl --request POST \
  --url http://localhost:8000/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "messages": [
    {
      "content": "",
      "role": "user"
    }
  ],
  "guided_decoding_backend": "lm-format-enforcer",
  "guided_choice": ["🐈"],
  "temperature": 0.0,
  "top_p": 0.7,
  "max_tokens": 100,
  "stream": false
}'

[ERROR] Unknown LMFormatEnforcer Problem. Prefix: ''

Even though the tokenizer supports it:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tok.encode("🐈")
# [128000, 9468, 238, 230]

It might be related to multi-token characters; outlines had to deal with similar issues: https://github.com/outlines-dev/outlines/pull/738
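
For illustration, decoding the tokens one at a time shows that no single token maps to a complete character. A quick check (assuming the Hugging Face fast tokenizer, which substitutes U+FFFD for incomplete UTF-8 byte sequences):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
ids = tok.encode("🐈", add_special_tokens=False)  # [9468, 238, 230]
for tid in ids:
    # each token covers only part of the emoji's 4-byte UTF-8 encoding,
    # so decoding it in isolation yields the replacement character '�'
    print(tid, repr(tok.decode([tid])))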

noamgat commented 1 week ago

Yes, this is a known limitation of the approach taken by LM Format Enforcer. I will look into how the outlines PR works and see if we can adapt its approach. If anyone wants to take a crack at it, they are more than welcome :)
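
For reference, the core idea in the outlines PR is to match at the byte level rather than the character level. A minimal sketch of that idea (hypothetical helper, not lm-format-enforcer's actual internals): precompute each token's raw bytes, then allow any token whose bytes keep the generated byte prefix on the path to the target string, even if the token ends mid-character.

def allowed_next_tokens(token_bytes: dict[int, bytes],
                        target: bytes,
                        prefix: bytes) -> list[int]:
    # token_bytes: hypothetical precomputed id -> raw-bytes map for the vocab
    # target: the full allowed string, e.g. "🐈".encode("utf-8")
    # prefix: the bytes generated so far
    allowed = []
    for tid, tb in token_bytes.items():
        candidate = prefix + tb
        # accept tokens that stay on the byte path to `target`, even if
        # they stop in the middle of a multi-byte character
        if target.startswith(candidate):
            allowed.append(tid)
    return allowed

With this, the three tokens [9468, 238, 230] would each be permitted in turn when the target is "🐈".encode("utf-8"), since every intermediate byte prefix is a prefix of the target's bytes.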
