noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

build_token_enforcer_tokenizer_data takes too long on some tokenizer #73

Closed gsolard closed 4 months ago

gsolard commented 4 months ago

When using the function build_token_enforcer_tokenizer_data on the bloom tokenizer (https://huggingface.co/bigscience/bloom), it takes a really long time to finish.

The problem comes from this loop: https://github.com/noamgat/lm-format-enforcer/blob/7cfb693495e9ba3305e230cc62e05b383b6c717a/lmformatenforcer/tokenizerprefixtree.py#L72

Indeed, one decoded token in the bloom tokenizer has length 600, and several have lengths greater than 100 (compared to a maximum of 16 for llama, for example). So the double loop takes a very long time to complete.
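To illustrate the cost (this is a hypothetical sketch, not the library's actual code): with a nested loop over the character positions of each decoded token, the work per token grows quadratically with its length, so a 600-character token is over a thousand times more expensive than a 16-character one.

```python
# Hypothetical illustration of why a double loop over the characters of
# each decoded token makes very long tokens expensive. The cost per
# token is len * (len + 1) / 2, i.e. quadratic in token length.

def nested_char_work(token: str) -> int:
    steps = 0
    for i in range(len(token)):
        for j in range(i, len(token)):  # inner pass over the remainder
            steps += 1
    return steps

print(nested_char_work("x" * 16))   # 136 steps
print(nested_char_work("x" * 600))  # 180300 steps
```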

noamgat commented 4 months ago

A PR that will soon be merged allows limiting the maximum string length in this section, which will improve performance.
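The mitigation described above could look roughly like the following sketch (the cap value and filtering function are illustrative assumptions, not the library's actual API): decoded tokens longer than a configurable cap are simply skipped before the expensive loop runs.

```python
# Hypothetical sketch of capping the maximum decoded-token length before
# the expensive loop. MAX_TOKEN_LEN and filter_long_tokens are
# illustrative names, not part of lm-format-enforcer's API.

MAX_TOKEN_LEN = 100

def filter_long_tokens(decoded_tokens, max_len=MAX_TOKEN_LEN):
    """Drop decoded tokens whose length exceeds the cap."""
    return [t for t in decoded_tokens if len(t) <= max_len]

tokens = ["hello", "x" * 600, "world"]
print(filter_long_tokens(tokens))  # ['hello', 'world']
```

Since extremely long tokens are rare and unlikely to matter for format enforcement, skipping them trades a negligible loss of coverage for a large reduction in build time.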
