Closed gsolard closed 4 months ago
A PR that will soon be merged allows limiting the maximum string length in this section, which will improve performance.
When using the function build_token_enforcer_tokenizer_data on the bloom tokenizer (https://huggingface.co/bigscience/bloom), it takes a really long time to finish.
The problem comes from this loop: https://github.com/noamgat/lm-format-enforcer/blob/7cfb693495e9ba3305e230cc62e05b383b6c717a/lmformatenforcer/tokenizerprefixtree.py#L72
One decoded token in the bloom tokenizer has length 600, and several have lengths above 100 (compared to a maximum of 16 for llama, for example), so the nested loop takes a very long time to complete.
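To illustrate why long decoded tokens hurt, here is a minimal sketch of the kind of nested loop described above — not the actual lm-format-enforcer code, and the class/function names (`TokenizerPrefixTreeNode`, `build_prefix_tree`) are hypothetical. Each decoded token string is inserted into a character-level prefix tree, so the inner loop's cost is proportional to the decoded length: a single 600-character bloom token costs roughly 40x more per insertion than a 16-character llama token.

```python
# Hypothetical sketch of a character-level token prefix tree build
# (simplified; not the real lm-format-enforcer implementation).

class TokenizerPrefixTreeNode:
    def __init__(self):
        self.children = {}   # char -> child node
        self.tokens = []     # token ids whose decoded string passes through here

def build_prefix_tree(decoded_tokens):
    """decoded_tokens: dict mapping token id -> decoded string."""
    root = TokenizerPrefixTreeNode()
    for token_id, text in decoded_tokens.items():  # outer loop: vocabulary size
        node = root
        for ch in text:                            # inner loop: decoded length
            node = node.children.setdefault(ch, TokenizerPrefixTreeNode())
            node.tokens.append(token_id)
    # total work is sum(len(text) for all tokens), so a handful of very
    # long decodings (600 chars in bloom vs. <=16 in llama) dominate the
    # build time; capping the max string length bounds the inner loop
    return root

# tiny usage example with made-up decoded tokens
tree = build_prefix_tree({0: "the", 1: "then", 2: "x" * 600})
```

Capping the maximum string length (as the merged PR does) bounds the inner loop, so pathological tokens no longer dominate the build.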