Currently, building the initial token tree is inefficient and can make ingesting constraint tokens (for example, from a JSON schema) very slow. This is most evident with large-vocabulary models such as Cohere Command R, Gemma, and Qwen, where generation locks up and can take hours. These commits optimize that initial build when creating an ExLlamaV2 LMFE filter.
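For context, a minimal sketch (not the actual LMFE/ExLlamaV2 code) of the kind of structure involved: a prefix tree built once over the tokenizer vocabulary before constrained generation can begin. With vocabularies in the 250k+ range, this one-time build cost is what dominates startup.

```python
def build_token_trie(vocab):
    """Map each token string into a nested-dict trie in a single pass.

    `vocab` is a hypothetical {token_id: token_string} mapping used
    here only for illustration.
    """
    root = {}
    for token_id, token in vocab.items():
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[None] = token_id  # mark end of a complete token with its id
    return root

# Toy vocabulary; large-vocab models make this loop run hundreds of
# thousands of times, so per-insertion overhead matters.
trie = build_token_trie({0: "th", 1: "the", 2: "there"})
assert trie["t"]["h"][None] == 0
assert trie["t"]["h"]["e"][None] == 1
```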
Tests: Running Command R with a JSON schema in TabbyAPI on LMFE v0.9.5, generation would not start. With these commits, generation starts immediately.
References #75
Thanks to @turboderp for creating these commits.