mgoin closed this 2 months ago
This works as intended. The user needs to pass in the tokenizer so we can calculate the full prompt token count as accurately as possible. Otherwise, we fall back on Llama 3.1 as a reasonable default that will not affect the numbers too much.
It's not entirely safe to assume that the model passed in will always be a publicly available model, or a name that matches a publicly available one. The intended default is for users to pass the tokenizer they are actually using to the tokenizer arg.
@markurtz can you require that a tokenizer is provided then? It seems like bad behavior to use Llama 3.1 silently.
How do we feel about following the standard behavior already adopted by most popular libraries these days (transformers, vllm, llm-foundry, lm-evaluation-harness, etc.), where the default is to load the tokenizer of the given model, but we leave a --tokenizer arg so a user can explicitly override it (in case they really want to try something exotic)? For example, if a user specifies model=mistralai/Mistral-7B-Instruct-v0.3, then we load its corresponding tokenizer.
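Roughly, the default resolution could look something like the sketch below (assuming Hugging Face hub ids and transformers' AutoTokenizer; resolve_tokenizer is just an illustrative name, not existing guidellm code):

```python
from typing import Optional

from transformers import AutoTokenizer, PreTrainedTokenizerBase


def resolve_tokenizer(model: str, tokenizer: Optional[str] = None) -> PreTrainedTokenizerBase:
    """Use the explicitly requested tokenizer if given, otherwise fall back to the model's own."""
    source = tokenizer if tokenizer is not None else model
    return AutoTokenizer.from_pretrained(source)


# e.g. model=mistralai/Mistral-7B-Instruct-v0.3 -> its own tokenizer is loaded
tok = resolve_tokenizer("mistralai/Mistral-7B-Instruct-v0.3")
```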
@eldarkurtic, thanks! I'm worried about people evaluating private models, or iterations on public models, and the system crashing with a more obscure error when it cannot access the tokenizer for a private model. Given that the tokenizers are relatively similar, the plan was that a Llama 3.1 base tokenizer would be accurate enough for most benchmarks if they did not supply one. Let me see if I can rework the logic in the main script to raise a helpful error in this case and instruct the user to pass in the tokenizer.
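Something along these lines is what I have in mind, as a rough sketch only (the function name and error message are illustrative, not the actual guidellm implementation):

```python
from typing import Optional

from transformers import AutoTokenizer


def load_tokenizer_or_fail(model: str, tokenizer: Optional[str] = None):
    source = tokenizer or model
    try:
        return AutoTokenizer.from_pretrained(source)
    except Exception as err:  # e.g. gated/private repo, or a name not on the Hub
        raise ValueError(
            f"Could not load a tokenizer from '{source}'. If the model is private "
            "or not hosted on the Hugging Face Hub, pass --tokenizer explicitly. "
            "Any compatible tokenizer path or hub id works, since it is only used "
            "to measure prompt lengths."
        ) from err
```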
My main concern with using a tokenizer from a different model is the possibility of introducing errors that are difficult for end users to identify. For example, if two models have different vocabulary sizes (Mistral with 32k and Llama with 128k), there is a high chance that the Llama tokenizer will produce invalid tokens for the Mistral model (all tokens > 32k). This could lead to indexing issues in the input embedding layer, as embedding vectors for tokens > 32k do not exist in the Mistral model.
Ah, @eldarkurtic, the results from the tokenizer are not passed through to the server. The tokenizer is purely used for calculating the prompt length in tokens so it sends the correct length of text to the server.
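In other words, the tokenizer is only used client-side to size the prompt text before it is sent, roughly like this (an illustrative sketch, not the actual guidellm code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")


def trim_to_token_count(text: str, target_tokens: int) -> str:
    """Cut the prompt text so it encodes to roughly `target_tokens` tokens."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids[:target_tokens])


# The server receives plain text of the desired token length, never token ids.
prompt = trim_to_token_count("some long source text " * 200, 128)
```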
Oh nice, then this is definitely not going to be an issue. Thanks for clarifying, @markurtz.
It looks like when setting up a default guidellm run, there is no error indicating that the tokenizer needs to be set as well. After running my sweep, I noticed in the output that a Llama tokenizer was used even though I was benchmarking a Mistral model.