neuralmagic / guidellm

Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
Apache License 2.0

Wrong tokenizer used by default #37

Closed mgoin closed 2 months ago

mgoin commented 2 months ago

It looks like when setting up a default guidellm run, there is no error indicating that the tokenizer needs to be set as well. After running my sweep, I noticed in the output that a Llama tokenizer was used even though I was benchmarking a Mistral model.

│ │ Backend(type=openai_server, target=http://localhost:8000/v1, model=mistralai/Mistral-7B-Instruct-v0.3)                                                                  │ │
│ │ Data(type=emulated, source=None, tokenizer=neuralmagic/Meta-Llama-3.1-8B-FP8)   

Full output

guidellm --target "http://localhost:8000/v1" --model mistralai/Mistral-7B-Instruct-v0.3
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50.5k/50.5k [00:00<00:00, 1.79MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 34.2MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 335/335 [00:00<00:00, 5.62MB/s]
╭─ Benchmarks ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [15:58:06]   100% synchronous            (1.22 req/sec avg)                                                                                                                 │
│ [16:00:06]   100% throughput             (17.78 req/sec avg)                                                                                                                │
│ [16:00:47]   100% constant@3.06 req/s    (3.05 req/sec avg)                                                                                                                 │
│ [16:02:47]   100% constant@4.90 req/s    (4.87 req/sec avg)                                                                                                                 │
│ [16:04:47]   100% constant@6.74 req/s    (6.75 req/sec avg)                                                                                                                 │
│ [16:06:47]   100% constant@8.58 req/s    (8.53 req/sec avg)                                                                                                                 │
│ [16:08:48]   100% constant@10.42 req/s   (10.40 req/sec avg)                                                                                                                │
│ [16:11:02]   100% constant@12.26 req/s   (13.86 req/sec avg)                                                                                                                │
│ [16:12:47]   100% constant@14.10 req/s   (14.01 req/sec avg)                                                                                                                │
│ [16:14:47]   100% constant@15.94 req/s   (15.74 req/sec avg)                                                                                                                │
│ [16:16:47]   100% constant@17.78 req/s   (17.26 req/sec avg)                                                                                                                │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
  Generating report... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (11/11) [ 0:20:40 < 0:00:00 ]
╭─ GuideLLM Benchmarks Report (guidance_report.json) ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ╭─ Benchmark Report 1 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │
│ │ Backend(type=openai_server, target=http://localhost:8000/v1, model=mistralai/Mistral-7B-Instruct-v0.3)                                                                  │ │
│ │ Data(type=emulated, source=None, tokenizer=neuralmagic/Meta-Llama-3.1-8B-FP8)                                                                                           │ │
│ │ Rate(type=sweep, rate=None)                                                                                                                                             │ │
│ │ Limits(max_number=None requests, max_duration=120 sec)                                                                                                                  │ │
│ │                                                                                                                                                                         │ │
│ │                                                                                                                                                                         │ │
│ │ Requests Data by Benchmark                                                                                                                                              │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓                                                               │ │
│ │ ┃ Benchmark                  ┃ Requests Completed ┃ Request Failed ┃ Duration   ┃ Start Time ┃ End Time ┃                                                               │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩                                                               │ │
│ │ │ synchronous                │ 145/145            │ 0/145          │ 118.86 sec │ 15:58:06   │ 16:00:05 │                                                               │ │
│ │ │ asynchronous@3.06 req/sec  │ 366/366            │ 0/366          │ 119.85 sec │ 16:00:47   │ 16:02:47 │                                                               │ │
│ │ │ asynchronous@4.90 req/sec  │ 584/584            │ 0/584          │ 119.99 sec │ 16:02:47   │ 16:04:47 │                                                               │ │
│ │ │ asynchronous@6.74 req/sec  │ 803/803            │ 0/803          │ 119.03 sec │ 16:04:47   │ 16:06:46 │                                                               │ │
│ │ │ asynchronous@8.58 req/sec  │ 1022/1022          │ 0/1022         │ 119.78 sec │ 16:06:47   │ 16:08:47 │                                                               │ │
│ │ │ asynchronous@10.42 req/sec │ 1188/1188          │ 0/1188         │ 114.27 sec │ 16:08:48   │ 16:10:42 │                                                               │ │
│ │ │ asynchronous@12.26 req/sec │ 1455/1455          │ 0/1455         │ 105.00 sec │ 16:11:02   │ 16:12:47 │                                                               │ │
│ │ │ asynchronous@14.10 req/sec │ 1677/1677          │ 0/1677         │ 119.69 sec │ 16:12:47   │ 16:14:47 │                                                               │ │
│ │ │ asynchronous@15.94 req/sec │ 1887/1887          │ 0/1887         │ 119.87 sec │ 16:14:47   │ 16:16:47 │                                                               │ │
│ │ │ asynchronous@17.78 req/sec │ 856/856            │ 0/856          │ 49.59 sec  │ 16:16:47   │ 16:17:37 │                                                               │ │
│ │ │ throughput                 │ 725/725            │ 0/725          │ 40.77 sec  │ 16:00:06   │ 16:00:47 │                                                               │ │
│ │ └────────────────────────────┴────────────────────┴────────────────┴────────────┴────────────┴──────────┘                                                               │ │
│ │                                                                                                                                                                         │ │
│ │ Tokens Data by Benchmark                                                                                                                                                │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓                                             │ │
│ │ ┃ Benchmark                  ┃ Prompt  ┃ Prompt (1%, 5%, 50%, 95%, 99%)         ┃ Output ┃ Output (1%, 5%, 50%, 95%, 99%) ┃                                             │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩                                             │ │
│ │ │ synchronous                │ 1024.38 │ 1024.0, 1024.0, 1024.0, 1026.0, 1026.0 │ 69.75  │ 23.3, 28.0, 56.0, 256.0, 256.0 │                                             │ │
│ │ │ asynchronous@3.06 req/sec  │ 1024.33 │ 1024.0, 1024.0, 1024.0, 1025.8, 1026.0 │ 83.66  │ 23.0, 26.0, 58.0, 256.0, 256.0 │                                             │ │
│ │ │ asynchronous@4.90 req/sec  │ 1024.38 │ 1024.0, 1024.0, 1024.0, 1026.0, 1026.0 │ 83.77  │ 23.8, 26.0, 57.0, 256.0, 256.0 │                                             │ │
│ │ │ asynchronous@6.74 req/sec  │ 1024.31 │ 1024.0, 1024.0, 1024.0, 1025.0, 1026.0 │ 76.63  │ 22.0, 25.0, 58.0, 256.0, 256.0 │                                             │ │
│ │ │ asynchronous@8.58 req/sec  │ 1024.36 │ 1024.0, 1024.0, 1024.0, 1026.0, 1026.0 │ 76.33  │ 22.0, 25.0, 57.0, 256.0, 256.0 │                                             │ │
│ │ │ asynchronous@10.42 req/sec │ 1024.35 │ 1024.0, 1024.0, 1024.0, 1026.0, 1026.0 │ 79.26  │ 22.0, 26.0, 57.5, 256.0, 256.0 │                                             │ │
│ │ │ asynchronous@12.26 req/sec │ 1024.34 │ 1024.0, 1024.0, 1024.0, 1026.0, 1026.0 │ 77.65  │ 22.0, 26.0, 57.0, 256.0, 256.0 │                                             │ │
│ │ │ asynchronous@14.10 req/sec │ 1024.35 │ 1024.0, 1024.0, 1024.0, 1026.0, 1026.0 │ 77.59  │ 21.0, 26.0, 57.0, 256.0, 256.0 │                                             │ │
│ │ │ asynchronous@15.94 req/sec │ 1024.37 │ 1024.0, 1024.0, 1024.0, 1026.0, 1026.0 │ 79.33  │ 22.0, 26.0, 56.0, 256.0, 256.0 │                                             │ │
│ │ │ asynchronous@17.78 req/sec │ 1024.35 │ 1024.0, 1024.0, 1024.0, 1026.0, 1026.0 │ 76.88  │ 22.0, 26.0, 58.0, 256.0, 256.0 │                                             │ │
│ │ │ throughput                 │ 1024.38 │ 1024.0, 1024.0, 1024.0, 1026.0, 1026.0 │ 82.61  │ 23.0, 26.2, 58.0, 256.0, 256.0 │                                             │ │
│ │ └────────────────────────────┴─────────┴────────────────────────────────────────┴────────┴────────────────────────────────┘                                             │ │
                                                                                      ...                                       
markurtz commented 2 months ago

This works as intended. The user needs to pass in the tokenizer so we can calculate the full prompt count as accurately as possible. Otherwise, we fall back on Llama 3.1 as a reasonable default that will not affect the numbers too much.

It's not fully safe to assume that the model passed in will always be a publicly available model, or a name that matches a publicly available one, so the intended default is for users to pass whatever tokenizer they are actually using via the tokenizer arg.
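
For example, assuming the --tokenizer flag discussed later in this thread, the invocation from this issue would become:

guidellm --target "http://localhost:8000/v1" --model mistralai/Mistral-7B-Instruct-v0.3 --tokenizer mistralai/Mistral-7B-Instruct-v0.3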

mgoin commented 2 months ago

@markurtz can you require that a tokenizer is provided, then? It seems like bad behavior to use Llama 3.1 silently.

eldarkurtic commented 2 months ago

How do we feel about following the standard behavior that has already been adopted by the most popular libraries these days (transformers, vllm, llm-foundry, lm-evaluation-harness, etc.), where the default is to load the tokenizer of the given model, but we leave an option for the user to explicitly override it with the --tokenizer arg (in case they really want to try something exotic)?

For example, if a user specifies model=mistralai/Mistral-7B-Instruct-v0.3, then we load its corresponding tokenizer.
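
A minimal sketch of that resolution order, assuming the Hugging Face AutoTokenizer API (the resolve_tokenizer name is illustrative, not guidellm's actual code):

from transformers import AutoTokenizer

def resolve_tokenizer(model: str, tokenizer: str | None = None):
    # An explicit --tokenizer always wins; otherwise fall back to the model name,
    # mirroring the convention in transformers, vllm, and lm-evaluation-harness.
    return AutoTokenizer.from_pretrained(tokenizer if tokenizer is not None else model)

With that, resolve_tokenizer("mistralai/Mistral-7B-Instruct-v0.3") would return the Mistral tokenizer by default.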

markurtz commented 2 months ago

@eldarkurtic, thanks! I'm worried about people evaluating private models or iterations on public models and the system crashing with a more obscure error when it cannot access the tokenizer for a private model. Given the tokenizers are relatively similar, the plan was that a Llama 3.1 base would be accurate enough for most benchmarks if they did not supply one. Let me see if I can rework the logic in the main script to raise a helpful error in this case and instruct the user to pass in the tokenizer.
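
A rough sketch of what that check could look like (the function name and message here are illustrative assumptions, not the eventual implementation):

from transformers import AutoTokenizer

def load_tokenizer_or_error(model: str, tokenizer: str | None = None):
    source = tokenizer or model
    try:
        return AutoTokenizer.from_pretrained(source)
    except Exception as err:
        # Surface a clear, actionable message instead of an obscure download/auth error.
        raise ValueError(
            f"Could not load a tokenizer from '{source}'. If the model is private or "
            "not hosted on the Hugging Face Hub, pass --tokenizer with a local path or "
            "a compatible public tokenizer."
        ) from err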

eldarkurtic commented 2 months ago

My main concern with using a tokenizer from a different model is the possibility of introducing errors that are difficult for end users to identify. For example, if two models have different vocabulary sizes (Mistral with 32k and Llama with 128k), there is a high chance that the Llama tokenizer will produce invalid tokens for the Mistral model (all tokens > 32k). This could lead to indexing issues in the input embedding layer, as embedding vectors for tokens > 32k do not exist in the Mistral model.
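
To make the concern concrete, a quick check along these lines (illustrative; the exact ids depend on the input text) would surface ids that fall outside Mistral's 32k vocabulary:

from transformers import AutoTokenizer

llama_tok = AutoTokenizer.from_pretrained("neuralmagic/Meta-Llama-3.1-8B-FP8")
mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

ids = llama_tok("An example benchmark prompt.")["input_ids"]
# Any id at or above Mistral's vocab size has no embedding row in the Mistral model.
print([i for i in ids if i >= mistral_tok.vocab_size])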

markurtz commented 2 months ago

Ah, @eldarkurtic, the results from the tokenizer are not passed through to the server. The tokenizer is used purely for calculating the prompt length in tokens, so that the correct amount of text is sent to the server.
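
In other words, the tokenizer only controls how much text goes into each emulated prompt, roughly along these lines (an illustrative sketch, not guidellm's implementation):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

def trim_to_token_count(text: str, target_tokens: int) -> str:
    # Encode, truncate to the desired prompt length in tokens, and decode back to text.
    # Only the decoded text is sent to the server; the token ids never leave the client.
    ids = tok(text, add_special_tokens=False)["input_ids"][:target_tokens]
    return tok.decode(ids)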

eldarkurtic commented 2 months ago

Oh nice, then this is definitely not going to be an issue. Thanks for clarifying @markurtz .