ray-project / llmperf

LLMPerf is a library for validating and benchmarking LLMs
Apache License 2.0

bug of counting output tokens #35

Open irasin opened 9 months ago

irasin commented 9 months ago

In https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L62, you have

get_token_length = lambda text: len(tokenizer.encode(text))

and then use it to count the tokens in the generated text, like

num_output_tokens = get_token_length(gen_text)

However, the tokenizer adds special tokens during the encoding phase. Here is an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "hello world"

enc = tokenizer.encode(text)
print(enc) # [1, 22172, 3186]

dec = tokenizer.decode(enc)
print(dec) # <s> hello world

Token 1 is the BOS token (<s>) that the tokenizer prepends, so it should not be counted as a generated token here.

BTW, different models have different special-token rules, so there is no single easy way to get the true token count of the generated text.
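One possible workaround, as a minimal sketch rather than a confirmed fix for llmperf: Hugging Face tokenizers accept `add_special_tokens=False` in `encode`, which skips BOS/EOS insertion. The `ToyTokenizer` class and the `count_generated_tokens` helper below are hypothetical names used only for illustration. The toy class stands in for a real tokenizer so the snippet runs without downloading a model.

```python
class ToyTokenizer:
    """Hypothetical stand-in that mimics a Hugging Face tokenizer prepending BOS."""

    bos_token_id = 1

    def encode(self, text, add_special_tokens=True):
        # One fake id per whitespace-separated word; real tokenizers use subwords.
        ids = [100 + i for i, _ in enumerate(text.split())]
        return [self.bos_token_id] + ids if add_special_tokens else ids


def count_generated_tokens(tokenizer, text):
    """Count the tokens in `text` without special tokens such as BOS/EOS."""
    return len(tokenizer.encode(text, add_special_tokens=False))


tokenizer = ToyTokenizer()
naive_count = len(tokenizer.encode("hello world"))  # includes the BOS token
fixed_count = count_generated_tokens(tokenizer, "hello world")
print(naive_count, fixed_count)  # 3 2
```

With a real tokenizer, the same helper could be passed the object returned by `AutoTokenizer.from_pretrained(...)`. This is still an approximation: re-encoding decoded text is not guaranteed to reproduce the exact token ids the model generated.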