octoml / mlc-llm

Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.
https://mlc.ai/mlc-llm
Apache License 2.0

[Bug] Empty token can appear at the beginning of a generated sequence #140

Open · masahi opened 9 months ago

masahi commented 9 months ago

It seems that, as of https://github.com/octoml/mlc-llm/pull/107, which introduced detokenize_incrementally from vllm, we very often (or always?) get a blank token at the beginning of each generation, like this:

Generated 0-th sample = ' The House of the Seven Hawks has the director earlier than The Secret Invasion?
Explanation: As we know that Abhishek'

Generated 1-th sample = ' The Nevidito.
Question: Which film has the director who received BAFTA  Orange Rising Star Award in 2019'

Generated 2-th sample = ' The Secret Invasion

Here is the answer for the above question. A vector is a directed line segment or a directed ray that has a defin'

Apparently, vllm has the same problem. Although this is a minor issue, such a token still counts as one token in the output, so we should fix this behavior.

Ailurus1 commented 9 months ago

It looks like this token is actually a "prefix_space" (SPIECE_UNDERLINE) with index 29871 in the llama tokenizer vocabulary. There was some discussion in the transformers repository about the tokenizer's behavior with this token (link), but it seems that the model itself can generate it.
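
For reference, this can be checked directly with the Hugging Face tokenizer API (the checkpoint name below is only an example, not necessarily the one used in mlc-serve):

```python
from transformers import AutoTokenizer

# Any llama tokenizer should show the same mapping; the checkpoint name is an example.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

print(tok.convert_ids_to_tokens(29871))  # '▁' (SPIECE_UNDERLINE, the "prefix space")
print(repr(tok.decode([29871])))         # decodes to an empty or whitespace-only string
```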

vvchernov commented 8 months ago

I have an idea for a workaround:
1. Greedy case: for the prefill output, if the top-1 token is 29871, replace it with the top-2 token; we observed that the top-2 token is the expected next token (but this should be double-checked).
2. Random case: for the prefill output, if token 29871 is among the top tokens, do not use it and instead take the next token after the top token set.
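
A minimal sketch of this idea, assuming the prefill-step logits are available as a 1-D torch tensor; the constant and function name here are illustrative, not part of the actual mlc-serve API:

```python
import torch

PREFIX_SPACE_ID = 29871  # SPIECE_UNDERLINE ("▁") in the llama tokenizer vocabulary

def sample_prefill_token(logits: torch.Tensor, greedy: bool = True) -> int:
    """Sample the first generated token while skipping the bare prefix-space token."""
    # Mask out the prefix-space token so it can never be chosen at the first position.
    logits = logits.clone()
    logits[PREFIX_SPACE_ID] = float("-inf")
    if greedy:
        # Greedy case: this effectively replaces top-1 with top-2 whenever top-1 was 29871.
        return int(torch.argmax(logits))
    # Random case: renormalize over the remaining tokens and sample from them.
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```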

masahi commented 8 months ago

Oh, could this simply be a matter of setting skip_special_tokens=True here?

https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/engine_common.py#L79

@sunggg Any reason we are using skip_special_tokens=False in detokenize_incrementally?
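
For context, here is what the flag controls in the plain Hugging Face decode API (a minimal illustration only; whether the detokenize_incrementally path forwards the flag in exactly the same way is not verified here, and the checkpoint name is an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # example checkpoint

ids = tok("Hello")["input_ids"]                    # typically starts with the <s> BOS token
print(tok.decode(ids, skip_special_tokens=False))  # keeps tokens marked as "special", e.g. <s>
print(tok.decode(ids, skip_special_tokens=True))   # drops tokens marked as "special"
```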

sunggg commented 8 months ago

I thought about it briefly and decided to follow the default setting in vllm, since I do not know its other impacts:

https://github.com/vllm-project/vllm/blob/main/vllm/transformers_utils/tokenizer.py#L191