oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Strange tokenizer results in notebook tab #6558

Open Weker01 opened 2 days ago

Weker01 commented 2 days ago

Describe the bug

I am using the model MN-12B-Mag-Mell-Q8_0.gguf, which has special tokens for ChatML, but I have noticed this with other models too. Token 14, for example, is <|im_start|>.

When I run a llama.cpp server and query the /tokenize endpoint manually with <|im_start|>, I get the expected token 14, but in the notebook tab this is not the case:

Instead, I get the following tokens. Detokenizing them with the llama.cpp server confirms that they do translate back to the text <|im_start|>, and this is also the number of tokens counted in the main (Raw) notebook tab.

1      -  ''
1060   -  '<'
1124   -  '|'
1329   -  'im'
18993  -  '_start'
1124   -  '|'
1062   -  '>'
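For comparison, the direct server queries mentioned above were along these lines (a minimal sketch in Python; it assumes a llama.cpp server on localhost:8080, and the exact request fields may vary between server versions):

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # assumed llama.cpp server address/port

def post(path, payload):
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# /tokenize on the raw ChatML marker returns the single special token (14 here).
print(post("/tokenize", {"content": "<|im_start|>"}))

# /detokenize on the seven ids shown by the notebook tab gives back the same
# text, so only the tokenization differs, not the round-tripped string.
print(post("/detokenize", {"tokens": [1, 1060, 1124, 1329, 18993, 1124, 1062]}))
```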

Is there an existing issue for this?

Reproduction

Download a model with special ChatML tokens, such as MN-12B-Mag-Mell or countless others. Type a special ChatML token into the notebook, go to the Tokens tab, and see that the special token is not produced.

Screenshot

No response

Logs

There are no error logs specific to this as far as I know.

System Info

Arch Linux
Nvidia
Manual install directly from the git repo.
Weker01 commented 2 days ago

Well, I guess it works with llamacpp_HF; there I get the expected tokens. But why can the llama.cpp server do this automatically?
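To illustrate what llamacpp_HF seems to be doing differently, here is a rough sketch with the HF tokenizer it loads (the path is a placeholder, the split_special_tokens option needs a reasonably recent transformers version, and the exact piece ids will differ from the GGUF ones above since this is a different tokenizer implementation):

```python
from transformers import AutoTokenizer

# Placeholder path to the model's HF tokenizer files (what llamacpp_HF loads).
tok = AutoTokenizer.from_pretrained("path/to/MN-12B-Mag-Mell")

text = "<|im_start|>"

# Default behaviour: assuming the ChatML marker is registered as an added
# special token, it is parsed as a single token id.
print(tok(text, add_special_tokens=False)["input_ids"])

# With special-token parsing disabled, the marker is split into plain-text
# pieces, which looks like what the notebook tab does with the plain
# llama.cpp loader.
print(tok(text, add_special_tokens=False, split_special_tokens=True)["input_ids"])
```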