ml-explore / mlx-examples

Examples in the MLX framework

Convert adds an additional token (= token mismatch with the base model) #542

Open ai-made-approachable opened 5 months ago

ai-made-approachable commented 5 months ago

When I run mlx_lm.convert on berkeley-nest/Starling-LM-7B-alpha, my MLX model suddenly has 32003 tokens instead of 32002. This creates issues if you want to train and later export a .gguf file via llama.cpp.

python -m mlx_lm.convert \
--hf-path berkeley-nest/Starling-LM-7B-alpha \
--mlx-path /Volumes/T9/mlx_models/starling-lm7b-alpha-8bit \
-q \
--q-group-size 64 \
--q-bits 8 \
--dtype float16
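
A quick way to confirm the mismatch (my sketch, not from the thread; the local path is the --mlx-path used above) is to load both tokenizers with transformers and compare them. Note that if <sep> is already injected when the HF tokenizer is instantiated (see the comment further down), both counts may print 32003:

from transformers import AutoTokenizer

# Original HF checkpoint vs. the converted MLX folder (--mlx-path above)
hf_tok = AutoTokenizer.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")
mlx_tok = AutoTokenizer.from_pretrained("/Volumes/T9/mlx_models/starling-lm7b-alpha-8bit")

# Expected 32002 vs. 32003; the stray token sits in the added vocab
print(len(hf_tok), len(mlx_tok))
print(mlx_tok.get_added_vocab())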

Original model's added_tokens.json (https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha/blob/main/added_tokens.json)

{
  "<|end_of_turn|>": 32000,
  "<|pad_0|>": 32001
}

added_tokens.json after converting to MLX

{
  "<sep>": 32002,
  "<|end_of_turn|>": 32000,
  "<|pad_0|>": 32001
}
mzbac commented 5 months ago

I think it has something to do with the HF tokenizer behavior. I can see that <sep> is in https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha/blob/main/tokenizer_config.json#L55, but it doesn't exist in https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha/blob/main/tokenizer.json. Somehow it has been added as a new special token.
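
A plausible mechanism (my sketch, not confirmed in the thread): tokenizer_config.json declares <sep> as the sep_token, but the token has no id in tokenizer.json, so transformers allocates the next free id (32002) for it when the tokenizer is instantiated, and mlx_lm.convert then saves the tokenizer with that extra entry. If so, one possible workaround is to drop the stray declaration from a local copy of the model before converting:

import json
from pathlib import Path

# Hypothetical local snapshot of berkeley-nest/Starling-LM-7B-alpha
cfg_path = Path("Starling-LM-7B-alpha/tokenizer_config.json")
cfg = json.loads(cfg_path.read_text())

# Remove the sep_token declaration that has no matching entry in
# tokenizer.json, so transformers has nothing new to allocate an id for
cfg.pop("sep_token", None)
cfg_path.write_text(json.dumps(cfg, indent=2))

mlx_lm.convert accepts a local directory for --hf-path, so pointing it at the patched snapshot should then produce a tokenizer with the original 32002 entries.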