Closed yong326 closed 6 days ago
Something is up with this tokenizer: `len(tokenizer.vocab)` does not match the model's `vocab_size` / maximum token id. Some token ids have no value defined in the tokenizer config, e.g. id 3.
I started a discussion in their repo to see what the proper way to handle this is: https://huggingface.co/01-ai/Yi-6B-Chat/discussions/5
We can always hack around it in MLX LM, but it would be good to know whether those missing tokens should be ignored or whether there is some other oversight.
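To make the mismatch concrete, here is a minimal sketch of the diagnostic: given a vocab mapping (token → id), list the ids below the maximum that no token maps to. The toy vocab below is made up for illustration; with the real model you would pass `tokenizer.vocab` instead.

```python
# Hypothetical diagnostic sketch: find token ids with no entry in the vocab.
def find_missing_ids(vocab):
    """Return sorted ids in [0, max_id] that no token maps to."""
    ids = set(vocab.values())
    max_id = max(ids)
    return sorted(set(range(max_id + 1)) - ids)

# Toy vocab with a gap at id 3, mirroring the Yi tokenizer's missing entry.
toy_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 4, "world": 5}
print(find_missing_ids(toy_vocab))  # -> [3]
```

With such gaps, `len(vocab)` (5 here) is smaller than `max_id + 1` (6), which is exactly the kind of inconsistency that trips up code sizing buffers by vocab length.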
Env: M3 Pro

```
certifi            2024.6.2
charset-normalizer 3.3.2
filelock           3.15.1
fsspec             2024.6.0
huggingface-hub    0.23.4
idna               3.7
Jinja2             3.1.4
MarkupSafe         2.1.5
mlx                0.15.1
mlx-lm             0.14.3
mpmath             1.3.0
networkx           3.3
numpy              2.0.0
packaging          24.1
pip                24.0
protobuf           5.27.1
PyYAML             6.0.1
regex              2024.5.15
requests           2.32.3
safetensors        0.4.3
sentencepiece      0.2.0
setuptools         69.5.1
sympy              1.12.1
tokenizers         0.19.1
torch              2.3.1
tqdm               4.66.4
transformers       4.41.2
typing_extensions  4.12.2
urllib3            2.2.2
wheel              0.43.0
```
Run:

```
python -m mlx_lm.generate --model 01-ai/Yi-6B-Chat --prompt "hello"
```
Error:

```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/generate.py", line 161, in <module>
    main()
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/generate.py", line 125, in main
    model, tokenizer = load(
                       ^^^^^
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/utils.py", line 456, in load
    tokenizer = load_tokenizer(model_path, tokenizer_config)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/tokenizer_utils.py", line 327, in load_tokenizer
    return TokenizerWrapper(
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/tokenizer_utils.py", line 250, in __init__
    self._detokenizer = detokenizer_class(tokenizer)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/tokenizer_utils.py", line 125, in __init__
    self.tokenmap[tokenid] = value
```
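The failing line suggests the detokenizer indexes a token map by id, which breaks when the maximum id exceeds the number of defined vocab entries. One possible hack (a sketch of the idea only, not the actual mlx_lm implementation; `build_tokenmap` and the placeholder are made up here) is to size the map by the maximum id and fill the gaps with a harmless placeholder:

```python
# Sketch of a possible workaround: size the token map by max id + 1 rather
# than len(vocab), so undefined ids (like id 3 here) map to a placeholder
# instead of raising an index error.
def build_tokenmap(vocab, placeholder=""):
    """Map every id in [0, max_id] to its token, or to `placeholder`."""
    max_id = max(vocab.values())
    tokenmap = [placeholder] * (max_id + 1)
    for token, token_id in vocab.items():
        tokenmap[token_id] = token
    return tokenmap

vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hi": 4}
tokenmap = build_tokenmap(vocab)
print(tokenmap[3])  # -> "" (missing id falls back to the placeholder)
print(tokenmap[4])  # -> "hi"
```

Whether an empty string is the right fallback depends on the upstream answer: if those ids are genuinely unused the placeholder is never emitted, but if they are meant to carry tokens this would silently drop them.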