ml-explore / mlx-examples

Examples in the MLX framework
MIT License
5.5k stars, 791 forks

01-ai/Yi-6B-Chat got IndexError: list assignment index out of range #844

Closed yong326 closed 6 days ago

yong326 commented 1 week ago

env: M3 Pro

certifi 2024.6.2
charset-normalizer 3.3.2
filelock 3.15.1
fsspec 2024.6.0
huggingface-hub 0.23.4
idna 3.7
Jinja2 3.1.4
MarkupSafe 2.1.5
mlx 0.15.1
mlx-lm 0.14.3
mpmath 1.3.0
networkx 3.3
numpy 2.0.0
packaging 24.1
pip 24.0
protobuf 5.27.1
PyYAML 6.0.1
regex 2024.5.15
requests 2.32.3
safetensors 0.4.3
sentencepiece 0.2.0
setuptools 69.5.1
sympy 1.12.1
tokenizers 0.19.1
torch 2.3.1
tqdm 4.66.4
transformers 4.41.2
typing_extensions 4.12.2
urllib3 2.2.2
wheel 0.43.0

run: python -m mlx_lm.generate --model 01-ai/Yi-6B-Chat --prompt "hello"

error:
`
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/generate.py", line 161, in <module>
    main()
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/generate.py", line 125, in main
    model, tokenizer = load(
                       ^^^^^
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/utils.py", line 456, in load
    tokenizer = load_tokenizer(model_path, tokenizer_config)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/tokenizer_utils.py", line 327, in load_tokenizer
    return TokenizerWrapper(
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/tokenizer_utils.py", line 250, in __init__
    self._detokenizer = detokenizer_class(tokenizer)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/mlx/lib/python3.11/site-packages/mlx_lm/tokenizer_utils.py", line 125, in __init__
    self.tokenmap[tokenid] = value
IndexError: list assignment index out of range
`
awni commented 1 week ago

Something is off with this tokenizer: len(tokenizer.vocab) does not match the model's vocab_size or the maximum token id. Some token ids have no value defined in the config, e.g. id 3.
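To make the mismatch concrete, here is a minimal sketch that reproduces the same IndexError on a toy vocabulary with a hole at id 3. The toy vocabulary and the helper `find_missing_token_ids` are hypothetical, and the `[None] * len(vocab)` allocation is an assumption inferred from the failing line in the traceback, not copied from the mlx_lm source.

```python
def find_missing_token_ids(vocab):
    """Return (max_id, missing_ids) for a token -> id mapping.

    Hypothetical helper for illustration; not part of mlx_lm.
    """
    ids = set(vocab.values())
    max_id = max(ids)
    missing = sorted(set(range(max_id + 1)) - ids)
    return max_id, missing


# Toy stand-in for tokenizer.vocab: five entries, but ids run up to 5
# because id 3 has no token assigned.
toy_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 4, "world": 5}

max_id, missing = find_missing_token_ids(toy_vocab)
print(f"len(vocab)={len(toy_vocab)}, max id={max_id}, missing ids={missing}")

# Assumed failing pattern: the table is sized by len(vocab), so any id
# greater than or equal to len(vocab) falls outside the list.
tokenmap = [None] * len(toy_vocab)
error = None
try:
    for value, tokenid in toy_vocab.items():
        tokenmap[tokenid] = value
except IndexError as exc:
    error = str(exc)

print("error:", error)
```

Running the same helper on a real tokenizer's vocab should list the ids that have no entry, which is the information needed to decide how to handle them.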

awni commented 1 week ago

I started a discussion in their repo to see what the proper way to handle this is: https://huggingface.co/01-ai/Yi-6B-Chat/discussions/5

We can always hack around it in MLX LM, but it would be good to know whether those missing tokens should be ignored or whether there is some other oversight.
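One possible shape for such a hack, as a sketch only: size the lookup table by the largest token id instead of by the number of vocab entries, and fill the gaps with a placeholder. The function `build_tokenmap` and the empty-string placeholder are assumptions for illustration, not the actual MLX LM change, and whether treating the missing ids as empty strings is correct is exactly the open question above.

```python
def build_tokenmap(vocab, placeholder=""):
    """Build an id -> token list that tolerates gaps in the id range.

    Hypothetical helper: ids with no token keep the placeholder value.
    """
    size = max(vocab.values()) + 1  # cover every id up to the maximum
    tokenmap = [placeholder] * size
    for token, token_id in vocab.items():
        tokenmap[token_id] = token
    return tokenmap


# Toy stand-in for tokenizer.vocab with id 3 undefined.
toy_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 4, "world": 5}

tokenmap = build_tokenmap(toy_vocab)
print(tokenmap)
```

Sizing by the maximum id avoids the IndexError, but it silently maps any undefined id to the placeholder if the model ever emits one.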