Closed kiranvholla closed 1 month ago
I think you are using an outdated MLX LM. Try pip install -U mlx-lm and see if that resolves the issue. If not, I can reopen and investigate more.
It is a new machine and the libraries were installed a couple of weeks ago. As I use a conda env, I did a conda update mlx-lm, which forces an upgrade (equivalent to pip install -U mlx-lm). The library version is unchanged before and after the upgrade, at 0.16.1. The issue is reproducible only for multi-part models, due to the way the .safetensors weights are picked up in the utils.py code. You may want to have another look at the issue description. For now I have deleted the non-quantized safetensors files and moved on.
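To illustrate the failure mode described above, here is a minimal sketch (not the actual mlx-lm code) of how glob-based shard discovery behaves when quantized and non-quantized weight files share one directory: a pattern like `*.safetensors` matches both, so a loader that merges every match ends up with two incompatible tensors competing for the same parameter names.

```python
from pathlib import Path
import tempfile

def collect_shards(model_dir):
    """Return every *.safetensors file in model_dir, the way a glob-based loader would."""
    return sorted(p.name for p in Path(model_dir).glob("*.safetensors"))

# A directory holding both the original multi-part shards and a quantized
# model file yields a mixed list; merging all of them produces conflicting
# shapes for the same weight names.
with tempfile.TemporaryDirectory() as d:
    for name in ("model-00001-of-00002.safetensors",
                 "model-00002-of-00002.safetensors",
                 "model.safetensors"):  # stand-in for a quantized output file
        (Path(d) / name).touch()
    print(collect_shards(d))
    # → ['model-00001-of-00002.safetensors',
    #    'model-00002-of-00002.safetensors',
    #    'model.safetensors']
```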
Maybe you should try clearing mlx_model (rm -rf mlx_model) and redoing the quantization? I wasn't able to reproduce your issue. I quantized the same model and it ran fine for me.
The models should indeed be kept in different directories. If you quantize and save a model into the same directory as a non-quantized model, the loading will have issues.
Thanks Awni. It is possible that you quantized directly (without first generating). In that case the non-quantized files are likely never downloaded, and the program only has to deal with the quantized version. In my case, I ran a generation first. That generation command automatically downloaded the non-quantized safetensors files, so I ended up with both non-quantized and quantized versions in my local directory, which caused the loading issue mentioned above. It is a very minor irritant rather than a bug, but the error message threw me off track for a while before I could get to the root of the issue. As you rightly said, just deleting or moving the quantized files from the .\mlx-lm directory fixed the issue. Thanks for your quick support and thanks for the excellent framework.
Steps to reproduce:
1. Use a standard generation before quantizing: python -m mlx_lm.generate --model microsoft/Phi-3-mini-4k-instruct --max-tokens 2048 --prompt "<|user|>\nCan you introduce yourself<|end|>\n<|assistant|>"
2. Confirm this works fine (no env issues, missing files, etc.). Note that the downloaded model is multi-part: Git LFS has a 5 GB limit, so the ~7 GB model ships as model-00001-of-00002.safetensors and model-00002-of-00002.safetensors.
3. Quantize with default settings: python -m mlx_lm.convert --hf-path microsoft/Phi-3-mini-4k-instruct -q
4. Run the standard generation prompt again (this time with the quantized LM): python -m mlx_lm.generate --model ./mlx_model/ --max-tokens 2048 --prompt "<|user|>\nCan you introduce yourself<|end|>\n<|assistant|>"
5. We get the error: ValueError: Expected shape (32064, 384) but received shape (32064, 3072) for parameter model.embed_tokens.weight. The error originates from utils.py: model = load_model(model_path, lazy, model_config)
6. Repeat steps 1-5 for a <5 GB model, say OpenELM-270M. No errors are observed. This again eliminates most env-related problems (version issues etc.). We either have a quantization issue specific to Phi-3-mini, or an issue specific to multi-part models.
Root source from which the issue is (likely) stemming:
Possible solution:
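One possible direction, sketched below (this is not the mlx-lm implementation, and the helper name is hypothetical): when a model.safetensors.index.json is present, load only the shard files its weight_map names, so stray full-precision or quantized files sharing the directory are ignored, and fall back to a glob only when there is no index.

```python
import json
from pathlib import Path

def shards_to_load(model_dir):
    """Hypothetical helper: prefer the shards listed in the safetensors index
    file over a bare *.safetensors glob, so unrelated weight files that happen
    to sit in the same directory are never picked up."""
    model_dir = Path(model_dir)
    index = model_dir / "model.safetensors.index.json"
    if index.exists():
        weight_map = json.loads(index.read_text())["weight_map"]
        return sorted(set(weight_map.values()))
    return sorted(p.name for p in model_dir.glob("*.safetensors"))
```

With this approach, the multi-part Phi-3 directory from the reproduction above would resolve to only model-00001-of-00002.safetensors and model-00002-of-00002.safetensors, even if a quantized model.safetensors were saved alongside them.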