ml-explore / mlx-examples

Examples in the MLX framework
MIT License

Generation fails after quantization for multi-part models #934

Closed: kiranvholla closed this issue 1 month ago

kiranvholla commented 1 month ago

Steps to reproduce:

  1. Run a standard generation before quantizing: python -m mlx_lm.generate --model microsoft/Phi-3-mini-4k-instruct --max-token 2048 --prompt "<|user|>\nCan you introduce yourself<|end|>\n<|assistant|>"

  2. Confirm this works fine (no env issues, missing files, etc.). Note that the downloaded model is multi-part: I believe Git LFS has a limit of 5 GB, so the 7.X GB model ships as model-00001-of-00002.safetensors and model-00002-of-00002.safetensors.

  3. Quantize with default settings: python -m mlx_lm.convert --hf-path microsoft/Phi-3-mini-4k-instruct -q

  4. Run the standard generation prompt again, this time with the quantized model: python -m mlx_lm.generate --model ./mlx_model/ --max-token 2048 --prompt "<|user|>\nCan you introduce yourself<|end|>\n<|assistant|>"

  5. We get an error: ValueError: Expected shape (32064, 384) but received shape (32064, 3072) for parameter model.embed_tokens.weight. The error originates from utils.py, at model = load_model(model_path, lazy, model_config). (See the arithmetic sketch after this list for why the two shapes differ by a factor of 8.)

  6. Repeat steps 1-5 for a model under 5 GB, say OpenELM-270M. No errors are observed. This again eliminates most env-related problems (version issues etc.). We either have a quantization issue specific to Phi-3-mini or an issue specific to multi-part models.
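For context on the error in step 5: the two shapes differ by exactly the packing factor of 4-bit quantization. A minimal arithmetic sketch, assuming the convert defaults of 4-bit weights packed into 32-bit integers:

```python
# Phi-3-mini's embedding weight is (32064, 3072) in float16.
# With 4-bit quantization, MLX packs 32 / 4 = 8 values into each uint32,
# so the stored last dimension shrinks by a factor of 8.
vocab_size, hidden_size = 32064, 3072
bits, container_bits = 4, 32              # assumed defaults for `convert -q`
packed_dim = hidden_size * bits // container_bits
print((vocab_size, packed_dim))           # (32064, 384) -> the "expected" shape
```

So the loader expects a quantized (packed) embedding but receives the original float16 one, which suggests quantized and non-quantized weights are being mixed at load time.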

Root source from which the issue is (likely) stemming: the weight-loading code in utils.py picks up every .safetensors file found in the model directory, so when quantized and non-quantized shards end up in the same directory the wrong weights are loaded against the quantized config.
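A minimal sketch of the kind of shard loading described above (a hypothetical simplification; the function name and the exact logic inside mlx_lm's utils.py are assumptions):

```python
import glob
import mlx.core as mx


def load_weights(model_path: str) -> dict:
    # Hypothetical simplification: every *.safetensors file in the directory
    # is treated as a shard of the same model. Stale non-quantized shards
    # sitting next to a freshly quantized model therefore get merged into the
    # weight dict and later fail the shape check against the quantized config.
    weights = {}
    for shard in glob.glob(f"{model_path}/*.safetensors"):
        weights.update(mx.load(shard))
    return weights
```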

Possible solution: keep the quantized output in a directory of its own, or make sure any non-quantized .safetensors shards are removed from that directory before loading.

awni commented 1 month ago

I think you are using an outdated MLX LM. Try pip install -U mlx-lm and see if that resolves the issue. If not, I can reopen and investigate more.

kiranvholla commented 1 month ago

It is a new machine and the libraries were installed a couple of weeks back. As I use a conda env, I did a conda update mlx-lm, which forces an upgrade (equivalent to pip install -U mlx-lm). The library version is unchanged before and after the upgrade, at 0.16.1. The issue is reproducible only for multi-part models, due to the way the .safetensors weights are picked up in the utils.py code. You may want to have another look at the issue description. For now I have deleted the non-quantized safetensors files and moved on.
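A quick way to double-check which mlx-lm version the active environment actually resolves (generic Python; nothing beyond the package name is assumed):

```python
# Print the mlx-lm version seen by the interpreter that runs mlx_lm.generate,
# to rule out a stale install shadowing the upgraded one.
from importlib.metadata import version

print(version("mlx-lm"))  # 0.16.1 in the report above
```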

awni commented 1 month ago

Maybe you should try clearing mlx_model (rm -rf mlx_model) and redoing the quantization? I wasn't able to reproduce your issue. I quantized the same model and it ran fine for me.

The models should indeed be kept in different directories. So if you quantize and save a model into the same directory as a non-quantized model, the loading will have issues.
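One way to guarantee that is to point convert at a dedicated output directory. A short sketch assuming the Python convert() API mirrors the CLI flags (the mlx_path and quantize keyword names are assumptions here, and the output directory name is made up):

```python
# Write the quantized model into its own directory so it never shares a
# folder with the original float16 shards.
from mlx_lm import convert

convert(
    "microsoft/Phi-3-mini-4k-instruct",
    mlx_path="phi3-mini-4bit",  # hypothetical dedicated output directory
    quantize=True,
)
```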

kiranvholla commented 1 month ago

Thanks Awni. It is possible that you quantized directly (without generating first). In that case the non-quantized files are likely never downloaded, and the program only has to deal with the quantized version. In my case I ran a generation first, and that command automatically downloaded the non-quantized safetensors files. So I ended up with both the non-quantized and the quantized versions in my local directory, and that caused the loading issue mentioned above. It is a very minor irritant and not a bug, but the error message threw me off track for a while before I got to the root of the issue. As you rightly said, just deleting or moving files so that the quantized and non-quantized versions no longer share the same directory (./mlx-lm in my case) fixed the issue. Thanks for your quick support and for the excellent framework.
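For anyone who hits the same ValueError, a small diagnostic along these lines can reveal a mixed directory (a generic sketch, not part of mlx-lm; mlx_model is just the default convert output path):

```python
# List every safetensors shard in the model directory and show the shape and
# dtype of the embedding weight it contains. Quantized shards store packed
# uint32 weights, non-quantized shards store float16, so a mix is easy to spot.
from pathlib import Path
import mlx.core as mx

for shard in sorted(Path("mlx_model").glob("*.safetensors")):
    weights = mx.load(str(shard))
    w = weights.get("model.embed_tokens.weight")
    if w is not None:
        print(f"{shard.name}: shape={w.shape}, dtype={w.dtype}")
```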