ml-explore / mlx-examples

Examples in the MLX framework
MIT License

Generation fails after quantization for multi-part models #934

Closed: kiranvholla closed this issue 1 month ago

kiranvholla commented 1 month ago

Steps to reproduce:

  1. Run a standard generation before quantizing: python -m mlx_lm.generate --model microsoft/Phi-3-mini-4k-instruct --max-token 2048 --prompt "<|user|>\nCan you introduce yourself<|end|>\n<|assistant|>"

  2. Confirm this works fine (no env issues, missing files, etc.). Note that the downloaded model is multi-part: I believe Git LFS has a limit of 5 GB, so the 7.X GB model ships as model-00001-of-00002.safetensors and model-00002-of-00002.safetensors.

  3. Quantize with default settings: python -m mlx_lm.convert --hf-path microsoft/Phi-3-mini-4k-instruct -q

  4. Run the standard generation prompt again, this time with the quantized model: python -m mlx_lm.generate --model ./mlx_model/ --max-token 2048 --prompt "<|user|>\nCan you introduce yourself<|end|>\n<|assistant|>"

  5. We get an error: ValueError: Expected shape (32064, 384) but received shape (32064, 3072) for parameter model.embed_tokens.weight. The error originates from utils.py, at model = load_model(model_path, lazy, model_config). (See the arithmetic sketch after this list for why the two shapes differ by a factor of 8.)

  6. Repeat steps 1-5 for a model under 5 GB, say OpenELM-270M. No errors are observed. This again eliminates most env-related problems (version issues etc.). We either have a quantization issue specific to Phi-3-mini or an issue specific to multi-part models.
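For context on the error in step 5: the two shapes differ by exactly the packing factor of 4-bit quantization. A minimal arithmetic sketch, assuming the convert defaults of 4-bit weights packed into 32-bit integers:

```python
# Phi-3-mini's embedding weight is (32064, 3072) in float16.
# With 4-bit quantization, MLX packs 32 / 4 = 8 values into each uint32,
# so the stored last dimension shrinks by a factor of 8.
vocab_size, hidden_size = 32064, 3072
bits, container_bits = 4, 32              # assumed defaults for `convert -q`
packed_dim = hidden_size * bits // container_bits
print((vocab_size, packed_dim))           # (32064, 384) -> the "expected" shape
```

So the loader expects a quantized (packed) embedding but receives the original float16 one, which suggests quantized and non-quantized weights are being mixed at load time.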

Root source from which the issue is (likely) stemming: the weight-loading code in utils.py picks up every .safetensors file found in the model directory, so when quantized and non-quantized shards end up in the same directory the wrong weights are loaded against the quantized config.
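A minimal sketch of the kind of shard loading described above (a hypothetical simplification; the function name and the exact logic inside mlx_lm's utils.py are assumptions):

```python
import glob
import mlx.core as mx


def load_weights(model_path: str) -> dict:
    # Hypothetical simplification: every *.safetensors file in the directory
    # is treated as a shard of the same model. Stale non-quantized shards
    # sitting next to a freshly quantized model therefore get merged into the
    # weight dict and later fail the shape check against the quantized config.
    weights = {}
    for shard in glob.glob(f"{model_path}/*.safetensors"):
        weights.update(mx.load(shard))
    return weights
```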

Possible solution: keep the quantized output in a directory of its own, or make sure any non-quantized .safetensors shards are removed from that directory before loading.

awni commented 1 month ago

I think you are using an outdated MLX LM. Try pip install -U mlx-lm and see if that resolves the issue. If not, I can reopen and investigate more.

kiranvholla commented 1 month ago

It is a new machine and the libraries were installed a couple of weeks back. As I use a conda env, I did a conda update mlx-lm, which forces an upgrade (equivalent to pip install -U mlx-lm). The library version is unchanged before and after the upgrade, at 0.16.1. The issue is reproducible only for multi-part models, due to the way the .safetensors weights are picked up in the utils.py code. You may want to have another look at the issue description. For now I have deleted the non-quantized safetensors files and moved on.
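A quick way to double-check which mlx-lm version the active environment actually resolves (generic Python; nothing beyond the package name is assumed):

```python
# Print the mlx-lm version seen by the interpreter that runs mlx_lm.generate,
# to rule out a stale install shadowing the upgraded one.
from importlib.metadata import version

print(version("mlx-lm"))  # 0.16.1 in the report above
```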

awni commented 1 month ago

Maybe you should try clearing mlx_model (rm -rf mlx_model) and redoing the quantization? I wasn't able to reproduce your issue. I quantized the same model and it ran fine for me.

The models should indeed be kept in different directories. So if you quantize and save a model into the same directory as a non-quantized model, the loading will have issues.
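One way to guarantee that is to point convert at a dedicated output directory. A short sketch assuming the Python convert() API mirrors the CLI flags (the mlx_path and quantize keyword names are assumptions here, and the output directory name is made up):

```python
# Write the quantized model into its own directory so it never shares a
# folder with the original float16 shards.
from mlx_lm import convert

convert(
    "microsoft/Phi-3-mini-4k-instruct",
    mlx_path="phi3-mini-4bit",  # hypothetical dedicated output directory
    quantize=True,
)
```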

kiranvholla commented 1 month ago

Thanks Awni. It is possible that you quantized directly (without generating first). In that case the non-quantized files are likely never downloaded, and the program only has to deal with the quantized version. In my case I ran a generation first, and that command automatically downloaded the non-quantized safetensors files. So I ended up with both the non-quantized and the quantized versions in my local directory, and that caused the loading issue mentioned above. It is a very minor irritant and not a bug, but the error message threw me off track for a while before I got to the root of the issue. As you rightly said, just deleting or moving files so that the quantized and non-quantized versions no longer share the same directory (./mlx-lm in my case) fixed the issue. Thanks for your quick support and for the excellent framework.
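For anyone who hits the same ValueError, a small diagnostic along these lines can reveal a mixed directory (a generic sketch, not part of mlx-lm; mlx_model is just the default convert output path):

```python
# List every safetensors shard in the model directory and show the shape and
# dtype of the embedding weight it contains. Quantized shards store packed
# uint32 weights, non-quantized shards store float16, so a mix is easy to spot.
from pathlib import Path
import mlx.core as mx

for shard in sorted(Path("mlx_model").glob("*.safetensors")):
    weights = mx.load(str(shard))
    w = weights.get("model.embed_tokens.weight")
    if w is not None:
        print(f"{shard.name}: shape={w.shape}, dtype={w.dtype}")
```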