ml-explore / mlx-examples

Examples in the MLX framework
MIT License

Fused & Uploaded Model Losing Coherence #346

Open USMCM1A1 opened 7 months ago

USMCM1A1 commented 7 months ago

I noticed today that when I use python -m mlx_lm.generate the output doesn't match what I get locally using python lora.py. For example, here is the local output using the LoRA adapters:

(base) Williams-MacBook-Pro-3:lora williammarcellino$ python lora.py --model /Users/williammarcellino/mlx-examples/lora/mlx_model \
>                --adapter-file /Users/williammarcellino/mlx-examples/lora/adapters.npz \
>                --max-tokens 250 \
>                --temp .7 \
>                --prompt "
> Q: How does Aristotle define the soul?
> A: "
Loading pretrained model
Total parameters 7243.436M
Trainable parameters 1.704M
Loading datasets
Generating

Q: How does Aristotle define the soul?
A: 

Aristotle defines the soul as "the first act of a living body, which has in itself the faculty of movement and sensation."

This definition encompasses two key elements: 

1)   The soul is 
the first act or principle of life in a living being. 

2)  The soul is 
the faculties of movement and sensation in a living being. 

Let's explore these two aspects further:

1. The Soul as the First Act:

Aristotle views the soul as the first and most fundamental aspect of a living being. 
He argues that the soul is the source of life, and that it is responsible for all 
the activities and functions of a living organism. 

In this sense, the soul is not a separate entity from the body, but rather 
the animating principle that gives rise to life in a living being. 

2. The Soul as the Faculties of Movement and Sensation:

Aristotle defines the soul as the faculties of movement and sensation in a 
living being.

The same prompt and arguments, however, produce a pretty weird, incoherent response from the fused and uploaded model:

(base) Williams-MacBook-Pro-3:mlx_lm williammarcellino$ python -m mlx_lm.generate \
>                --max-tokens 250 \
>                --temp .7 \
> --prompt "
> > Q: How does Aristotle define the soul?
> > A: " \
> --model mlx-community/mistral-7b-v0.2-GreeceRome-v0
Fetching 6 files: 100%|██████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 21290.88it/s]
==========
Prompt: 
> Q: How does Aristotle define the soul?
> A: 

Aristotle defines the soul as the first principle and cause of the body's life-giving activities. 

He considers the soul to be a substance that is inherent in every living body and is the actuality of all its processes. 

The soul is the form and final cause of the organism, and the efficient cause of its operations and growth. 

Aristotle's definition of the soul is complex and multifaceted, encompassing both material and formal aspects. 

==========
Prompt: 52.869 tokens-per-sec
Generation: 19.099 tokens-per-sec
(base) Williams-MacBook-Pro-3:mlx_lm williammarcellino$ 

In particular, it looks like the model forced out a bunch of blank tokens when using mlx_lm.generate.

mzbac commented 7 months ago

This mostly happens with QLoRA due to de-quantization: the fused model will generate slightly different text. I have also noticed that quantizing lm_head sometimes causes significant performance issues in the fused model, especially for smaller models. Here are some options I can think of:
  1. Stop quantizing lm_head.
  2. Provide a load_with_adapter API in mlx-lm so that users can load an adapter with the base model to avoid the de-quantization issue (see the sketch below).
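
A rough sketch of what option 2 could look like, based on how lora.py applies an adapter in memory (load_with_adapter is not an existing mlx-lm API, and the exact import path for LoRALinear depends on the version of the lora example):

# Hypothetical helper, not an existing mlx-lm API: load the base model and
# apply the LoRA adapter in memory instead of fusing it into the weights.
from mlx_lm import load
from models import LoRALinear  # wherever LoRALinear lives in the lora example

def load_with_adapter(model_path, adapter_file, lora_layers=16):
    model, tokenizer = load(model_path)
    # Convert the attention projections in the last lora_layers blocks,
    # mirroring what lora.py does before training.
    for l in model.model.layers[len(model.model.layers) - lora_layers:]:
        l.self_attn.q_proj = LoRALinear.from_linear(l.self_attn.q_proj)
        l.self_attn.v_proj = LoRALinear.from_linear(l.self_attn.v_proj)
    # Load the trained low-rank weights on top of the base weights.
    model.load_weights(adapter_file, strict=False)
    return model, tokenizer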

@Awni, do you have any suggestions or thoughts?

USMCM1A1 commented 7 months ago
  1. Stop quantizing lm_head.
  2. Provide a load_with_adapter API in mlx-lm so that users can load an adapter with the base model to avoid the de-quantization issue.

Thanks @mzbac for responding, but I'm a little confused about the quantizing part. When I converted the base model to mlx I didn't use a --q argument: python convert.py --torch-path /Users/williammarcellino/mlx-examples/llms/mistral/mistral-7B-v0.1

So it shouldn't have been a quantized model, right?

mzbac commented 7 months ago
  1. Stop quantizing lm_head.
  2. Provide a load_with_adapter API in mlx-lm so that users can load an adapter with the base model to avoid the de-quantization issue.

  Thanks @mzbac for responding, but I'm a little confused about the quantizing part. When I converted the base model to mlx I didn't use a --q argument: python convert.py --torch-path /Users/williammarcellino/mlx-examples/llms/mistral/mistral-7B-v0.1

  So it shouldn't have been a quantized model, right?

Yeah, it looks like you're just doing LoRA. In that case, could you please check if you have specified the correct adapter-file path when fusing the model?
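
For reference, the fuse step has to point at the same adapters file used for local generation above; the invocation would look roughly like this (paths are illustrative, and the exact flag names can be confirmed with python fuse.py --help):

python fuse.py --model /Users/williammarcellino/mlx-examples/lora/mlx_model \
               --adapter-file /Users/williammarcellino/mlx-examples/lora/adapters.npz \
               --save-path fused_model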

USMCM1A1 commented 7 months ago

I used the same adapters file.

mzbac commented 7 months ago

During fusing, the LoRA weights are merged into the original linear layers, which may cause slight differences when using the fused model. However, there shouldn't be such a big difference, so I suspect the fusing wasn't done properly somehow.
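
To make that concrete: fusing folds the low-rank update into the base weight once, so the fused linear layer alone reproduces the adapted output. A minimal sketch of the math, following the layout of LoRALinear in the lora example (an illustration, not the exact fuse.py code):

import mlx.core as mx

# LoRA inference computes:   y = x @ W.T + scale * (x @ A) @ B
# Fusing folds the update into the weight once:  W_fused = W + scale * (A @ B).T
def fuse_weight(W, lora_a, lora_b, scale):
    return W + scale * (lora_a @ lora_b).T

W = mx.zeros((8, 16))   # base weight, (out_features, in_features)
A = mx.zeros((16, 4))   # lora_a, (in_features, rank)
B = mx.zeros((4, 8))    # lora_b, (rank, out_features)
W_fused = fuse_weight(W, A, B, scale=2.0)  # same shape as W

Doing that merge in reduced precision can bake small rounding differences into the fused weights, which would explain slight output drift, but not the drastic change in the report above.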

mzbac commented 7 months ago

@USMCM1A1 I double-checked fuse.py and found that there may be a bug when you are not using the default number of LoRA layers. Until I create a PR to fix it, you can update the code at https://github.com/ml-explore/mlx-examples/blob/main/lora/fuse.py#L56 to for l in model.model.layers[len(model.model.layers)-lora_layers:]. Let me know if that fixes the issue.