[Open] fakerybakery opened this issue 11 months ago
This would be a nice feature.
Count me as interested in this.
It would be great to convert the model files and adapter to a GGUF file.
Yeah. MLX is super nice, but it is missing the "deploy" part: what do you do once you like your end result and want other people to enjoy it too?
Merging is implemented here https://github.com/mzbac/mlx-lora but I haven't yet found how to convert to GGUF.
That's not yet supported. We have some ongoing work for GGUF support, see e.g. https://github.com/ml-explore/mlx/pull/350
A question from an ignorant person, but why is the MLX format different from GGUF? Is there any place I can read about that?
MLX has multiple "formats" that we save arrays in. The docs are a bit scattered, but you can find the save/load function docs, for example on the ops page.
We currently support the standard NumPy format (along with zip and compressed zip archives) and safetensors. GGUF is in the pipeline.
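For context, a minimal sketch of the save/load APIs, assuming the `mlx.core` function names below (check the ops docs for your version):

```python
import mlx.core as mx

a = mx.arange(8)

# Single array -> standard NumPy .npy file
mx.save("array.npy", a)

# Several arrays -> NumPy .npz archive (optionally compressed)
mx.savez("arrays.npz", weights=a)
mx.savez_compressed("arrays_compressed.npz", weights=a)

# Dict of arrays -> safetensors
mx.save_safetensors("arrays.safetensors", {"weights": a})

# mx.load picks the loader based on the file extension
loaded = mx.load("arrays.safetensors")
```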
Is there a way to serve MLX over a web socket, like LM Studio does? I'm curious whether I could serve my own model via MLX to other apps.
Thank you @awni. MLX fine-tuning works very well on Mistral. It's a pity we can't get a GGUF compatible with llama.cpp, or maybe reverse the quantization back to the HF format?
If the GGUF PR is merged, then MLX -> GGUF -> reverse the GGUF convert.py script to create an HF model? The convert.py script in llama.cpp seems quite complicated, but it looks possible.
Succeeded by using fuse.py:

1. `python fuse.py --model mlx_model --save-path ./fuse --adapter-file adapter.npz`
2. Then rename `weights.00.safetensors` to `model.safetensors`.
3. The convert.py from llama.cpp works fine afterward: `python convert.py ./fuse`
4. `./quantize ./fuse/ggml-model-f16.gguf ./fuse/modelq5.gguf q5_0`
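For anyone who wants to script this, here is a consolidated sketch of the steps above as a single Python script, run from mlx-examples/lora. The paths (`./fuse`, `../llama.cpp`) and the adapter file name are assumptions, so adjust them to your setup:

```python
import subprocess
from pathlib import Path

fused = Path("./fuse")            # output directory for the fused model (assumption)
llama_cpp = Path("../llama.cpp")  # cloned and built llama.cpp checkout (assumption)

# 1. Fuse the LoRA adapter into the base model with mlx-examples' fuse.py.
subprocess.run(
    ["python", "fuse.py", "--model", "mlx_model",
     "--save-path", str(fused), "--adapter-file", "adapter.npz"],
    check=True,
)

# 2. Rename the sharded safetensors file so llama.cpp's convert.py finds it.
(fused / "weights.00.safetensors").rename(fused / "model.safetensors")

# 3. Convert the fused model to an f16 GGUF file (written into ./fuse).
subprocess.run(["python", str(llama_cpp / "convert.py"), str(fused)], check=True)

# 4. Quantize to q5_0, as in the command above.
subprocess.run(
    [str(llama_cpp / "quantize"),
     str(fused / "ggml-model-f16.gguf"),
     str(fused / "modelq5.gguf"),
     "q5_0"],
    check=True,
)
```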
@l0d0v1c: I dropped "./fuse" from the `python fuse.py` step, reformatted the hyphens, and got that to work. That second part has nothing to do with MLX, correct? I have to get llama.cpp to do the GGUF conversion after renaming the weights.00.safetensors file?
Yes exactly
Can you outline the steps you took in detail? We can see which ones we can improve on our end. For example, we could easily change the naming convention to `model.safetensors`, which might make one step simpler. We could also provide a dequantize option in fuse.py.
@l0d0v1c I'm struggling with this (I'm a linguist with no computer/data science training). I've cloned the llama.cpp repo. If the fused/renamed model were in /Users/williammarcellino/mlx-examples/lora/lora_fused_model_GrKRoman_1640, how would I format a command to convert it to GGUF? Thanks in advance for any help :)
@USMCM1A1 You have to clone the llama.cpp repo; then `make` is enough on a Mac.

1. Rename `weights.00.safetensors` to `model.safetensors`.
2. `python convert.py thedirectoryofyourmodel` will produce a file `ggml-model-f16.gguf` in the same directory.
3. Then you can use `./quantize thedirectoryofyourmodel/ggml-model-f16.gguf Thefinal.gguf q4_0`.
In my experiments on an MLX fine-tuned model, q8_0 is necessary instead of q4_0.
@awni Changing the naming convention is a good idea. Another idea is to allow converting just the LoRA adapter to GGUF.
@USMCM1A1 My project is also linguistic (Ancient Greek). I'm not a computer scientist either, but I play with the buttons.
@l0d0v1c Awesome, that worked! I have a working gguf_q8 version up and running in LM Studio 😊 Thank you so much.
Also: my fine-tune happens to be on the classical world (Hellenic & Roman).
@USMCM1A1 I work on an AI able to deal with the philosophy of Diogenes and Antisthenes. The results are just incredible. Happy you succeeded. I sent you a LinkedIn invitation to share about our... unusual subject.
Hi, is it possible to convert a LoRA model trained with MLX back into the Hugging Face format to publish on the Hugging Face Hub, and preferably merge it with the main model? Thank you!