ml-explore / mlx-examples

Examples in the MLX framework
MIT License

[Feature] Export Lora Adapters as GGML #816

Open rmarnold opened 4 weeks ago

rmarnold commented 4 weeks ago

llama.cpp dropped support for converting LoRA adapters to GGML. It would be very useful if we could use adapters with llama.cpp directly instead of fusing or merging them into the fine-tuned model.

awni commented 3 weeks ago

Can you say more about what you are looking for?

Is it a separate GGUF file which contains the adapters, so that you can load the base model GGUF as well as the adapter GGUF in llama.cpp? Does llama.cpp support that?

rmarnold commented 3 weeks ago

@awni, yes, in llama.cpp you can specify `--lora ggml-adapter-model.bin`. The issue is that after training, mlx_lm outputs `adapter.safetensors`, which llama.cpp does not recognize. I know for certain that the supported adapter format is GGML, but I have read there has been work to support GGUF.
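For reference, loading an adapter with that flag looks roughly like this (file names and prompt are placeholders, not outputs of any real run):

```shell
# Run llama.cpp inference with a base GGUF model plus a LoRA adapter
# in the legacy GGML/GGLA binary format. Paths are illustrative only.
./main -m models/base-model.gguf \
       --lora ggml-adapter-model.bin \
       -p "Your prompt here"
```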

yonomitt commented 2 weeks ago

It looks like it's a file format called GGLA (possibly a simplified version of GGML? or GGUF?):

https://github.com/ggerganov/llama.cpp/blob/21be9cab94e0b5b53cb6edeeebf8c8c799baad03/examples/export-lora/export-lora.cpp#L225

If I'm reading this correctly, the format is something like this:

- HEADER
- TENSORS (one after another), each consisting of:
  - ONE TENSOR METADATA
  - ONE TENSOR DATA (aligned to 32 bytes)

(I'm not entirely certain of the format of the data here... if/when I figure more out, I'll add it)
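Based on my reading of the linked `export-lora.cpp`, the header seems to be four little-endian uint32 values: a magic number, a file version, the LoRA rank, and the LoRA alpha. The sketch below round-trips that assumed layout to make it concrete; the magic constant and field order are my inference from the llama.cpp sources, not a confirmed spec:

```python
import io
import struct

# Magic appears to be the ASCII bytes "ggla" read as a little-endian
# uint32 (inferred from llama.cpp; treat as an assumption).
GGLA_MAGIC = 0x67676C61

def write_ggla_header(buf, lora_r, lora_alpha, version=1):
    # magic, file version, LoRA rank, LoRA alpha: four little-endian uint32s
    buf.write(struct.pack("<4I", GGLA_MAGIC, version, lora_r, lora_alpha))

def read_ggla_header(buf):
    magic, version, lora_r, lora_alpha = struct.unpack("<4I", buf.read(16))
    if magic != GGLA_MAGIC:
        raise ValueError(f"not a GGLA file (magic = {magic:#x})")
    return {"version": version, "lora_r": lora_r, "lora_alpha": lora_alpha}

# Round-trip a synthetic header to sanity-check the assumed layout.
f = io.BytesIO()
write_ggla_header(f, lora_r=8, lora_alpha=16)
f.seek(0)
print(read_ggla_header(f))  # {'version': 1, 'lora_r': 8, 'lora_alpha': 16}
```

After the header come the per-tensor records (metadata, then data padded to a 32-byte boundary), which I haven't fully worked out yet.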