turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Manual model merges #555

Open dnhkng opened 2 months ago

dnhkng commented 2 months ago

Hi Turbo,

I am interested in doing some model self-merges. Currently, I do this with a script that operates on Hugging Face models.

Basically, I calculate a mapping from new layer index to source layer index, e.g. to duplicate layer 3: {1: 1, 2: 2, 3: 3, 4: 3, 5: 4}.

Then I go through the safetensors files, duplicate the tensors according to this mapping, and generate new keys with the right layer names (e.g. model.layer.3.up.mlp -> model.layer.6.up.mlp). Finally, I update the model's config.json with the new number of layers. This works for transformers models, but not for exl2 models. What else would I need to do?
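
A trimmed-down sketch of that script, assuming a single-shard model.safetensors and the standard HF key layout (the 0-indexed mapping below duplicates layer 3):

```python
import json
from safetensors import safe_open
from safetensors.torch import save_file

# new layer index -> source layer index (0-indexed); layer 3 is duplicated
layer_map = {0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 4}

with safe_open("model.safetensors", framework="pt") as f:
    tensors = {k: f.get_tensor(k) for k in f.keys()}

out = {}
for key, tensor in tensors.items():
    if not key.startswith("model.layers."):
        out[key] = tensor  # embeddings, final norm, lm_head, ...
        continue
    layer, rest = key[len("model.layers."):].split(".", 1)
    # emit one copy of this tensor for every new layer that maps to it
    for new_idx, src_idx in layer_map.items():
        if src_idx == int(layer):
            out[f"model.layers.{new_idx}.{rest}"] = tensor.clone()

save_file(out, "merged.safetensors")
```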

turboderp commented 2 months ago

This should also work for EXL2 models, assuming you duplicate/rename all the sub-keys for each layer as well. The only difference is that the .weight tensors are split into .q_weight, .q_perm, .q_scale, .q_scale_max, .q_groups and .q_group_map.
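
For instance, a quick check along these lines (file name illustrative) can confirm that every quantized weight in the output kept its full set of sub-keys:

```python
from safetensors import safe_open

EXL2_SUFFIXES = (".q_weight", ".q_perm", ".q_scale",
                 ".q_scale_max", ".q_groups", ".q_group_map")

with safe_open("merged.safetensors", framework="pt") as f:
    keys = set(f.keys())

for key in sorted(keys):
    if key.endswith(".q_weight"):
        base = key[: -len(".q_weight")]
        missing = [s for s in EXL2_SUFFIXES if base + s not in keys]
        if missing:
            print(f"{base} is missing sub-keys: {missing}")
```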

Changes to the config.json should be the same as for a HF model.
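
For a single duplicated layer, that amounts to bumping the layer count, e.g. (field name as in Llama-style HF configs):

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

cfg["num_hidden_layers"] += 1  # one duplicated layer

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```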

Do note that the quantization of each layer is calibrated to the expected output of the previous layer, not to a copy of the same layer, so it's hard to predict how well this works if you're not starting from the original model and quantizing afterwards. But then I guess merges and self-merges were never really predictable to begin with.

dnhkng commented 2 months ago

I used dynamic relayering and it worked well, but duplicating the model layers in the safetensors files didn't work. By dynamic, I mean I load the weights into memory, copy.copy the relevant modules, and finally rebuild the cache. Roughly, it looks like the sketch below.
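
This is only a sketch: the model.modules list, the two-modules-per-layer layout, and how the cache sizes itself are assumptions about exllamav2 internals, not documented API.

```python
import copy
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/path/to/exl2-model"
config.prepare()

model = ExLlamaV2(config)
model.load()

# Assumed layout: model.modules = [embedding,
#                                  attn_0, mlp_0, attn_1, mlp_1, ...,
#                                  final_norm, lm_head]
LAYER, PER_LAYER = 3, 2
start = 1 + LAYER * PER_LAYER

# Shallow copies share the underlying weight tensors, so the duplicate
# layer costs no extra VRAM for weights, only for its KV cache slots.
dup = [copy.copy(m) for m in model.modules[start:start + PER_LAYER]]
model.modules[start + PER_LAYER:start + PER_LAYER] = dup

# Rebuild the cache after the edit. If it sizes itself from the config,
# the layer count needs bumping too (again, an assumption about internals;
# per-module layer indices may also need updating so the duplicate gets
# its own cache slot).
model.config.num_hidden_layers += 1
cache = ExLlamaV2Cache(model)
```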

In fact, my best results come from the dynamic exl2 experiments; I can't reproduce them even with the original BFloat16 weights!

dnhkng commented 3 days ago

@turboderp Late update: my models now lead the Hugging Face Open LLM Leaderboard, under the name RYS.

I have some questions about caching; do you have time for an online chat via Google Meet or Zoom?