dnhkng opened this issue 2 months ago
This should also work for EXL2 models, assuming you duplicate/rename all the sub-keys for each layer as well. The only difference is that the .weight tensors are split into .q_weight, .q_perm, .q_scale, .q_scale_max, .q_groups and .q_group_map.
Changes to the config.json should be the same as for a HF model.
Do note that the quantization of each layer is calibrated to the expected output of the previous layer, not to a copy of the same layer, so it's hard to predict how well this works if you're not starting from the original model and quantizing afterwards. But then I guess merges and self-merges were never really predictable to begin with.
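To make the sub-key point concrete, here is a minimal sketch of expanding a HF-style `.weight` key into the EXL2 tensor keys that would all need duplicating/renaming. The function name is illustrative, not part of any real API:

```python
# Sub-tensors that EXL2 splits each quantized .weight into (per the comment above).
EXL2_SUBKEYS = [".q_weight", ".q_perm", ".q_scale",
                ".q_scale_max", ".q_groups", ".q_group_map"]

def exl2_keys_for(base_key):
    """Given a HF-style '...weight' key, return the EXL2 keys to duplicate."""
    stem = base_key[: -len(".weight")]
    return [stem + suffix for suffix in EXL2_SUBKEYS]
```

So when duplicating a layer, every one of these six tensors has to be copied under the new layer index, not just a single `.weight` entry.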
I used dynamic relayering and it worked well, but duplicating the model layers in the safetensors didn't work. By "dynamic" I mean that I load the weights into memory, duplicate them with copy.copy, and finally rebuild the cache.
In fact, my best results are from dynamic exl2 experiments. I can't get the same great results even with the original BFloat16 weights!
@turboderp Late update, my models now lead the HuggingFace OpenLLM Leaderboard, under the name RYS.
I have some questions on caching, do you have time for an online chat via Gmeet or Zoom?
Hi Turbo,
I am interested in doing some model self-merges. Currently, I do this with a script for Hugging Face models.
Basically, I calculate a layer mapping (new layer index -> source layer index), e.g. to duplicate layer 3: {1: 1, 2: 2, 3: 3, 4: 3, 5: 4}
Then I go through the safetensors files, duplicate the tensors according to this mapping, and generate new keys with the right layer index (eg model.layer.3.up.mlp -> model.layer.6.up.mlp). Finally, I update the model's config.json with the new number of layers. This works for Transformers models, but not for EXL2 models. What else would I need to do?
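The key-remapping step described above could be sketched like this, operating on a plain dict of tensors (names and the `prefix` default are illustrative; real checkpoints may shard across files):

```python
import re

def remap_layer_keys(state_dict, layer_map, prefix="model.layers."):
    """Duplicate per-layer tensors according to layer_map
    (new layer index -> source layer index); copy non-layer
    tensors (embeddings, final norm, lm_head, ...) unchanged."""
    pat = re.compile(re.escape(prefix) + r"(\d+)\.(.+)")
    new_sd = {}
    for new_idx, src_idx in layer_map.items():
        for key, tensor in state_dict.items():
            m = pat.match(key)
            if m and int(m.group(1)) == src_idx:
                new_sd[f"{prefix}{new_idx}.{m.group(2)}"] = tensor
    for key, tensor in state_dict.items():
        if not pat.match(key):
            new_sd[key] = tensor
    return new_sd
```

After writing the remapped dict back out (e.g. with safetensors), `num_hidden_layers` in config.json would be bumped to `len(layer_map)`.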