Sorry for this, I'll take a look on Monday and try to reproduce. It should be taking the shard into account: https://github.com/okuvshynov/slowllama/blob/main/loader.py#L190 -- see the 'subset'.
I did reproduce it. It happened because https://github.com/okuvshynov/slowllama/commit/a12b182bc497718edf0a3a281769815d3585063e#diff-86393d83b45516ef8888170f0149c4c74625a15d65eb084d7f74df2bbf41d1f1L14-L20 changed the names of the keys. These names are used both for reading the original model and for merging LoRA back. The loading code was updated accordingly, but the merging code was not, so we got wrong sharding on merge. Let me test it again end-to-end, and I'll update the code.
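For anyone reading along, here is a minimal sketch of the idea (hypothetical key and function names, not the actual slowllama code): the merge has to slice the full-size LoRA delta down to this shard's subset, using the same key-to-shard-dimension mapping that the loader uses. If the key names drift out of sync between the two paths, a sharded weight gets treated as replicated and the shapes no longer match.

```python
import torch

# Hypothetical mapping (not the real slowllama table): each checkpoint key
# maps to the dimension along which the consolidated files shard it.
SHARD_DIM = {
    "attention.wq.weight": 0,  # split along the output dimension
    "attention.wo.weight": 1,  # split along the input dimension
    # ... remaining keys
}

def merge_lora_into_shard(shard_state, full_delta, key, n_shards, shard_idx):
    """Add the full-size LoRA delta (B @ A) to one shard of the base weight."""
    dim = SHARD_DIM.get(key)
    if dim is None:
        # replicated weights (e.g. norms) are identical in every shard
        shard_state[key] += full_delta
        return
    # take only the slice of the delta that belongs to this shard (the 'subset')
    per_shard = full_delta.shape[dim] // n_shards
    subset = torch.narrow(full_delta, dim, shard_idx * per_shard, per_shard)
    shard_state[key] += subset
```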
Now it works! Thank you for fixing the problem so quickly.
A tensor size mismatch error occurred when I tried to merge the LoRA checkpoint for the llama 2 13B model.
Llama 2 13B has two consolidated checkpoints, so do we need to take the shard into account when merging the weights?
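For context, a rough sketch of the mismatch (the dimensions are illustrative for the 13B config, and this is not the actual merge code): each of the two consolidated files holds only a slice of a sharded weight, so adding a full-size LoRA delta to one shard fails.

```python
import torch

# wq for 13B is roughly [5120, 5120]; with two shards each file stores a
# [2560, 5120] slice split along the output dimension.
shard_wq = torch.zeros(2560, 5120)    # one of the two shards
full_delta = torch.zeros(5120, 5120)  # LoRA delta built for the full weight

try:
    shard_wq += full_delta  # size mismatch: delta must be sliced per shard first
except RuntimeError as e:
    print(e)
```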