Sorry for this, I'll take a look on Monday and try to reproduce. It should be taking the shard into account: https://github.com/okuvshynov/slowllama/blob/main/loader.py#L190 -- see the 'subset'.
I did reproduce it. It happened because https://github.com/okuvshynov/slowllama/commit/a12b182bc497718edf0a3a281769815d3585063e#diff-86393d83b45516ef8888170f0149c4c74625a15d65eb084d7f74df2bbf41d1f1L14-L20 changed the names of the keys. These names are used both for reading the original model and for merging LoRA back. The loading code was updated accordingly, but the merging code was not, so we got wrong sharding on merge. Let me test it again end-to-end, and I'll update the code.
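For anyone reading along, here is a minimal sketch of the idea (hypothetical key and function names, not the actual slowllama code): the merge has to slice the full-size LoRA delta down to this shard's subset, using the same key-to-shard-dimension mapping that the loader uses. If the key names drift out of sync between the two paths, a sharded weight gets treated as replicated and the shapes no longer match.

```python
import torch

# Hypothetical mapping (not the real slowllama table): each checkpoint key
# maps to the dimension along which the consolidated files shard it.
SHARD_DIM = {
    "attention.wq.weight": 0,  # split along the output dimension
    "attention.wo.weight": 1,  # split along the input dimension
    # ... remaining keys
}

def merge_lora_into_shard(shard_state, full_delta, key, n_shards, shard_idx):
    """Add the full-size LoRA delta (B @ A) to one shard of the base weight."""
    dim = SHARD_DIM.get(key)
    if dim is None:
        # replicated weights (e.g. norms) are identical in every shard
        shard_state[key] += full_delta
        return
    # take only the slice of the delta that belongs to this shard (the 'subset')
    per_shard = full_delta.shape[dim] // n_shards
    subset = torch.narrow(full_delta, dim, shard_idx * per_shard, per_shard)
    shard_state[key] += subset
```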
Now it works! Thank you for fixing the problem so quickly.
A tensor size mismatch error occurred when I tried to merge the LoRA checkpoint for the llama 2 13B model.
Llama 2 13B has two consolidated checkpoints, so do we need to take the shard into account when merging the weights?
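For context, a rough sketch of the mismatch (the dimensions are illustrative for the 13B config, and this is not the actual merge code): each of the two consolidated files holds only a slice of a sharded weight, so adding a full-size LoRA delta to one shard fails.

```python
import torch

# wq for 13B is roughly [5120, 5120]; with two shards each file stores a
# [2560, 5120] slice split along the output dimension.
shard_wq = torch.zeros(2560, 5120)    # one of the two shards
full_delta = torch.zeros(5120, 5120)  # LoRA delta built for the full weight

try:
    shard_wq += full_delta  # size mismatch: delta must be sliced per shard first
except RuntimeError as e:
    print(e)
```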