okuvshynov / slowllama

Finetune llama2-70b and codellama on MacBook Air without quantization
MIT License

RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' #15

Open · aspen01 opened 8 months ago

aspen01 commented 8 months ago

In order to merge the LoRA checkpoint for the Llama 2 7B model, I ran python merge_lora.py.

But an error occurred:

Traceback (most recent call last):
  File "/Users/xxx/llama/slowllama/merge_lora.py", line 14, in <module>
    add_lora(model_path, lora_path, out_model_path)
  File "/Users/xxx/llama/slowllama/loader.py", line 188, in add_lora
    lora = lora_weights[b_key].mm(lora_weights[a_key]) * lora_scale
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

So I modified the code as below, and I got the merged model file.

lora = lora_weights[b_key].to(torch.float32).mm(lora_weights[a_key].to(torch.float32)) * lora_scale

But I wonder whether this is okay or not. Can you give your opinion or the right solution?

okuvshynov commented 8 months ago

Oh interesting, thank you. Let me take a look

okuvshynov commented 8 months ago

What types were you using for finetuning? I think you'll need to double-check that the merged weights produce the same outputs as the non-merged ones (https://github.com/okuvshynov/slowllama?tab=readme-ov-file#merging-lora-weights-back).
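
Roughly the kind of check I mean, as a standalone sketch in plain torch (not slowllama's actual loader code; the shapes, scale, and names are just illustrative):

import torch

torch.manual_seed(0)
d, r = 16, 4
W = torch.randn(d, d)              # frozen base weight
A = torch.randn(r, d) * 0.01       # LoRA A
B = torch.randn(d, r) * 0.01       # LoRA B
scale = 2.0                        # lora_alpha / r
x = torch.randn(3, d)

# non-merged: base path plus the LoRA branch
y_lora = x @ W.t() + (x @ A.t() @ B.t()) * scale

# merged: fold B @ A into the weight, which is what the merge effectively does
y_merged = x @ (W + (B @ A) * scale).t()

print(torch.allclose(y_lora, y_merged, atol=1e-5))   # expect True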

aspen01 commented 8 months ago

I used the types from conf_fp16.py:

adamw_eps = 1e-4
compute_dtype = torch.float16
frozen_dtype = torch.float16

okuvshynov commented 8 months ago

Got it. I think I'll need to try it myself to double-check (we transform weights fp16->fp32->bf16), but if the merged model produces reasonable output it should be ok.
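
For what it's worth, here is a quick standalone sanity check (just a sketch, not part of merge_lora) suggesting the fp32 detour only adds precision to the matmul before casting back to fp16:

import torch

torch.manual_seed(0)
a = torch.randn(64, 64).to(torch.float16)
b = torch.randn(64, 64).to(torch.float16)

# high-precision reference for the same product
ref = a.to(torch.float64) @ b.to(torch.float64)

# the workaround: upcast, multiply, cast back
via_fp32 = (a.to(torch.float32) @ b.to(torch.float32)).to(torch.float16)
print((via_fp32.to(torch.float64) - ref).abs().max())

# native fp16 mm, only where the build supports it on CPU
try:
    native = a @ b
    print((native.to(torch.float64) - ref).abs().max())
except RuntimeError as err:
    print("fp16 mm not supported here:", err)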

aspen01 commented 8 months ago

Sometimes the merged model produces the expected results. But I don't know whether the unexpected results are due to the merged weights or insufficient fine-tuning.

okuvshynov commented 8 months ago

As I understand it, you are doing the finetuning on CPU? I'm not sure there's any benefit to using fp16 if the underlying architecture doesn't support it natively.

aspen01 commented 8 months ago

I'm testing fine-tuning on an Apple M1, and I know that it uses the GPU during fine-tuning. I tried fine-tuning on the CPU with llama.cpp, but slowllama takes less training time, so I want to try fine-tuning with slowllama.

okuvshynov commented 8 months ago

Can you still reproduce this after our fix in https://github.com/okuvshynov/slowllama/issues/16? When I tried it at the time on an Apple M1, I didn't have to convert to fp32 and back.

Nirjhor27 commented 8 months ago

Actually, I tried it yesterday on an M2 Ultra and had the same issue; I had to do the float32 conversion, and that solved it.

okuvshynov commented 8 months ago

@Nirjhor27 Interesting! Which torch version are you using? The error essentially says that fp16 operations are not implemented for CPU. On my M1/M2 devices I can do that, though:

>>> import torch
>>> torch.__version__
'2.2.1'
>>> a = torch.rand(2, 2).to(torch.float16).to('cpu')
>>> b = torch.rand(2, 2).to(torch.float16).to('cpu')
>>> a.mm(b)
tensor([[0.3838, 1.0488],
        [0.0728, 0.4006]], dtype=torch.float16)

Does this snippet work for you?

Nirjhor27 commented 8 months ago

I am using 2.1.2. And nope, running the snippet results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

I understood that fp16 is for GPU and not CPU, but I am also worried that doing the conversion as aspen01 suggested will mess up the weights when merging. I could merge after doing the float32 conversion, and the merged model appears to be working fine, but I have the same question as aspen01: whether it's actually okay or not.

okuvshynov commented 8 months ago

Interesting, maybe it has something to do with recent work in torch, e.g. https://github.com/pytorch/pytorch/commit/2240018c03744ee34ea14ad53481db934c37e384. I cannot test an older torch version right now; I'd need to downgrade Python as well.

I'll make a change to detect if the device supports fp16. Alternatively, we can run merge_lora on the mps device as well.
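
Something along these lines is what I have in mind; this is only a sketch of the fallback, not the actual change (the helper name is made up):

import torch

def mm_with_fallback(b, a):
    # try the matmul in the tensors' own dtype (e.g. fp16);
    # if this build/device doesn't implement it, redo it in fp32 and cast back
    try:
        return b.mm(a)
    except RuntimeError:
        return b.to(torch.float32).mm(a.to(torch.float32)).to(b.dtype)

lora_b = torch.randn(32, 4, dtype=torch.float16)
lora_a = torch.randn(4, 32, dtype=torch.float16)
print(mm_with_fallback(lora_b, lora_a).dtype)   # torch.float16 either way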

Nirjhor27 commented 8 months ago

Thanks, I will keep an eye out and update if I find an alternative to the float32 conversion.

okuvshynov commented 8 months ago

https://github.com/okuvshynov/slowllama/commit/f055a88bdd096bf83ee47615082e26dd25b53a77

I suspect the result might be a little different, but I'm not sure how big of a difference it will make.

Btw, @Nirjhor27, if you've used an M2 Ultra, what was the GPU utilization when you tried to finetune? Thank you!

Nirjhor27 commented 8 months ago

I haven't checked it yet (I am using a remote client). However, I plan to check it very soon when I finetune again, and I will update you on that.