Open aspen01 opened 8 months ago
Oh interesting, thank you. Let me take a look
What types were you using for finetuning? I think you'll need to double-check that merged weights produce the same outputs as non-merged (https://github.com/okuvshynov/slowllama?tab=readme-ov-file#merging-lora-weights-back).
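A quick sanity check is to compare one layer's output with LoRA applied separately vs. with the weights merged back; below is a minimal plain-torch sketch of the idea (the shapes and scaling factor are made up, it doesn't use slowllama's actual code):

import torch

# Single-layer sketch of the merged vs. non-merged check.
# Shapes, scale and init are illustrative only.
torch.manual_seed(0)
d_in, d_out, r = 16, 16, 4
scale = 1.0  # typically lora_alpha / r

W = torch.randn(d_out, d_in)       # frozen base weight
A = torch.randn(r, d_in) * 0.01    # LoRA down-projection
B = torch.randn(d_out, r) * 0.01   # LoRA up-projection
x = torch.randn(8, d_in)           # batch of activations

y_unmerged = x @ W.T + scale * ((x @ A.T) @ B.T)  # base + LoRA applied separately
W_merged = W + scale * (B @ A)                    # LoRA folded into the base weight
y_merged = x @ W_merged.T

print((y_unmerged - y_merged).abs().max())  # should be ~0 up to float rounding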
I used the types from conf_fp16.py:
adamw_eps = 1e-4
compute_dtype = torch.float16
frozen_dtype = torch.float16
Got it. I think I'll need to try it myself to double-check (we transform weights fp16->fp32->bf16), but if the merged model produces reasonable output it should be ok.
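For reference, a rough illustration of what that round trip costs in precision (arbitrary tensor size, not how slowllama actually stores the weights):

import torch

# fp16 -> fp32 -> bf16 round trip on random values; bf16 has fewer mantissa
# bits than fp16, so a small but nonzero rounding error is expected.
w_fp16 = torch.randn(1024, 1024).to(torch.float16)
w_roundtrip = w_fp16.to(torch.float32).to(torch.bfloat16)

err = (w_fp16.to(torch.float32) - w_roundtrip.to(torch.float32)).abs()
print(err.max().item(), err.mean().item())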
Sometimes the merged model produces the expected results. But I don't know whether the unexpected results are due to the merged weights or insufficient fine-tuning.
As I understand it, you are doing the finetuning on CPU? I'm not sure there's any benefit to using fp16 if the underlying architecture doesn't support it natively.
I'm testing fine-tuning on an Apple M1, and I know it uses the GPU during fine-tuning. I tried fine-tuning on CPU with llama.cpp, but slowllama takes less training time, so I want to try fine-tuning with slowllama.
Can you still reproduce this after our fix in https://github.com/okuvshynov/slowllama/issues/16? When I tried it at the time on an Apple M1, I didn't have to convert to fp32 and back.
Actually, I tried it yesterday on an M2 Ultra and had the same issue; I had to do the float32 conversion, and that solved it.
@Nirjhor27 Interesting! Which torch version are you using? The error essentially says that fp16 operations are not implemented for CPU. On my M1/M2 devices I can do this, though:
>>> import torch
>>> torch.__version__
'2.2.1'
>>> a = torch.rand(2, 2).to(torch.float16).to('cpu')
>>> b = torch.rand(2, 2).to(torch.float16).to('cpu')
>>> a.mm(b)
tensor([[0.3838, 1.0488],
        [0.0728, 0.4006]], dtype=torch.float16)
Does this snippet work for you?
I am using 2.1.2.
And nope, running the snippet results in:
Traceback (most recent call last):
File "
I understood that fp16 is for GPU and not CPU, but I am also worried that doing the conversion as Aspen suggested will mess up the weights when merging. I could merge after doing the float32 conversion, and the merged model appears to work fine, but I have the same question as Aspen: is it actually okay or not?
Interesting, maybe it has something to do with recent work in torch, e.g. https://github.com/pytorch/pytorch/commit/2240018c03744ee34ea14ad53481db934c37e384. I cannot test an older torch version right now; I'd need to downgrade Python as well.
I'll make a change to detect whether the device supports fp16. Alternatively, we could run merge_lora on the mps device as well.
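Something along these lines (a hypothetical helper, not the actual change):

import torch

# Hypothetical check: try a tiny fp16 matmul on the target device and fall
# back to float32 if it raises (e.g. "not implemented for 'Half'" on CPU).
def supports_fp16_matmul(device):
    try:
        a = torch.ones(2, 2, dtype=torch.float16, device=device)
        _ = a @ a
        return True
    except RuntimeError:
        return False

merge_dtype = torch.float16 if supports_fp16_matmul('cpu') else torch.float32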
Thanks, I will keep an eye out and update if I find an alternative to the float32 conversion.
https://github.com/okuvshynov/slowllama/commit/f055a88bdd096bf83ee47615082e26dd25b53a77
I suspect the result might be a little different, but I'm not sure how big of a difference it will make.
Btw, @Nirjhor27 - since you've used an M2 Ultra, what was the GPU utilization when you tried to finetune? Thank you!
I haven't checked it yet (I am using a remote client); however, I plan to check very soon when I finetune again, and I will update you then.
In order to merge the LoRA checkpoint for the llama 2 7B model, I ran
python merge_lora.py
but an error occurred. So I modified the code as below and got the merged model file.
But I wonder whether it's okay or not. Can you give your opinion or the right solution?
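The exact edit isn't pasted here; the idea was the float32 conversion discussed in this thread, roughly like the sketch below (made-up variable names, not slowllama's actual code):

import torch

# Sketch of the workaround: cast the fp16 tensors to float32 so the matmul
# runs on CPU, then cast the merged weight back to its original dtype.
def merge_lora_weight(base_weight, lora_a, lora_b, scale):
    update = (lora_b.to(torch.float32) @ lora_a.to(torch.float32)) * scale
    merged = base_weight.to(torch.float32) + update
    return merged.to(base_weight.dtype)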