Open LiuPearl1 opened 1 year ago
Apart from the --group_by_length param, it's very similar to how I run it, and your batch params seem consistent.
Loss at 0 definitely is weird.
I'd try clearing the .hf dataset cache and, depending on your transformers version, trying the alternate llama model (because of the recent tokenizer change, see https://github.com/tloen/alpaca-lora/issues/286#issuecomment-1499863545 )
@AngainorDev When I finetune the 7B model, the training loss converges. But finetuning the 13B model doesn't work.
I have checked the transformers version; it's also 4.28.0.dev.
I have also seen someone with a V100 GPU hit this problem. Some guess that the V100 can't support int8 training, but when I train the 7B model on a V100, int8 training works. Based on this, I don't think the V100 is causing the error. Do you have any other ideas?
I am getting the same issue with int8 training on the 7B model: the training loss drops to 0 as well. Does it mean that for a V100 GPU we have to fine-tune the full model (i.e. drop load_in_8bit)? Thanks!
Finetuning on 4x V100 GPUs, I'm having the same issue here: train loss stays at 0.0 and eval loss stays at nan, from start to end.
@keelezibel @bupticybee When I finetune the 7B model, setting lora_target_modules = ['q_proj','k_proj','v_proj','o_proj'] avoids the loss being stuck at 0.0. But when I finetune the 13B model, the same setting doesn't help.
I wonder how things would work when finetuning a 13B model.
@bupticybee I think you should fine-tune the full model, but the V100 will run out of memory (OOM).
Yes, that's kind of the reason I tried to use LORA in the first place.
So is it a V100 problem? Please correct me if I'm wrong:
The V100 doesn't support int8 training, so the 13B + LoRA model fails, but somehow it's able to train the 7B + LoRA model. And a 4090 can train the 13B model as well?
I'm really confused here. Some help would be great~
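One way to narrow down the V100 question is to look at CUDA compute capability. As far as I know, the V100 (Volta) reports 7.0, int8 tensor cores arrived with Turing (7.5), and bitsandbytes lists Turing or newer for LLM.int8(). A minimal sketch, assuming 7.5 is the relevant threshold; both helper names are hypothetical, and this does not explain why 7B works while 13B fails:

```python
def supports_int8_tensor_cores(capability):
    """True if a (major, minor) compute capability is Turing (7.5) or newer.

    The 7.5 threshold is an assumption based on when int8 tensor cores
    were introduced; it is not taken from finetune.py.
    """
    return tuple(capability) >= (7, 5)


def local_gpu_capabilities():
    """Return the compute capability of each visible GPU, or [] if none."""
    try:
        import torch  # imported lazily so the check above works without torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    for index, cap in enumerate(local_gpu_capabilities()):
        print(index, cap, supports_int8_tensor_cores(cap))
```

On a V100 box this should report False for every card, which would fit the guess that int8 runs on a fallback path there.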
I am starting another fine-tuning cycle with the full 7B model and it seems to work. At least the training loss didn't go to zero. But I have to reduce the micro batch size, otherwise it OOMs. Not sure if I can fine-tune the 13B model this way.
Will update again after I am done fine-tuning and testing the final model.
I use the following script:
OMP_NUM_THREADS=8 WORLD_SIZE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
--master_port=9967 finetune.py \
--base_model '../llama-13b/' \
--data_path 'alpaca_data.json' \
--output_dir './lora-alpaca_1' \
--lora_target_modules ['q_proj','k_proj','v_proj','o_proj']
To run on 4x V100 GPUs, I removed the following line from finetune.py:
model = prepare_model_for_int8_training(model)
I kept load_in_8bit=True.
Actually I have no idea what I'm doing.
But the loss now acts strangely:
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.08}
3%|██▋ | 33/1170 [05:29<3:09:20, 9.99s/it]
{'loss': 0.262, 'learning_rate': 0.0, 'epoch': 0.1}
{'loss': 0.5197, 'learning_rate': 0.0, 'epoch': 0.13}
{'loss': 0.5263, 'learning_rate': 0.0, 'epoch': 0.15}
{'loss': 0.5203, 'learning_rate': 0.0, 'epoch': 0.18}
{'loss': 0.5364, 'learning_rate': 0.0, 'epoch': 0.2}
{'loss': 0.5147, 'learning_rate': 0.0, 'epoch': 0.23}
{'loss': 0.5437, 'learning_rate': 0.0, 'epoch': 0.26}
{'loss': 0.5367, 'learning_rate': 0.0, 'epoch': 0.28}
9%|████████▉ | 111/1170 [18:23<2:56:28, 10.00s/it]
It does get a nonzero value after some time.
But eval loss is still nan, so the trick doesn't work.
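To make the symptoms in logs like the ones above easier to spot, here is a small sketch that scans already-parsed log records for a loss of exactly 0.0, a nan loss, or a learning rate that is still 0.0 after the first log. The function name is hypothetical. The learning-rate check rests on an assumption: with fp16, a scheduler that never advances may mean optimizer steps are being skipped because of overflow.

```python
import math


def flag_suspect_steps(records):
    """Return (index, reason) pairs for log records that look degenerate."""
    suspects = []
    for i, rec in enumerate(records):
        for key in ("loss", "eval_loss"):
            value = rec.get(key)
            if value is None:
                continue
            if math.isnan(value):
                suspects.append((i, f"{key} is nan"))
            elif value == 0.0:
                suspects.append((i, f"{key} is exactly 0.0"))
        # Heuristic (assumption): warmup should have moved the LR by now.
        if i > 0 and rec.get("learning_rate") == 0.0:
            suspects.append((i, "learning_rate still 0.0 after the first log"))
    return suspects


records = [
    {"loss": 0.0, "learning_rate": 0.0, "epoch": 0.03},
    {"loss": 0.262, "learning_rate": 0.0, "epoch": 0.1},
    {"eval_loss": float("nan"), "epoch": 0.5},
]
for index, reason in flag_suspect_steps(records):
    print(index, reason)
```

In the run above the learning rate never leaves 0.0 even once the loss becomes nonzero, so the nonzero loss values may not reflect any actual weight updates.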
Hope you get it right on 13b, same issue.
I finally got it working on 4x V100. I removed the following line from finetune.py:
model = prepare_model_for_int8_training(model)
and set load_in_8bit=False.
I do not use torchrun, which causes OOM; instead I use the following command:
python3 finetune.py \
--base_model '../llama-13b/' \
--data_path 'alpaca_data.json' \
--output_dir './lora-alpaca_1' \
--lora_target_modules ['q_proj','k_proj','v_proj','o_proj']
It runs normally:
{'loss': 2.0273, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.03}
{'loss': 1.8993, 'learning_rate': 5.9999999999999995e-05, 'epoch': 0.05}
{'loss': 1.5317, 'learning_rate': 8.999999999999999e-05, 'epoch': 0.08}
{'loss': 1.0961, 'learning_rate': 0.00011999999999999999, 'epoch': 0.1}
{'loss': 0.9039, 'learning_rate': 0.00015, 'epoch': 0.13}
{'loss': 0.8681, 'learning_rate': 0.00017999999999999998, 'epoch': 0.15}
{'loss': 0.8754, 'learning_rate': 0.00020999999999999998, 'epoch': 0.18}
{'loss': 0.8518, 'learning_rate': 0.00023999999999999998, 'epoch': 0.2}
{'loss': 0.8493, 'learning_rate': 0.00027, 'epoch': 0.23}
{'loss': 0.8473, 'learning_rate': 0.0003, 'epoch': 0.26}
{'loss': 0.8084, 'learning_rate': 0.00029719626168224294, 'epoch': 0.28}
{'loss': 0.8277, 'learning_rate': 0.00029439252336448596, 'epoch': 0.31}
{'loss': 0.8279, 'learning_rate': 0.0002915887850467289, 'epoch': 0.33}
{'loss': 0.8086, 'learning_rate': 0.00028878504672897194, 'epoch': 0.36}
{'loss': 0.8407, 'learning_rate': 0.0002859813084112149, 'epoch': 0.38}
{'loss': 0.8347, 'learning_rate': 0.0002831775700934579, 'epoch': 0.41}
{'loss': 0.8286, 'learning_rate': 0.0002803738317757009, 'epoch': 0.44}
{'loss': 0.8092, 'learning_rate': 0.0002775700934579439, 'epoch': 0.46}
{'loss': 0.8024, 'learning_rate': 0.00027476635514018686, 'epoch': 0.49}
{'loss': 0.8366, 'learning_rate': 0.0002719626168224299, 'epoch': 0.51}
{'eval_loss': 0.8226112127304077, 'eval_runtime': 170.546, 'eval_samples_per_second': 11.727, 'eval_steps_per_second': 1.466, 'epoch': 0.51}
{'loss': 0.8278, 'learning_rate': 0.00026915887850467284, 'epoch': 0.54}
{'loss': 0.8237, 'learning_rate': 0.00026635514018691586, 'epoch': 0.56}
{'loss': 0.8114, 'learning_rate': 0.0002635514018691588, 'epoch': 0.59}
{'loss': 0.8224, 'learning_rate': 0.00026074766355140184, 'epoch': 0.61}
{'loss': 0.8139, 'learning_rate': 0.0002579439252336448, 'epoch': 0.64}
{'loss': 0.8273, 'learning_rate': 0.0002551401869158878, 'epoch': 0.67}
{'loss': 0.8185, 'learning_rate': 0.0002523364485981308, 'epoch': 0.69}
GPU utilization doesn't look good; most of the time only about one GPU is in use:
alpaca-lora $ nvidia-smi
Tue Apr 11 14:12:03 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00 Driver Version: 450.142.00 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:1A:00.0 Off | 0 |
| N/A 54C P0 283W / 300W | 11052MiB / 32510MiB | 43% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:1B:00.0 Off | 0 |
| N/A 54C P0 80W / 300W | 12736MiB / 32510MiB | 46% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:3D:00.0 Off | 0 |
| N/A 51C P0 78W / 300W | 12736MiB / 32510MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:3E:00.0 Off | 0 |
| N/A 57C P0 82W / 300W | 11828MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
But this comes at a cost: without torchrun and int8, training takes 7h30min, which is not ideal. At least training and val loss don't drop to zero.
@bupticybee I used your method on 4 V100s, and it causes this error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
How did you run this script?
python3 finetune.py \
--base_model '../llama-13b/' \
--data_path 'alpaca_data.json' \
--output_dir './lora-alpaca_1' \
--lora_target_modules ['q_proj','k_proj','v_proj','o_proj']
I committed everything I modified here: https://github.com/bupticybee/dark-lora . I really don't know what I did or whether it works. I haven't had time to check the output model yet.
@bupticybee Does the gradio app work for you with the finetuned weights? For me, with the finetuned model, the gradio app got stuck and I was getting timeouts when I tried to reach the application.
never used gradio, not sure.
I ran into the same problem, and I am curious why int8 works for 7B but fails for 13B. Does anyone know the reason?
I solved it by not using int8 training, but this slows down the training process a lot.
I found that using micro-bsz=1 and torchrun can achieve acceptable speed. Directly using python finetune.py will lead to DataParallel instead of DistributedDataParallel in PyTorch, which will cause extremely slow speed and unbalanced GPU memory.
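For anyone wondering why the launcher matters so much, here is a sketch of how a script can derive its mode from the environment that torchrun sets. It assumes finetune.py follows the common pattern of reading WORLD_SIZE and LOCAL_RANK; plan_launch is a hypothetical name, and the batch numbers are only defaults for illustration:

```python
def plan_launch(env, batch_size=128, micro_batch_size=8):
    """Derive DDP mode, device placement and accumulation steps from env vars."""
    world_size = int(env.get("WORLD_SIZE", 1))
    ddp = world_size != 1
    grad_accum = batch_size // micro_batch_size
    if ddp:
        # One process per GPU: each process places the whole model on its own card.
        device_map = {"": int(env.get("LOCAL_RANK") or 0)}
        # The global batch is split across processes.
        grad_accum = grad_accum // world_size
    else:
        # Single process: let the loader spread layers over the visible GPUs.
        device_map = "auto"
    return {
        "ddp": ddp,
        "device_map": device_map,
        "gradient_accumulation_steps": grad_accum,
    }


print(plan_launch({}))
print(plan_launch({"WORLD_SIZE": "4", "LOCAL_RANK": "2"}, micro_batch_size=1))
```

If that assumption holds, a plain python launch leaves a single process driving all four cards, which would fit the uneven utilization reported earlier in the thread, while torchrun gives each card its own process and its own full copy of the model.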
Which V100 do you use? I tried a V100 16G; without int8 it goes OOM even with micro_batch_size set to 1.
I used 4 v100
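The OOM reports above are easier to reason about with a rough estimate of weight memory alone. This sketch uses nominal parameter counts (7e9 and 13e9, which are approximations) and ignores activations, LoRA weights, gradients and optimizer state, so real usage will be higher:

```python
GIB = 1024 ** 3


def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed just to hold the base model weights, in GiB."""
    return n_params * bytes_per_param / GIB


for name, n_params in (("7B", 7e9), ("13B", 13e9)):
    for dtype, width in (("int8", 1), ("fp16", 2)):
        print(f"{name} {dtype}: {weight_memory_gib(n_params, width):.1f} GiB")
```

Under these assumptions the fp16 13B weights alone come to about 24 GiB. That cannot fit on a 16 GB V100, and it leaves little headroom on a 32 GB card if every torchrun process holds a full copy.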
This may be due to hardware reasons. On some hardware, the quantized model is not compatible with fp16. You can try setting fp16=False.
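A cheap way to test this suggestion is to run two short jobs that differ only in the fp16 flag and compare the first few logged losses. A sketch, assuming the training script builds a transformers.TrainingArguments from keyword arguments; the base values are illustrative, not taken from anyone's run, and precision_variants is a hypothetical helper:

```python
# Illustrative settings for a short smoke test (assumed values).
BASE_KWARGS = {
    "per_device_train_batch_size": 1,
    "logging_steps": 10,
    "max_steps": 50,
}


def precision_variants(base):
    """Return two kwarg sets that differ only in the fp16 flag."""
    return {
        "fp16_on": {**base, "fp16": True},
        "fp16_off": {**base, "fp16": False},
    }


variants = precision_variants(BASE_KWARGS)
for label, kwargs in variants.items():
    print(label, kwargs)
```

If the loss is sane with fp16 off and stuck at 0.0 with it on, that would point at the mixed-precision path rather than the data or the LoRA config.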
I want to run 13B model finetuning. I use the script below to run the code:
OMP_NUM_THREADS=8 WORLD_SIZE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
--master_port=1234 finetune.py \
--base_model "./llama-13b-hf" \
--data_path './alpaca_data_cleaned.json' \
--output_dir './lora-alpaca-13b-multi-gpu' \
--batch_size 128 \
--micro_batch_size 8 \
--num_epochs 10 \
--cutoff_len 512 \
--val_set_size 0 \
--lora_r 16 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--lora_target_modules ['q_proj','k_proj','v_proj','o_proj'] \
--group_by_length
But the training loss is always 0.0. What's the problem with my code?