tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

Finetune 13B model, train_loss always 0.0 #288

Open LiuPearl1 opened 1 year ago

LiuPearl1 commented 1 year ago

I want to run 13B model finetuning. I use the script below to run the code:

    OMP_NUM_THREADS=8 WORLD_SIZE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
        --master_port=1234 finetune.py \
        --base_model "./llama-13b-hf" \
        --data_path './alpaca_data_cleaned.json' \
        --output_dir './lora-alpaca-13b-multi-gpu' \
        --batch_size 128 \
        --micro_batch_size 8 \
        --num_epochs 10 \
        --cutoff_len 512 \
        --val_set_size 0 \
        --lora_r 16 \
        --lora_alpha 16 \
        --lora_dropout 0.05 \
        --lora_target_modules ['q_proj','k_proj','v_proj','o_proj'] \
        --group_by_length

But the training loss is always 0.0. What's the problem with my code? (screenshot of the training log attached)

AngainorDev commented 1 year ago

Apart from the --group_by_length param, it's very similar to how I run it, and your batch params seem consistent.

Loss at 0 is definitely weird.

I'd try clearing the .hf dataset cache, and, depending on your transformers version, maybe try the alternate llama model (you know, because of the recent tokenizer change, see https://github.com/tloen/alpaca-lora/issues/286#issuecomment-1499863545).
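For reference, a minimal sketch of clearing the datasets cache, assuming the default cache location (if HF_DATASETS_CACHE or HF_HOME is set, the path will differ):

    import os
    import shutil

    # Default Hugging Face datasets cache; finetune.py will re-tokenize the JSON on the next run.
    cache_dir = os.environ.get(
        "HF_DATASETS_CACHE",
        os.path.expanduser("~/.cache/huggingface/datasets"),
    )
    print("Removing", cache_dir)
    shutil.rmtree(cache_dir, ignore_errors=True)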

LiuPearl1 commented 1 year ago

@AngainorDev When I run 7B model finetuning, the training loss converges. But 13B model finetuning doesn't work.

I have checked the transformers version; it's also 4.28.0.dev. (screenshot attached)

I have also seen someone using a V100 GPU hit this problem. Some guess that the V100 can't support int8 training, but when I train the 7B model on a V100, int8 training works. Based on this, I don't think the V100 is causing the error. Do you have any other ideas?
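One thing worth checking (a suggestion, not something from the original report): the V100 is compute capability 7.0, and as far as I know bitsandbytes' int8 tensor-core path targets 7.5+ (Turing/Ampere), so older cards may hit a fallback path that has been reported to misbehave. A quick way to confirm what the machine reports:

    import torch

    # Print each visible GPU and its compute capability; a V100 reports (7, 0).
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}, compute capability {major}.{minor}")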

keelezibel commented 1 year ago

I am getting the same issue with int8 training on the 7B model: the training loss drops to 0 as well. Does it mean that for the V100 GPU card we have to fine-tune the full model (i.e. drop load_in_8bit)? Thanks!

bupticybee commented 1 year ago
(Screenshot, 2023-04-11 10:33)

Finetuning on 4x V100 GPUs, I'm having the same issue here: train loss stays at 0.0 and eval loss stays at nan, right from the start to the end.

LiuPearl1 commented 1 year ago

@keelezibel @bupticybee When I finetune the 7B model, setting lora_target_modules = ['q_proj','k_proj','v_proj','o_proj'] avoids the loss staying at 0.0. But when I finetune the 13B model, the same lora_target_modules setting doesn't work.
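For context, --lora_target_modules ends up in peft's LoraConfig. A minimal sketch of the equivalent config (values taken from the command at the top of this issue; this is not the full finetune.py):

    from peft import LoraConfig, get_peft_model

    # Sketch: how the CLI flags above map onto a peft LoRA config.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    # model = get_peft_model(model, lora_config)  # `model` is the loaded LLaMA base model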

bupticybee commented 1 year ago

@keelezibel @bupticybee When I finetune the 7B model, setting lora_target_modules = ['q_proj','k_proj','v_proj','o_proj'] avoids the loss staying at 0.0. But when I finetune the 13B model, the same lora_target_modules setting doesn't work.

I wonder how things would work when finetuning a 13B model.

LiuPearl1 commented 1 year ago

@bupticybee I think you should fine-tune the full model, but the V100 will OOM.
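A rough back-of-the-envelope for why a full 13B fine-tune won't fit on a 32 GB V100 (very approximate, ignoring activations and framework overhead):

    # Rough memory estimate for a 13B-parameter model.
    params = 13e9

    fp16_weights_gb = params * 2 / 1e9   # ~26 GB just to hold the weights in fp16
    # Full fine-tuning with AdamW in mixed precision needs roughly
    # 2 (weights) + 2 (grads) + 12 (fp32 master weights + Adam m/v) = ~16 bytes per parameter.
    full_ft_gb = params * 16 / 1e9       # ~208 GB

    print(f"fp16 weights alone: ~{fp16_weights_gb:.0f} GB")
    print(f"full fine-tune (AdamW): ~{full_ft_gb:.0f} GB")

With LoRA the base weights stay frozen, so only the ~26 GB of fp16 weights plus the small adapter and its optimizer state are needed, which is why it can fit across four 32 GB cards.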

bupticybee commented 1 year ago

@bupticybee I think you should fine-tune the full model, but the V100 will OOM.

Yes, that's kind of the reason I tried to use LORA in the first place.

bupticybee commented 1 year ago

@bupticybee I think you should fine-tune the full model, but the V100 will OOM.

So is it a V100 problem? Correct me if I'm wrong:

The V100 doesn't support int8 training, so 13B + LoRA would fail, but somehow it's able to train the 7B + LoRA model. And a 4090 can train the 13B model as well?

I'm really confused here. Some help would be great~

keelezibel commented 1 year ago

I am starting another fine-tuning cycle with the 7B full model and it seems to work. At least the training loss didn't go to zero. But I have to reduce the micro batch size, otherwise it will OOM. Not sure if I can fine-tune the 13B model this way.

I will update again after I am done fine-tuning and testing out the final model.

bupticybee commented 1 year ago

I use the following script:

OMP_NUM_THREADS=8 WORLD_SIZE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    --master_port=9967 finetune.py \
    --base_model '../llama-13b/' \
    --data_path 'alpaca_data.json' \
    --output_dir './lora-alpaca_1' \
    --lora_target_modules ['q_proj','k_proj','v_proj','o_proj']

to run on 4x V100 GPUs. I removed the following line in finetune.py:

model = prepare_model_for_int8_training(model)

I keep load_in_8bit=True.

Actually I have no idea what I'm doing.

but the loss now acts strangely:

{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.03}                                                                                      
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}                                                                                      
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.08}                                                                                      
  3%|██▋                                                                                            | 33/1170 [05:29<3:09:20,  9.99s/it]
{'loss': 0.262, 'learning_rate': 0.0, 'epoch': 0.1}                                                                                     
{'loss': 0.5197, 'learning_rate': 0.0, 'epoch': 0.13}                                                                                   
{'loss': 0.5263, 'learning_rate': 0.0, 'epoch': 0.15}                                                                                   
{'loss': 0.5203, 'learning_rate': 0.0, 'epoch': 0.18}                                                                                   
{'loss': 0.5364, 'learning_rate': 0.0, 'epoch': 0.2}                                                                                    
{'loss': 0.5147, 'learning_rate': 0.0, 'epoch': 0.23}                                                                                   
{'loss': 0.5437, 'learning_rate': 0.0, 'epoch': 0.26}                                                                                   
{'loss': 0.5367, 'learning_rate': 0.0, 'epoch': 0.28}                                                                                   
  9%|████████▉                                                                                     | 111/1170 [18:23<2:56:28, 10.00s/it]

It kind of gets a nonzero value after some time.

But the eval loss would still be nan, so the trick doesn't work.
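For what it's worth, my understanding is that prepare_model_for_int8_training mostly freezes the base weights, keeps the 1-D parameters (layer norms) in fp32, and enables gradient checkpointing; skipping it while still loading in 8-bit means those stabilizing steps never happen, which may be related to the nan eval loss. A simplified sketch of what the helper roughly does (not the exact peft source):

    import torch

    def rough_prepare_for_int8(model):
        # Simplified sketch of what peft's prepare_model_for_int8_training roughly does.
        for param in model.parameters():
            param.requires_grad = False                    # freeze the int8 base weights
            if param.ndim == 1:
                param.data = param.data.to(torch.float32)  # keep layer norms in fp32 for stability
        model.gradient_checkpointing_enable()              # trade compute for memory
        model.enable_input_require_grads()                 # let LoRA grads flow with checkpointing
        return model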

bupticybee commented 1 year ago

I am starting another fine-tuning cycle with the 7B full model and it seems to work. At least the training loss didn't go to zero. But I have to reduce the micro batch size, otherwise it will OOM. Not sure if I can fine-tune the 13B model this way.

I will update again after I am done fine-tuning and testing out the final model.

Hope you get it right on 13b, same issue.

bupticybee commented 1 year ago

I finally got it working on 4 x V100. I removed the following line from finetune.py:

model = prepare_model_for_int8_training(model)

and set load_in_8bit=False.

I do not use torchrun, which causes OOM; instead I use the following command:

    python3 finetune.py \
    --base_model '../llama-13b/' \
    --data_path 'alpaca_data.json' \
    --output_dir './lora-alpaca_1' \
    --lora_target_modules ['q_proj','k_proj','v_proj','o_proj']    

It runs normally:

{'loss': 2.0273, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.03}                                                                
{'loss': 1.8993, 'learning_rate': 5.9999999999999995e-05, 'epoch': 0.05}                                                                
{'loss': 1.5317, 'learning_rate': 8.999999999999999e-05, 'epoch': 0.08}                                                                 
{'loss': 1.0961, 'learning_rate': 0.00011999999999999999, 'epoch': 0.1}                                                                 
{'loss': 0.9039, 'learning_rate': 0.00015, 'epoch': 0.13}                                                                               
{'loss': 0.8681, 'learning_rate': 0.00017999999999999998, 'epoch': 0.15}                                                                
{'loss': 0.8754, 'learning_rate': 0.00020999999999999998, 'epoch': 0.18}                                                                
{'loss': 0.8518, 'learning_rate': 0.00023999999999999998, 'epoch': 0.2}                                                                 
{'loss': 0.8493, 'learning_rate': 0.00027, 'epoch': 0.23}                                                                               
{'loss': 0.8473, 'learning_rate': 0.0003, 'epoch': 0.26}                                                                                
{'loss': 0.8084, 'learning_rate': 0.00029719626168224294, 'epoch': 0.28}                                                                
{'loss': 0.8277, 'learning_rate': 0.00029439252336448596, 'epoch': 0.31}                                                                
{'loss': 0.8279, 'learning_rate': 0.0002915887850467289, 'epoch': 0.33}                                                                 
{'loss': 0.8086, 'learning_rate': 0.00028878504672897194, 'epoch': 0.36}                                                                
{'loss': 0.8407, 'learning_rate': 0.0002859813084112149, 'epoch': 0.38}                                                                 
{'loss': 0.8347, 'learning_rate': 0.0002831775700934579, 'epoch': 0.41}                                                                 
{'loss': 0.8286, 'learning_rate': 0.0002803738317757009, 'epoch': 0.44}                                                                 
{'loss': 0.8092, 'learning_rate': 0.0002775700934579439, 'epoch': 0.46}                                                                 
{'loss': 0.8024, 'learning_rate': 0.00027476635514018686, 'epoch': 0.49}                                                                
{'loss': 0.8366, 'learning_rate': 0.0002719626168224299, 'epoch': 0.51}                                                                 
{'eval_loss': 0.8226112127304077, 'eval_runtime': 170.546, 'eval_samples_per_second': 11.727, 'eval_steps_per_second': 1.466, 'epoch': 0.51}
{'loss': 0.8278, 'learning_rate': 0.00026915887850467284, 'epoch': 0.54}                                                                
{'loss': 0.8237, 'learning_rate': 0.00026635514018691586, 'epoch': 0.56}                                                                
{'loss': 0.8114, 'learning_rate': 0.0002635514018691588, 'epoch': 0.59}                                                                 
{'loss': 0.8224, 'learning_rate': 0.00026074766355140184, 'epoch': 0.61}                                                                
{'loss': 0.8139, 'learning_rate': 0.0002579439252336448, 'epoch': 0.64}                                                                 
{'loss': 0.8273, 'learning_rate': 0.0002551401869158878, 'epoch': 0.67}                                                                 
{'loss': 0.8185, 'learning_rate': 0.0002523364485981308, 'epoch': 0.69}      

GPU utilization is not looking good; most of the time it only really uses one GPU:

alpaca-lora $ nvidia-smi
Tue Apr 11 14:12:03 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   54C    P0   283W / 300W |  11052MiB / 32510MiB |     43%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1B:00.0 Off |                    0 |
| N/A   54C    P0    80W / 300W |  12736MiB / 32510MiB |     46%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3D:00.0 Off |                    0 |
| N/A   51C    P0    78W / 300W |  12736MiB / 32510MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   57C    P0    82W / 300W |  11828MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

but at a cost: without torchrun and int8, the training process takes about 7h30min, which is not ideal. But at least the training and val loss don't drop to zero.
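If it helps anyone reproduce this, the working setup above amounts to loading the base model in fp16 (no 8-bit), letting accelerate shard it across the GPUs, and training only the LoRA adapter. A minimal sketch assuming the standard transformers/peft APIs, with the LoRA values being the repo defaults as I recall them, not verified against finetune.py:

    import torch
    from transformers import LlamaForCausalLM, LlamaTokenizer
    from peft import LoraConfig, get_peft_model

    base = "../llama-13b/"  # local path used in the command above

    # Frozen fp16 base model, sharded across all visible GPUs (naive model parallelism).
    model = LlamaForCausalLM.from_pretrained(
        base,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = LlamaTokenizer.from_pretrained(base)

    model = get_peft_model(
        model,
        LoraConfig(
            r=8, lora_alpha=16, lora_dropout=0.05,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
            task_type="CAUSAL_LM",
        ),
    )
    model.print_trainable_parameters()

The device_map="auto" split also explains the nvidia-smi output: the layers live on different cards, so only the card holding the layers currently executing shows high utilization.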

LiuPearl1 commented 1 year ago

I finally got it working on 4 x V100. I removed the following line from finetune.py: model = prepare_model_for_int8_training(model), set load_in_8bit=False, and used plain python3 finetune.py instead of torchrun. [...]

@bupticybee I used your method on 4 V100s, and it causes an error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

How did you manage to run this script: python3 finetune.py --base_model '../llama-13b/' --data_path 'alpaca_data.json' --output_dir './lora-alpaca_1' --lora_target_modules ['q_proj','k_proj','v_proj','o_proj']
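One way to debug that error (just a suggestion): when the model is loaded with device_map="auto", accelerate records where each module landed in model.hf_device_map. If any entry says "cpu" or "disk", part of the model was offloaded because it didn't fit on the GPUs, and matmuls between GPU and CPU tensors can fail exactly like this.

    # `model` is the LlamaForCausalLM loaded with device_map="auto", as in the earlier sketch.
    from collections import Counter

    print(Counter(model.hf_device_map.values()))   # e.g. Counter({0: 11, 1: 10, 2: 10, 3: 10})
    for name, device in model.hf_device_map.items():
        if device in ("cpu", "disk"):
            print("offloaded:", name, "->", device)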

bupticybee commented 1 year ago

@bupticybee I used your method on 4 V100s, and it causes an error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm). How did you manage to run this script? [...]

I committed everything I modified here: https://github.com/bupticybee/dark-lora. I really don't know what I did and whether it works. I haven't had time to check the output model yet.

Avistian commented 1 year ago

@bupticybee is the gradio app working for you with the finetuned weights? For me, when I used the finetuned model, the gradio app was stuck and I was getting timeouts when trying to reach the application.

bupticybee commented 1 year ago

@bupticybee is the gradio app working for you with the finetuned weights? For me, when I used the finetuned model, the gradio app was stuck and I was getting timeouts when trying to reach the application.

never used gradio, not sure.

HillZhang1999 commented 1 year ago

@bupticybee I think you should fine-tune the full model, but the V100 will OOM.

So is it a V100 problem? Correct me if I'm wrong: the V100 doesn't support int8 training, so 13B + LoRA would fail, but somehow it's able to train the 7B + LoRA model. And a 4090 can train the 13B model as well? I'm really confused here. Some help would be great~

I also ran into the same problem, and I am curious why int8 works for 7B but fails for 13B. Does anyone know the reason?

bupticybee commented 1 year ago

I also ran into the same problem, and I am curious why int8 works for 7B but fails for 13B. Does anyone know the reason?

I solved it by not using int8 training, but this slows down the training process a lot.

HillZhang1999 commented 1 year ago

I solved it by not using int8 training, but this slows down the training process a lot.

I found that using micro_batch_size=1 with torchrun achieves acceptable speed. Directly using python finetune.py leads to DataParallel instead of DistributedDataParallel in PyTorch, which causes extremely slow speed and unbalanced GPU memory.
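For anyone else stuck on the slow single-process path: as I understand it, finetune.py decides between the two modes roughly like this (a paraphrase, not the exact source). torchrun sets WORLD_SIZE, so each process gets a full replica on its own GPU and trains with DDP; a plain python finetune.py run sees WORLD_SIZE=1 and shards one replica across all visible GPUs instead.

    import os

    # Rough paraphrase of the launch logic in finetune.py.
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    ddp = world_size != 1

    if ddp:
        # One full model replica pinned to this process's GPU (DistributedDataParallel).
        device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}
    else:
        # Single process: shard one replica across all GPUs (slow, unevenly utilized).
        device_map = "auto"

    print(f"WORLD_SIZE={world_size}, ddp={ddp}, device_map={device_map}")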

xingenju commented 1 year ago

I solved it by not using int8 training, but this slows down the training process a lot.

Which V100 do you use? I tried a V100 16G; without int8 it goes OOM even with a micro_batch_size of just 1.

bupticybee commented 1 year ago

I used 4x V100 (the 32G version, as shown in the nvidia-smi output above).

lyccyl1 commented 10 months ago

This may be due to hardware reasons. On some hardware, the quantized model is not compatible with fp16. You can try setting fp16=False.
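If you want to try that, fp16 is just a flag on the HF Trainer's arguments; a minimal sketch (the surrounding values are illustrative placeholders, not the repo's exact settings):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./lora-alpaca-13b",      # placeholder
        per_device_train_batch_size=8,       # placeholder
        gradient_accumulation_steps=16,      # placeholder
        num_train_epochs=10,                 # placeholder
        learning_rate=3e-4,                  # placeholder
        fp16=False,  # try disabling mixed precision if the quantized model gives 0 loss / nan
        logging_steps=10,
    )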