ssbuild / chatglm_finetuning

chatglm 6b finetuning and alpaca finetuning

Bug: with LoRA enabled and DeepSpeed disabled, multi-GPU training is turned on by default, and after training for a while loss=nan #104

Closed wtypty2 closed 1 year ago

wtypty2 commented 1 year ago

The output is as follows:

Epoch 04383: adjusting learning rate of group 0 to 9.9171e-06.
Epoch 04383: adjusting learning rate of group 1 to 9.9171e-06.
Epoch 1:   1%| | 23/4360 [00:32<1:43:33, 1.43s/it, loss=nan, v_num=0]
Epoch 04384: adjusting learning rate of group 0 to 9.9135e-06.
Epoch 04384: adjusting learning rate of group 1 to 9.9135e-06.
Epoch 1:   1%| | 24/4360 [00:35<1:46:07, 1.47s/it, loss=nan, v_num=0]
Epoch 04385: adjusting learning rate of group 0 to 9.9099e-06.
Epoch 04385: adjusting learning rate of group 1 to 9.9099e-06.
Epoch 1:   1%| | 25/4360 [00:36<1:45:42, 1.46s/it, loss=nan, v_num=0]

(Each learning-rate message is printed two to three times, once per training process.)

wtypty2 commented 1 year ago

I have already tested different numbers of trainable layers and different schedulers; the problem appears in every case. The loss suddenly drops from around 9.x to nan (a generic NaN probe that could help pin this down is sketched after the log excerpt):

Epoch 0: 87%|████████▋ | 3792/4360 [1:36:52<14:30, 1.53s/it, loss=nan, v_num=0]
Epoch 0: 87%|████████▋ | 3793/4360 [1:36:54<14:29, 1.53s/it, loss=nan, v_num=0]
Epoch 0: 87%|████████▋ | 3795/4360 [1:36:57<14:26, 1.53s/it, loss=9.13, v_num=0]
Epoch 0: 87%|████████▋ | 3796/4360 [1:36:59<14:24, 1.53s/it, loss=9.12, v_num=0]
Epoch 0: 87%|████████▋ | 3797/4360 [1:37:00<14:23, 1.53s/it, loss=9.07, v_num=0]
Epoch 0: 87%|████████▋ | 3798/4360 [1:37:01<14:21, 1.53s/it, loss=9.07, v_num=0]
Epoch 0: 87%|████████▋ | 3799/4360 [1:37:03<14:19, 1.53s/it, loss=9.09, v_num=0]
Epoch 0: 87%|████████▋ | 3800/4360 [1:37:04<14:18, 1.53s/it, loss=9.12, v_num=0]
Epoch 0: 87%|████████▋ | 3801/4360 [1:37:06<14:16, 1.53s/it, loss=9.12, v_num=0]
Epoch 0: 87%|████████▋ | 3802/4360 [1:37:08<14:15, 1.53s/it, loss=9.04, v_num=0]
Epoch 0: 87%|████████▋ | 3803/4360 [1:37:10<14:13, 1.53s/it, loss=9.16, v_num=0]
Epoch 0: 87%|████████▋ | 3804/4360 [1:37:11<14:12, 1.53s/it, loss=9.26, v_num=0]
Epoch 0: 87%|████████▋ | 3805/4360 [1:37:14<14:10, 1.53s/it, loss=9.27, v_num=0]
Epoch 0: 87%|████████▋ | 3806/4360 [1:37:15<14:09, 1.53s/it, loss=nan, v_num=0]
Epoch 0: 87%|████████▋ | 3807/4360 [1:37:17<14:07, 1.53s/it, loss=nan, v_num=0]
Epoch 0: 87%|████████▋ | 3808/4360 [1:37:18<14:06, 1.53s/it, loss=nan, v_num=0]

(Each step is interleaved with "Epoch 03xxx: adjusting learning rate of group 0/1 to ~1.20e-05" messages, printed once per process.)
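Not part of this repo, just a generic debugging aid that could be dropped into the training step: a helper that reports whether the loss or any gradient has gone non-finite, to narrow down whether the nan first appears in the forward pass or in the gradient of a particular LoRA weight. The helper name is illustrative.

import torch

def check_finite(model, loss):
    """Return True if the loss and every gradient are finite; otherwise print
    the first offending tensor so the step and parameter where nan/inf first
    appears can be identified."""
    if not torch.isfinite(loss).all():
        print(f"non-finite loss: {loss}")
        return False
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in: {name}")
            return False
    return True

# Usage sketch: call after loss.backward(); when it returns False, dump the
# offending batch and skip the optimizer step instead of updating the weights.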

wtypty2 commented 1 year ago

The configuration is as follows:

config.json

{ "architectures": [ "ChatGLMModel" ], "auto_map": { "AutoConfig": "configuration_chatglm.ChatGLMConfig", "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration", "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration" }, "bos_token_id": 150004, "eos_token_id": 150005, "hidden_size": 4096, "inner_hidden_size": 16384, "layernorm_epsilon": 1e-05, "max_sequence_length": 2048, "model_type": "chatglm", "num_attention_heads": 32, "num_layers": 8, "position_encoding_2d": true, "torch_dtype": "float16", "transformers_version": "4.23.1", "use_cache": true, "vocab_size": 150528, "precision": 16 }

train_info_args = {

'devices': 3,
'data_backend': 'record',
'model_type': 'chatglm',
# pretrained model path; leave empty to train from scratch
'model_name_or_path': '/mnt/d/MD/models--THUDM--chatglm-6b/snapshots/4a9b711e61d62b64ae8a07d763553a98a984d281',
'config_name': './config/config_small.json',
'tokenizer_name': '/mnt/d/Tokenizer/models--THUDM--chatglm-6b/snapshots/4a9b711e61d62b64ae8a07d763553a98a984d281',
'convert_onnx': False, # whether to export the model to ONNX
'do_train': True,
'train_file':  [ './data/guanaco.json'],
'max_epochs': 5,
'max_steps': -1,
'optimizer': 'lion', # one of adamw,adam,lamb,lion

# 'scheduler_type': 'linear',  # one of [linear, WarmupCosine, CAWR, CAL, Step, ReduceLROnPlateau]
# 'scheduler': None,

# switch the scheduler type
# 'scheduler_type': 'WarmupCosine',
# 'scheduler': None,

# 'scheduler_type': 'ReduceLROnPlateau',
# 'scheduler': None,

# 'scheduler_type': 'Step',
# 'scheduler':{ 'decay_rate': 0.999,'decay_steps': 100,'verbose': True},

'scheduler_type': 'CAWR',
'scheduler':{'T_mult': 1, 'rewarm_epoch_num': 2, 'verbose': True},

# 'scheduler_type': 'CAL',
# 'scheduler': {'rewarm_epoch_num': 2,'verbose': True},

'optimizer_betas': (0.9, 0.999),
'train_batch_size': 4,
'eval_batch_size': 2,
'test_batch_size': 2,
'learning_rate': 2e-5,  #
'adam_epsilon': 1e-8,
'gradient_accumulation_steps': 1,
'max_grad_norm': 1.0,
'weight_decay': 0,
'warmup_steps': 0,
'output_dir': './output',
'max_seq_length': 768, # with sufficient resources, 2048 is recommended to match the official setting
'max_target_length': 100,  # maximum generation length, reserved field
'use_fast_tokenizer': False,
'do_lower_case': False,

##############  LoRA module
'with_lora': True,  # whether to enable the LoRA module
'inference_mode': False, # inference mode, no need to set manually
'r': 8,
'target_modules': ['query_key_value'],
'target_dtype': '16',
'lora_alpha': 32,
# 'enable_lora': [True],
'enable_lora': None,
'lora_dropout': 0.1,
'bias': 'none',  # bias type for LoRA; can be 'none', 'all' or 'lora_only'

}

LoRA mode does not support DeepSpeed for now:

enable_deepspeed = False
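Since the report is that multi-GPU training kicks in by default once DeepSpeed is off, a quick isolation test (an assumption, not a confirmed fix) is to force a single device and, optionally, fall back to a more conservative optimizer from the list above, then check whether the loss still goes to nan:

# Hypothetical tweaks to the train_info_args above, only for isolating the issue
train_info_args['devices'] = 1           # single GPU instead of 3
train_info_args['optimizer'] = 'adamw'   # adamw is also in the supported list above
train_info_args['learning_rate'] = 1e-5  # halve the learning rate

# Alternatively, limit the visible GPUs when launching (script name assumed):
#   CUDA_VISIBLE_DEVICES=0 python train.py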

wtypty2 commented 1 year ago

Using the latest version of the code, PyTorch 2.0 with CUDA 11.7.

ssbuild commented 1 year ago

> Using the latest version of the code, PyTorch 2.0 with CUDA 11.7.

LoRA needs to be trained against the full set of parameters (all layers). This has been tested; reopen it if still necessary.

cristianohello commented 1 year ago

@ssbuild

Where do I set LoRA full-parameter (all-layer) training?

ssbuild commented 1 year ago

> @ssbuild
>
> Where do I set LoRA full-parameter (all-layer) training?

See the training section of the README: all layers.
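A guess at what "all layers" refers to, based on the config above: config_small.json builds only 8 transformer layers, while the full ChatGLM-6B checkpoint has 28, so the pretrained weights (and the LoRA adapters placed on query_key_value) would only cover part of the model. A hypothetical change, with the file name assumed, would be to point config_name at a config whose num_layers matches the checkpoint:

# Hypothetical: use the full 28-layer ChatGLM-6B config instead of the
# 8-layer config_small.json, so LoRA adapters sit on top of all pretrained layers.
train_info_args['config_name'] = './config/config.json'   # file name assumed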