modelscope / ms-swift

Use PEFT or Full-parameter to finetune 300+ LLMs or 80+ MLLMs. (Qwen2, GLM4v, Internlm2.5, Yi, Llama3.1, Llava-Video, Internvl2, MiniCPM-V-2.6, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

GLM4V fine-tuning shows no improvement at all #1113

Closed · ljch2018 closed this 3 months ago

ljch2018 commented 3 months ago

https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/glm4v-best-practice.md

I used the command from the documentation, but eval_acc never changes.

Experimental environment: A100

40GB GPU memory

CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type glm4v-9b-chat \
    --dataset coco-en-2-mini

ljch2018 commented 3 months ago
Train:   4%|▍         | 100/2506 [06:59<2:18:47,  3.46s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 65.714, 'eval_samples_per_second': 6.163, 'eval_steps_per_second': 6.163, 'epoch': 0.02, 'global_step': 50}
Train:   6%|▌         | 150/2506 [10:57<2:12:51,  3.38s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 62.8879, 'eval_samples_per_second': 6.44, 'eval_steps_per_second': 6.44, 'epoch': 0.04, 'global_step': 100}
Train:   8%|▊         | 200/2506 [14:55<2:11:27,  3.42s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 62.7981, 'eval_samples_per_second': 6.449, 'eval_steps_per_second': 6.449, 'epoch': 0.06, 'global_step': 150}
Train:  10%|▉         | 250/2506 [18:52<2:08:15,  3.41s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 62.8016, 'eval_samples_per_second': 6.449, 'eval_steps_per_second': 6.449, 'epoch': 0.08, 'global_step': 200}
Train:  12%|█▏        | 300/2506 [22:49<2:04:41,  3.39s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 62.7774, 'eval_samples_per_second': 6.451, 'eval_steps_per_second': 6.451, 'epoch': 0.1, 'global_step': 250}
Train:  14%|█▍        | 350/2506 [26:51<2:05:18,  3.49s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 64.7775, 'eval_samples_per_second': 6.252, 'eval_steps_per_second': 6.252, 'epoch': 0.12, 'global_step': 300}
Train:  16%|█▌        | 400/2506 [30:50<2:00:56,  3.45s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 64.2418, 'eval_samples_per_second': 6.304, 'eval_steps_per_second': 6.304, 'epoch': 0.14, 'global_step': 350}
Train:  18%|█▊        | 450/2506 [34:48<1:56:03,  3.39s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 63.7747, 'eval_samples_per_second': 6.35, 'eval_steps_per_second': 6.35, 'epoch': 0.16, 'global_step': 400}
Train:  20%|█▉        | 500/2506 [38:49<1:54:25,  3.42s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 63.0112, 'eval_samples_per_second': 6.427, 'eval_steps_per_second': 6.427, 'epoch': 0.18, 'global_step': 450}
Train:  22%|██▏       | 550/2506 [42:55<1:53:31,  3.48s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 67.6752, 'eval_samples_per_second': 5.984, 'eval_steps_per_second': 5.984, 'epoch': 0.2, 'global_step': 500}
Train:  24%|██▍       | 600/2506 [46:56<1:49:00,  3.43s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 65.0543, 'eval_samples_per_second': 6.226, 'eval_steps_per_second': 6.226, 'epoch': 0.22, 'global_step': 550}
Train:  26%|██▌       | 650/2506 [50:52<1:45:32,  3.41s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 62.2141, 'eval_samples_per_second': 6.51, 'eval_steps_per_second': 6.51, 'epoch': 0.24, 'global_step': 600}
Train:  28%|██▊       | 700/2506 [54:49<1:46:19,  3.53s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 61.5351, 'eval_samples_per_second': 6.582, 'eval_steps_per_second': 6.582, 'epoch': 0.26, 'global_step': 650}
Train:  30%|██▉       | 750/2506 [58:50<1:44:32,  3.57s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 62.5144, 'eval_samples_per_second': 6.479, 'eval_steps_per_second': 6.479, 'epoch': 0.28, 'global_step': 700}
Train:  32%|███▏      | 800/2506 [1:02:52<1:43:04,  3.63s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 63.7197, 'eval_samples_per_second': 6.356, 'eval_steps_per_second': 6.356, 'epoch': 0.3, 'global_step': 750}
Train:  34%|███▍      | 850/2506 [1:06:51<1:33:54,  3.40s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 63.4718, 'eval_samples_per_second': 6.381, 'eval_steps_per_second': 6.381, 'epoch': 0.32, 'global_step': 800}
Train:  36%|███▌      | 900/2506 [1:10:49<1:32:51,  3.47s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 62.5162, 'eval_samples_per_second': 6.478, 'eval_steps_per_second': 6.478, 'epoch': 0.34, 'global_step': 850}
Train:  38%|███▊      | 950/2506 [1:14:48<1:29:08,  3.44s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 63.4845, 'eval_samples_per_second': 6.38, 'eval_steps_per_second': 6.38, 'epoch': 0.36, 'global_step': 900}
Train:  40%|███▉      | 1000/2506 [1:18:47<1:25:38,  3.41s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 63.4577, 'eval_samples_per_second': 6.382, 'eval_steps_per_second': 6.382, 'epoch': 0.38, 'global_step': 950}
Train:  42%|████▏     | 1050/2506 [1:22:48<1:23:46,  3.45s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 64.2354, 'eval_samples_per_second': 6.305, 'eval_steps_per_second': 6.305, 'epoch': 0.4, 'global_step': 1000}
Train:  44%|████▍     | 1100/2506 [1:26:48<1:20:58,  3.46s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 63.9206, 'eval_samples_per_second': 6.336, 'eval_steps_per_second': 6.336, 'epoch': 0.42, 'global_step': 1050}
Train:  46%|████▌     | 1150/2506 [1:30:49<1:18:15,  3.46s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 63.8943, 'eval_samples_per_second': 6.339, 'eval_steps_per_second': 6.339, 'epoch': 0.44, 'global_step': 1100}
Train:  48%|████▊     | 1200/2506 [1:34:51<1:14:31,  3.42s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 64.287, 'eval_samples_per_second': 6.3, 'eval_steps_per_second': 6.3, 'epoch': 0.46, 'global_step': 1150}
Train:  50%|████▉     | 1250/2506 [1:38:52<1:13:34,  3.51s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 63.4783, 'eval_samples_per_second': 6.38, 'eval_steps_per_second': 6.38, 'epoch': 0.48, 'global_step': 1200}
Train:  52%|█████▏    | 1300/2506 [1:42:50<1:11:03,  3.53s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 62.0242, 'eval_samples_per_second': 6.53, 'eval_steps_per_second': 6.53, 'epoch': 0.5, 'global_step': 1250}
Train:  54%|█████▍    | 1350/2506 [1:46:52<1:06:20,  3.44s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 63.4874, 'eval_samples_per_second': 6.379, 'eval_steps_per_second': 6.379, 'epoch': 0.52, 'global_step': 1300}
Train:  56%|█████▌    | 1400/2506 [1:50:52<1:03:32,  3.45s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 62.8762, 'eval_samples_per_second': 6.441, 'eval_steps_per_second': 6.441, 'epoch': 0.54, 'global_step': 1350}
Train:  58%|█████▊    | 1450/2506 [1:54:57<1:02:50,  3.57s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 64.1038, 'eval_samples_per_second': 6.318, 'eval_steps_per_second': 6.318, 'epoch': 0.56, 'global_step': 1400}
Train:  60%|█████▉    | 1500/2506 [1:59:05<59:40,  3.56s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 66.7413, 'eval_samples_per_second': 6.068, 'eval_steps_per_second': 6.068, 'epoch': 0.58, 'global_step': 1450}
Train:  62%|██████▏   | 1550/2506 [2:03:10<55:18,  3.47s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 66.4249, 'eval_samples_per_second': 6.097, 'eval_steps_per_second': 6.097, 'epoch': 0.6, 'global_step': 1500}
Train:  64%|██████▍   | 1600/2506 [2:07:13<52:50,  3.50s/it]{'eval_loss': 4.63580227, 'eval_acc': 0.29943393, 'eval_runtime': 64.9037, 'eval_samples_per_second': 6.24, 'eval_steps_per_second
Jintao-Huang commented 3 months ago

It works fine on my end. Try pulling the latest code on the main branch and running it again (a sketch of how to update follows the log below).

{"loss": 4.73242188, "acc": 0.28852928, "grad_norm": 3.015625, "learning_rate": 7.9e-07, "memory(GiB)": 26.69, "train_speed(iter/s)": 0.015415, "epoch": 0.00039901, "global_step": 1}
{"loss": 4.53930664, "acc": 0.27886444, "grad_norm": 2.96875, "learning_rate": 3.97e-06, "memory(GiB)": 26.75, "train_speed(iter/s)": 0.062339, "epoch": 0.00199506, "global_step": 5}
{"loss": 4.70488281, "acc": 0.30704112, "grad_norm": 3.21875, "learning_rate": 7.94e-06, "memory(GiB)": 26.76, "train_speed(iter/s)": 0.100074, "epoch": 0.00399012, "global_step": 10}
{"loss": 4.5796875, "acc": 0.29527118, "grad_norm": 3.125, "learning_rate": 1.19e-05, "memory(GiB)": 26.76, "train_speed(iter/s)": 0.125434, "epoch": 0.00598519, "global_step": 15}
{"loss": 4.48105469, "acc": 0.31034319, "grad_norm": 3.25, "learning_rate": 1.587e-05, "memory(GiB)": 26.75, "train_speed(iter/s)": 0.144182, "epoch": 0.00798025, "global_step": 20}
{"loss": 4.60878906, "acc": 0.2927319, "grad_norm": 3.96875, "learning_rate": 1.984e-05, "memory(GiB)": 26.75, "train_speed(iter/s)": 0.158645, "epoch": 0.00997531, "global_step": 25}
{"loss": 4.54609375, "acc": 0.30334492, "grad_norm": 3.71875, "learning_rate": 2.381e-05, "memory(GiB)": 26.75, "train_speed(iter/s)": 0.169621, "epoch": 0.01197037, "global_step": 30}
{"loss": 4.16171875, "acc": 0.30512872, "grad_norm": 3.96875, "learning_rate": 2.778e-05, "memory(GiB)": 26.8, "train_speed(iter/s)": 0.178711, "epoch": 0.01396544, "global_step": 35}
{"loss": 4.12519531, "acc": 0.30402834, "grad_norm": 3.390625, "learning_rate": 3.175e-05, "memory(GiB)": 26.78, "train_speed(iter/s)": 0.186231, "epoch": 0.0159605, "global_step": 40}
{"loss": 3.75292969, "acc": 0.34323542, "grad_norm": 3.71875, "learning_rate": 3.571e-05, "memory(GiB)": 26.78, "train_speed(iter/s)": 0.192634, "epoch": 0.01795556, "global_step": 45}
{"loss": 3.45537109, "acc": 0.38217406, "grad_norm": 4.53125, "learning_rate": 3.968e-05, "memory(GiB)": 26.78, "train_speed(iter/s)": 0.196665, "epoch": 0.01995062, "global_step": 50}
{"eval_loss": 3.35488033, "eval_acc": 0.38200273, "eval_runtime": 46.2627, "eval_samples_per_second": 8.754, "eval_steps_per_second": 8.754, "epoch": 0.01995062, "global_step": 50}
{"loss": 3.19121094, "acc": 0.37815187, "grad_norm": 4.40625, "learning_rate": 4.365e-05, "memory(GiB)": 26.92, "train_speed(iter/s)": 0.172221, "epoch": 0.02194568, "global_step": 55}
{"loss": 2.90654297, "acc": 0.43412285, "grad_norm": 2.78125, "learning_rate": 4.762e-05, "memory(GiB)": 26.88, "train_speed(iter/s)": 0.177257, "epoch": 0.02394075, "global_step": 60}
{"loss": 2.859375, "acc": 0.45457735, "grad_norm": 2.109375, "learning_rate": 5.159e-05, "memory(GiB)": 26.88, "train_speed(iter/s)": 0.181762, "epoch": 0.02593581, "global_step": 65}
{"loss": 2.95292969, "acc": 0.41753917, "grad_norm": 3.96875, "learning_rate": 5.556e-05, "memory(GiB)": 26.88, "train_speed(iter/s)": 0.186164, "epoch": 0.02793087, "global_step": 70}
{"loss": 2.75273438, "acc": 0.45084119, "grad_norm": 4.09375, "learning_rate": 5.952e-05, "memory(GiB)": 26.88, "train_speed(iter/s)": 0.190068, "epoch": 0.02992593, "global_step": 75}
{"loss": 2.72568359, "acc": 0.44446201, "grad_norm": 2.390625, "learning_rate": 6.349e-05, "memory(GiB)": 26.89, "train_speed(iter/s)": 0.193351, "epoch": 0.031921, "global_step": 80}
{"loss": 2.57392578, "acc": 0.48489127, "grad_norm": 2.609375, "learning_rate": 6.746e-05, "memory(GiB)": 26.88, "train_speed(iter/s)": 0.196415, "epoch": 0.03391606, "global_step": 85}
{"loss": 2.84160156, "acc": 0.43445544, "grad_norm": 3.53125, "learning_rate": 7.143e-05, "memory(GiB)": 26.89, "train_speed(iter/s)": 0.19929, "epoch": 0.03591112, "global_step": 90}
{"loss": 2.65341797, "acc": 0.49305367, "grad_norm": 2.171875, "learning_rate": 7.54e-05, "memory(GiB)": 26.89, "train_speed(iter/s)": 0.201682, "epoch": 0.03790618, "global_step": 95}
{"loss": 2.58632813, "acc": 0.4666573, "grad_norm": 2.796875, "learning_rate": 7.937e-05, "memory(GiB)": 26.89, "train_speed(iter/s)": 0.204158, "epoch": 0.03990124, "global_step": 100}
{"eval_loss": 2.6494019, "eval_acc": 0.47374585, "eval_runtime": 48.8501, "eval_samples_per_second": 8.291, "eval_steps_per_second": 8.291, "epoch": 0.03990124, "global_step": 100}
{"loss": 2.42949219, "acc": 0.49985466, "grad_norm": 2.375, "learning_rate": 8.333e-05, "memory(GiB)": 26.92, "train_speed(iter/s)": 0.187987, "epoch": 0.04189631, "global_step": 105}
{"loss": 2.82060547, "acc": 0.45991826, "grad_norm": 2.5625, "learning_rate": 8.73e-05, "memory(GiB)": 26.88, "train_speed(iter/s)": 0.190483, "epoch": 0.04389137, "global_step": 110}
{"loss": 2.73212891, "acc": 0.46720352, "grad_norm": 2.59375, "learning_rate": 9.127e-05, "memory(GiB)": 26.88, "train_speed(iter/s)": 0.192628, "epoch": 0.04588643, "global_step": 115}
{"loss": 2.66347656, "acc": 0.48635449, "grad_norm": 3.15625, "learning_rate": 9.524e-05, "memory(GiB)": 26.9, "train_speed(iter/s)": 0.194874, "epoch": 0.04788149, "global_step": 120}
{"loss": 2.62978516, "acc": 0.45872507, "grad_norm": 3.046875, "learning_rate": 9.921e-05, "memory(GiB)": 26.88, "train_speed(iter/s)": 0.196905, "epoch": 0.04987656, "global_step": 125}
{"loss": 2.60009766, "acc": 0.47549357, "grad_norm": 3.84375, "learning_rate": 9.983e-05, "memory(GiB)": 26.88, "train_speed(iter/s)": 0.198863, "epoch": 0.05187162, "global_step": 130}
{"loss": 2.50273438, "acc": 0.48703203, "grad_norm": 3.265625, "learning_rate": 9.962e-05, "memory(GiB)": 26.89, "train_speed(iter/s)": 0.200753, "epoch": 0.05386668, "global_step": 135}
{"loss": 2.65087891, "acc": 0.47238474, "grad_norm": 1.9375, "learning_rate": 9.941e-05, "memory(GiB)": 26.89, "train_speed(iter/s)": 0.202394, "epoch": 0.05586174, "global_step": 140}
{"loss": 2.54736328, "acc": 0.48591805, "grad_norm": 2.1875, "learning_rate": 9.92e-05, "memory(GiB)": 26.88, "train_speed(iter/s)": 0.204068, "epoch": 0.0578568, "global_step": 145}
{"loss": 2.55292969, "acc": 0.46213369, "grad_norm": 2.15625, "learning_rate": 9.899e-05, "memory(GiB)": 26.88, "train_speed(iter/s)": 0.205514, "epoch": 0.05985187, "global_step": 150}
ljch2018 commented 3 months ago

@Jintao-Huang I am running the latest code from the main branch. I ran it on a V100; could it be that this GPU model isn't supported?

Jintao-Huang commented 3 months ago

Check the command-line log to see how many parameters are trainable.
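
If the console output scrolls past, one hedged way to capture that summary is to tee the run into a file and grep it afterwards (train.log is just a name chosen for this example):

CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type glm4v-9b-chat \
    --dataset coco-en-2-mini 2>&1 | tee train.log
# The summary line to look for has this shape:
# [INFO:swift] PeftModelForCausalLM: ...M Params (...M Trainable [...%]), ...M Buffers.
grep "Trainable" train.log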

ljch2018 commented 3 months ago

@Jintao-Huang
[INFO:swift] lora_target_modules: ['self_attention.query_key_value']
[INFO:swift] PeftModelForCausalLM: 13909.1062M Params (2.7853M Trainable [0.0200%]), 0.0000M Buffers.

The training command I used is:

swift sft --model_id_or_path /record/llm_models/glm-4v-9b/ --model_type glm4v-9b-chat --dataset coco-en-2-mini

The glm-4v-9b model was downloaded from Hugging Face, i.e., the official GLM-4 weights. I'm not sure whether that matters.

ljch2018 commented 3 months ago

run sh: python /share/jchluo/swift/swift/cli/sft.py --model_id_or_path /record/llm_models/glm-4v-9b/ --model_type glm4v-9b-chat --dataset coco-en-2-mini
[INFO:swift] Successfully registered /share/jchluo/swift/swift/llm/data/dataset_info.json
[INFO:swift] Start time of running main: 2024-06-11 19:42:18.349730
[INFO:swift] Setting template_type: glm4v
[INFO:swift] Setting args.lazy_tokenize: True
[INFO:swift] Setting args.dataloader_num_workers: 1
[INFO:swift] output_dir: /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218
[INFO:swift] args: SftArguments(model_type='glm4v-9b-chat', model_id_or_path='/record/llm_models/glm-4v-9b', model_revision='master', sft_type='lora', freeze_parameters=0.0, additional_trainable_parameters=[], tuner_backend='peft', template_type='glm4v', output_dir='/share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218', add_output_dir_suffix=True, ddp_backend=None, ddp_find_unused_parameters=None, ddp_broadcast_buffers=None, seed=42, resume_from_checkpoint=None, ignore_data_skip=False, dtype='fp16', packing=False, dataset=['coco-en-2-mini'], val_dataset=[], dataset_seed=42, dataset_test_ratio=0.01, use_loss_scale=False, system=None, tools_prompt='react_en', max_length=2048, truncation_strategy='delete', check_dataset_strategy='none', model_name=[None, None], model_author=[None, None], quant_method=None, quantization_bit=0, hqq_axis=0, hqq_dynamic_config_path=None, bnb_4bit_comp_dtype='fp16', bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, lora_target_modules=['self_attention.query_key_value'], lora_rank=8, lora_alpha=32, lora_dropout_p=0.05, lora_bias_trainable='none', lora_modules_to_save=[], lora_dtype='AUTO', lora_lr_ratio=None, use_rslora=False, use_dora=False, init_lora_weights='true', rope_scaling=None, boft_block_size=4, boft_block_num=0, boft_n_butterfly_factor=1, boft_target_modules=['DEFAULT'], boft_dropout=0.0, boft_modules_to_save=[], vera_rank=256, vera_target_modules=['DEFAULT'], vera_projection_prng_key=0, vera_dropout=0.0, vera_d_initial=0.1, vera_modules_to_save=[], adapter_act='gelu', adapter_length=128, use_galore=False, galore_rank=128, galore_target_modules=None, galore_update_proj_gap=50, galore_scale=1.0, galore_proj_type='std', galore_optim_per_parameter=False, galore_with_embedding=False, adalora_target_r=8, adalora_init_r=12, adalora_tinit=0, adalora_tfinal=0, adalora_deltaT=1, adalora_beta1=0.85, adalora_beta2=0.85, adalora_orth_reg_weight=0.5, ia3_target_modules=['DEFAULT'], ia3_feedforward_modules=[], ia3_modules_to_save=[], llamapro_num_new_blocks=4, llamapro_num_groups=None, neftune_noise_alpha=None, neftune_backend='transformers', lisa_activated_layers=0, lisa_step_interval=20, gradient_checkpointing=True, deepspeed=None, batch_size=1, eval_batch_size=1, num_train_epochs=1, max_steps=-1, optim='adamw_torch', adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, learning_rate=0.0001, weight_decay=0.1, gradient_accumulation_steps=16, max_grad_norm=0.5, predict_with_generate=False, lr_scheduler_type='linear', warmup_ratio=0.05, eval_steps=50, save_steps=50, save_only_model=False, save_total_limit=2, logging_steps=5, dataloader_num_workers=1, dataloader_pin_memory=True, dataloader_drop_last=False, push_to_hub=False, hub_model_id=None, hub_token=None, hub_private_repo=False, push_hub_strategy='push_best', test_oom_error=False, disable_tqdm=False, lazy_tokenize=True, preprocess_num_proc=1, use_flash_attn=None, ignore_args_error=False, check_model_is_latest=True, logging_dir='/share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/runs', report_to=['tensorboard'], acc_strategy='token', save_on_each_node=True, evaluation_strategy='steps', save_strategy='steps', save_safetensors=True, gpu_memory_fraction=None, include_num_input_tokens_seen=False, local_repo_path=None, custom_register_path=None, custom_dataset_info=None, device_map_config_path=None, max_new_tokens=2048, do_sample=True, temperature=0.3, top_k=20, top_p=0.7, repetition_penalty=1.0, num_beams=1, fsdp='', fsdp_config=None, sequence_parallel_size=1, model_layer_cls_name=None, metric_warmup_step=0, fsdp_num=1, per_device_train_batch_size=None, per_device_eval_batch_size=None, eval_strategy=None, self_cognition_sample=0, train_dataset_mix_ratio=0.0, train_dataset_mix_ds=['ms-bench'], train_dataset_sample=-1, val_dataset_sample=None, safe_serialization=None, only_save_model=None, neftune_alpha=None, deepspeed_config_path=None, model_cache_dir=None, custom_train_dataset_path=[], custom_val_dataset_path=[])
[INFO:swift] Global seed set to 42
[INFO:swift] Loading the model using model_dir: /record/llm_models/glm-4v-9b
[INFO:swift] model.max_model_len: 8192
[INFO:swift] model_config: ChatGLMConfig {
[INFO:swift] generation_config: GenerationConfig {
[INFO:swift] lora_target_modules: ['self_attention.query_key_value']
[INFO:swift] lora_modules_to_save: []
[INFO:swift] lora_config: get_wrapped_class..PeftWrapper(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='/record/llm_models/glm-4v-9b', revision=None, task_type='CAUSAL_LM', inference_mode=False, r=8, target_modules={'self_attention.query_key_value'}, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=[], init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None, lora_dtype=None, lorap_lr_ratio=None, lorap_emb_lr=1e-06)
[INFO:swift] Convert trainable parameters from fp16 to fp32.
[INFO:swift] [base_model.model.transformer.embedding.word_embeddings.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.0.input_layernorm.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.0.self_attention.query_key_value.base_layer.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.0.self_attention.query_key_value.base_layer.bias]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.0.self_attention.query_key_value.lora_A.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.0.self_attention.query_key_value.lora_B.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.0.self_attention.dense.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.0.post_attention_layernorm.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.0.mlp.dense_h_to_4h.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.0.mlp.dense_4h_to_h.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.1.input_layernorm.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.1.self_attention.query_key_value.base_layer.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.1.self_attention.query_key_value.base_layer.bias]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.1.self_attention.query_key_value.lora_A.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.1.self_attention.query_key_value.lora_B.default.weight]: requires_grad=True, dtype=torch.float32, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.1.self_attention.dense.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.1.post_attention_layernorm.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.1.mlp.dense_h_to_4h.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.1.mlp.dense_4h_to_h.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] [base_model.model.transformer.encoder.layers.2.input_layernorm.weight]: requires_grad=False, dtype=torch.float16, device=cuda:0
[INFO:swift] ...
[INFO:swift] PeftModelForCausalLM(
[INFO:swift] PeftModelForCausalLM: 13909.1062M Params (2.7853M Trainable [0.0200%]), 0.0000M Buffers.
[INFO:swift] Setting model.config.use_cache: False
[INFO:swift] Downloading the dataset from ModelScope, dataset_id: modelscope/coco_2014_caption
[INFO:swift] train_dataset: Dataset({
[INFO:swift] val_dataset: Dataset({
[INFO:swift] system: None
[INFO:swift] args.lazy_tokenize: True
[INFO:swift] [INPUT_IDS] [151331, 151333, 151336, 198, 151339, 151329, 151340, 29904, 7512, 279, 2168, 13, 151337, 32, 2613, 14841, 42449, 448, 264, 26133, 323, 10718, 14511, 151329]
[INFO:swift] [INPUT] [gMASK] <|user|>
[INFO:swift] [LABLES_IDS] [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 32, 2613, 14841, 42449, 448, 264, 26133, 323, 10718, 14511, 151329]
[INFO:swift] [LABLES] [-100 * 13]A small bathroom stall with a toilet and seat covers <|endoftext|>
[INFO:swift] training_args: Seq2SeqTrainingArguments( logging_dir=/share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/runs, output_dir=/share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218, run_name=/share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218,
[INFO:swift] The SftArguments will be saved in: /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/sft_args.json
[INFO:swift] The Seq2SeqTrainingArguments will be saved in: /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/training_args.json
[INFO:swift] The logging file will be saved in: /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/logging.jsonl
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-50
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-100
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-150
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-200
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-250
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-300
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-350
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-400
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-450
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-500
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-550
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-600
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-650
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-700
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-750
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-800
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-850
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-900
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-950
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1000
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1050
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1100
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1150
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1200
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1250
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1300
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1350
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1400
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1450
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1500
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1550
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1600
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1650
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1700
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1750
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1800
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1850
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1900
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-1950
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2000
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2050
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2100
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2150
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2200
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2250
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2300
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2350
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2400
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2450
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2500
[INFO:swift] Saving model checkpoint to /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2506
[INFO:swift] last_model_checkpoint: /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-2506
[INFO:swift] best_model_checkpoint: /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/checkpoint-50
[INFO:swift] images_dir: /share/jchluo/swift/output/glm4v-9b-chat/v3-20240611-194218/images
[INFO:swift] End time of running main: 2024-06-11 23:03:09.141456

ljch2018 commented 3 months ago

It does seem to be a problem with the V100 GPU: after I switched to an A100 it worked. @Jintao-Huang

Jintao-Huang commented 3 months ago

Most likely it's because fp16 can't be used here: the values went to NaN.

https://github.com/modelscope/swift/issues/1099
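
Background for readers hitting the same issue: the V100 (compute capability 7.0) has no bfloat16 support, so training falls back to fp16, whose narrow exponent range can overflow to NaN; the A100 (compute capability 8.0) supports bf16. A quick check on the target machine:

# Prints the compute capability: (7, 0) on V100, (8, 0) on A100.
python -c "import torch; print(torch.cuda.get_device_capability())"
# Prints whether bf16 is usable: False on V100, True on A100.
python -c "import torch; print(torch.cuda.is_bf16_supported())"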

chensongcan commented 2 months ago

You can pull the latest code. The main issue was that the earlier code did not pass the images during training, so it was training on the text data only.