hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs
Apache License 2.0

Help! After training finishes and the model is loaded, asking it questions taken from the training dataset does not return the dataset's answers. Training took less than 2 minutes, so I suspect it failed. Please help me diagnose this!! #4552

Closed muliu closed 3 days ago

muliu commented 3 days ago

Reminder

System Info

LLaMA Factory version: 0.8.3.dev0
Ubuntu: 20.04
Linux kernel: 5.11.0-46
GPU: 3090
CPU: 10 cores
RAM: 56 GB
Python: 3.11.7
CUDA: 12.2

Reproduction

The training used a custom dataset with only 184 conversations. Training was launched from the webui and took less than 2 minutes.
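For context on how such a custom dataset plugs into the command below: LLaMA-Factory reads custom datasets from the dataset_info.json registry inside --dataset_dir. Judging from the log (which loads TCS_data_epp.json for the dataset name tcs_data), the entry was presumably something like this minimal sketch; the exact file format is an assumption, since the data file itself is not shown here:

```json
{
  "tcs_data": {
    "file_name": "TCS_data_epp.json"
  }
}
```

If the file uses non-default field names, a "columns" mapping would also be needed; that detail is not visible in this thread.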

Command line:

llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /home/jovyan/fast-data/models/Qwen1.5-7B-chat/qwen/Qwen1___5-7B-chat \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template qwen \
    --flash_attn auto \
    --dataset_dir data \
    --dataset tcs_data \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves/Qwen1.5-7B-Chat/lora/train_2024-06-26-07-12-22 \
    --fp16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all
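For reference, the short runtime is exactly what these settings predict (rough arithmetic):

effective batch size = 2 (per-device) × 8 (gradient accumulation) = 16
optimizer steps per epoch = 184 samples / 16 ≈ 11 (the trainer floors this)
total steps = 11 × 3 epochs = 33, at roughly 3 s per step ≈ 1 min 45 s

This matches the "Total optimization steps = 33" and the train_runtime of about 105 s reported in the log below, so the job did run to completion rather than crashing early.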

Shell output:

root@m-tcs-power-0-0:/home/jovyan/fast-data/models/LLaMA-Factory# llamafactory-cli webui
Running on local URL: http://0.0.0.0:7861

To create a public link, set share=True in launch(). 06/26/2024 07:13:48 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16 [INFO|tokenization_utils_base.py:2106] 2024-06-26 07:13:48,046 >> loading file vocab.json [INFO|tokenization_utils_base.py:2106] 2024-06-26 07:13:48,046 >> loading file merges.txt [INFO|tokenization_utils_base.py:2106] 2024-06-26 07:13:48,046 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2106] 2024-06-26 07:13:48,046 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2106] 2024-06-26 07:13:48,046 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2106] 2024-06-26 07:13:48,046 >> loading file tokenizer_config.json [WARNING|logging.py:314] 2024-06-26 07:13:48,269 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 06/26/2024 07:13:48 - INFO - llamafactory.data.template - Replace eos token: <|im_end|> 06/26/2024 07:13:48 - INFO - llamafactory.data.loader - Loading dataset TCS_data_epp.json... input_ids: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 14880, 109432, 18, 21, 15, 104992, 99464, 108635, 1773, 151645, 198, 151644, 77091, 198, 18, 21, 15, 44054, 230, 78882, 99464, 108635, 103951, 101219, 220, 18, 21, 15, 41479, 231, 35987, 105675, 99226, 99473, 107096, 16872, 3837, 23031, 104315, 5373, 107692, 5373, 104455, 49567, 106571, 17714, 104069, 3837, 23031, 101650, 47874, 17714, 101048, 3837, 42067, 99287, 101320, 5373, 104992, 99464, 105215, 101034, 104992, 100168, 113308, 103394, 104587, 52334, 99464, 82700, 8997, 21894, 82700, 67338, 99287, 101320, 5373, 99622, 101142, 39352, 57218, 108298, 104749, 5373, 109173, 5373, 104992, 105215, 5373, 110150, 100359, 5373, 104992, 101978, 102808, 1479, 49, 111233, 3837, 106908, 104992, 104652, 107193, 106552, 106881, 104238, 5373, 104652, 102146, 9370, 32876, 99733, 99788, 5373, 104652, 101098, 9370, 104776, 99788, 3837, 88086, 113917, 100169, 75405, 99313, 101320, 33108, 104112, 105204, 3837, 100364, 74413, 97797, 20002, 104004, 100652, 104775, 104992, 99464, 109020, 1773, 151645] inputs: <|im_start|>system You are a helpful assistant.<|im_end|> <|im_start|>user 请介绍一下QST终端安全管理系统。<|im_end|> <|im_start|>assistant QST终端安全管理系统软件是,以大数据、云计算、人工智能等新技术为支撑,以可靠服务为保障,集终端安全管控以及终端智能运维于一体的企业级安全产品。 本产品通过补丁管理与漏洞修复、资产管理、终端管控、准入控制等功能,赋予终端更为细致的安全防御策略、更为快速的处置能力,帮助用户构建持续有效的终端安全管理体系。<|im_end|> label_ids: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 18, 21, 15, 44054, 230, 78882, 99464, 108635, 103951, 101219, 220, 18, 21, 15, 41479, 231, 35987, 105675, 99226, 99473, 107096, 16872, 3837, 23031, 104315, 5373, 107692, 5373, 104455, 49567, 106571, 17714, 104069, 3837, 23031, 101650, 47874, 17714, 101048, 3837, 42067, 99287, 101320, 5373, 104992, 99464, 105215, 101034, 104992, 100168, 113308, 103394, 104587, 52334, 99464, 82700, 8997, 21894, 82700, 67338, 99287, 101320, 5373, 99622, 101142, 39352, 57218, 108298, 104749, 5373, 109173, 5373, 104992, 105215, 5373, 110150, 100359, 5373, 104992, 101978, 102808, 1479, 49, 111233, 3837, 106908, 104992, 104652, 107193, 106552, 106881, 104238, 5373, 104652, 102146, 9370, 32876, 99733, 99788, 5373, 104652, 101098, 9370, 104776, 99788, 3837, 88086, 113917, 100169, 75405, 99313, 101320, 33108, 104112, 105204, 3837, 100364, 74413, 97797, 20002, 
104004, 100652, 104775, 104992, 99464, 109020, 1773, 151645] labels: QST终端安全管理系统软件是,以大数据、云计算、人工智能等新技术为支撑,以可靠服务为保障,集终端安全管控以及终端智能运维于一体的企业级安全产品。 本产品通过补丁管理与漏洞修复、资产管理、终端管控、准入控制等功能,赋予终端更为细致的安全防御策略、更为快速的处置能力,帮助用户构建持续有效的终端安全管理体系。<|im_end|> [INFO|configuration_utils.py:731] 2024-06-26 07:13:49,140 >> loading configuration file /home/jovyan/fast-data/models/Qwen1.5-7B-chat/qwen/Qwen1_5-7B-chat/config.json [INFO|configuration_utils.py:796] 2024-06-26 07:13:49,141 >> Model config Qwen2Config { "_name_orpath": "/home/jovyan/fast-data/models/Qwen1.5-7B-chat/qwen/Qwen15-7B-chat", "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 32768, "max_window_layers": 28, "model_type": "qwen2", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 32, "rms_norm_eps": 1e-06, "rope_theta": 1000000.0, "sliding_window": 32768, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.41.2", "use_cache": true, "use_sliding_window": false, "vocab_size": 151936 }

[INFO|modeling_utils.py:3471] 2024-06-26 07:13:49,166 >> loading weights file /home/jovyan/fast-data/models/Qwen1.5-7B-chat/qwen/Qwen1___5-7B-chat/model.safetensors.index.json [INFO|modeling_utils.py:1519] 2024-06-26 07:13:49,167 >> Instantiating Qwen2ForCausalLM model under default dtype torch.float16. [INFO|configuration_utils.py:962] 2024-06-26 07:13:49,168 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645 }

Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 4/4 [00:12<00:00, 3.18s/it] [INFO|modeling_utils.py:4280] 2024-06-26 07:14:02,827 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.

[INFO|modeling_utils.py:4288] 2024-06-26 07:14:02,827 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /home/jovyan/fast-data/models/Qwen1.5-7B-chat/qwen/Qwen1_5-7B-chat. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training. [INFO|configurationutils.py:915] 2024-06-26 07:14:02,831 >> loading configuration file /home/jovyan/fast-data/models/Qwen1.5-7B-chat/qwen/Qwen15-7B-chat/generation_config.json [INFO|configuration_utils.py:962] 2024-06-26 07:14:02,831 >> Generate config GenerationConfig { "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 0.7, "top_k": 20, "top_p": 0.8 }

06/26/2024 07:14:03 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled. 06/26/2024 07:14:03 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation. 06/26/2024 07:14:03 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32. 06/26/2024 07:14:03 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA 06/26/2024 07:14:03 - INFO - llamafactory.model.model_utils.misc - Found linear modules: o_proj,up_proj,k_proj,q_proj,down_proj,gate_proj,v_proj 06/26/2024 07:14:05 - INFO - llamafactory.model.loader - trainable params: 19988480 || all params: 7741313024 || trainable%: 0.2582 [INFO|trainer.py:641] 2024-06-26 07:14:05,317 >> Using auto half precision backend [INFO|trainer.py:2078] 2024-06-26 07:14:05,541 >> Running training [INFO|trainer.py:2079] 2024-06-26 07:14:05,541 >> Num examples = 184 [INFO|trainer.py:2080] 2024-06-26 07:14:05,541 >> Num Epochs = 3 [INFO|trainer.py:2081] 2024-06-26 07:14:05,541 >> Instantaneous batch size per device = 2 [INFO|trainer.py:2084] 2024-06-26 07:14:05,541 >> Total train batch size (w. parallel, distributed & accumulation) = 16 [INFO|trainer.py:2085] 2024-06-26 07:14:05,541 >> Gradient Accumulation steps = 8 [INFO|trainer.py:2086] 2024-06-26 07:14:05,541 >> Total optimization steps = 33 [INFO|trainer.py:2087] 2024-06-26 07:14:05,545 >> Number of trainable parameters = 19,988,480 15%|███████████▋ | 5/33 [00:16<01:26, 3.09s/it]06/26/2024 07:14:21 - INFO - llamafactory.extras.callbacks - {'loss': 4.2026, 'learning_rate': 4.7221e-05, 'epoch': 0.43, 'throughput': 689.61} {'loss': 4.2026, 'grad_norm': 5.473754405975342, 'learning_rate': 4.722088621637309e-05, 'epoch': 0.43, 'num_input_tokens_seen': 11152} 30%|███████████████████████ | 10/33 [00:32<01:10, 3.07s/it]06/26/2024 07:14:37 - INFO - llamafactory.extras.callbacks - {'loss': 3.5858, 'learning_rate': 3.9501e-05, 'epoch': 0.87, 'throughput': 713.33} {'loss': 3.5858, 'grad_norm': 2.634155750274658, 'learning_rate': 3.9501422739279956e-05, 'epoch': 0.87, 'num_input_tokens_seen': 23136} 45%|██████████████████████████████████▌ | 15/33 [00:47<00:52, 2.94s/it]06/26/2024 07:14:52 - INFO - llamafactory.extras.callbacks - {'loss': 3.1906, 'learning_rate': 2.8558e-05, 'epoch': 1.30, 'throughput': 700.98} {'loss': 3.1906, 'grad_norm': 1.6839869022369385, 'learning_rate': 2.8557870956832132e-05, 'epoch': 1.3, 'num_input_tokens_seen': 33056} 61%|██████████████████████████████████████████████ | 20/33 [01:03<00:42, 3.27s/it]06/26/2024 07:15:09 - INFO - llamafactory.extras.callbacks - {'loss': 2.8992, 'learning_rate': 1.6823e-05, 'epoch': 1.74, 'throughput': 708.72} {'loss': 2.8992, 'grad_norm': 0.875791609287262, 'learning_rate': 1.682330091706446e-05, 'epoch': 1.74, 'num_input_tokens_seen': 45072} 76%|█████████████████████████████████████████████████████████▌ | 25/33 [01:19<00:25, 3.17s/it]06/26/2024 07:15:25 - INFO - llamafactory.extras.callbacks - {'loss': 2.7629, 'learning_rate': 6.9066e-06, 'epoch': 2.17, 'throughput': 712.88} {'loss': 2.7629, 'grad_norm': 0.9790578484535217, 'learning_rate': 6.906649047373246e-06, 'epoch': 2.17, 'num_input_tokens_seen': 56720} 91%|█████████████████████████████████████████████████████████████████████ | 30/33 [01:34<00:08, 2.99s/it]06/26/2024 07:15:40 - INFO - llamafactory.extras.callbacks - {'loss': 2.8635, 'learning_rate': 1.0127e-06, 'epoch': 2.61, 'throughput': 710.65} {'loss': 2.8635, 'grad_norm': 1.4785233736038208, 'learning_rate': 1.0126756596375686e-06, 'epoch': 2.61, 
'num_input_tokens_seen': 67168} 100%|████████████████████████████████████████████████████████████████████████████| 33/33 [01:44<00:00, 3.15s/it][INFO|trainer.py:2329] 2024-06-26 07:15:50,455 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 104.91, 'train_samples_per_second': 5.262, 'train_steps_per_second': 0.315, 'train_loss': 3.204081650936242, 'epoch': 2.87, 'num_input_tokens_seen': 74944} 100%|████████████████████████████████████████████████████████████████████████████| 33/33 [01:44<00:00, 3.18s/it] [INFO|trainer.py:3410] 2024-06-26 07:15:50,457 >> Saving model checkpoint to saves/Qwen1.5-7B-Chat/lora/train_2024-06-26-07-12-22 /usr/local/lib/python3.11/dist-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /home/jovyan/fast-data/models/Qwen1.5-7B-chat/qwen/Qwen1___5-7B-chat - will assume that the vocabulary was not modified. warnings.warn( [INFO|tokenization_utils_base.py:2513] 2024-06-26 07:15:50,615 >> tokenizer config file saved in saves/Qwen1.5-7B-Chat/lora/train_2024-06-26-07-12-22/tokenizer_config.json [INFO|tokenization_utils_base.py:2522] 2024-06-26 07:15:50,615 >> Special tokens file saved in saves/Qwen1.5-7B-Chat/lora/train_2024-06-26-07-12-22/special_tokens_map.json train metrics epoch = 2.8696 num_input_tokens_seen = 74944 total_flos = 2981303GF train_loss = 3.2041 train_runtime = 0:01:44.90 train_samples_per_second = 5.262 train_steps_per_second = 0.315 Figure saved at: saves/Qwen1.5-7B-Chat/lora/train_2024-06-26-07-12-22/training_loss.png 06/26/2024 07:15:50 - WARNING - llamafactory.extras.ploting - No metric eval_loss to plot. [INFO|modelcard.py:450] 2024-06-26 07:15:50,880 >> Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}

Expected behavior

When the model is asked questions taken from the training dataset, the answers should, in theory, match the answers in the training dataset.

Training finished too quickly, in less than 2 minutes.

Others

This is my first time working with this, so any pointers would be much appreciated!

hiyouga commented 3 days ago

Underfitting.

muliu commented 3 days ago

@hiyouga What do I need to do to make it "fit"? Could you explain a bit more? I don't quite understand yet, thanks!!

muliu commented 3 days ago

Here are my steps:

1. After training finishes in the LLaMA Board UI, set "Checkpoint path" to the result of the run I just trained.
2. Switch from the "Train" tab to the "Chat" tab.
3. Click "Load model"; once it reports that loading succeeded, ask the model questions.

The questions come from my custom dataset, but the answers I get are not the ones in the dataset.

Am I doing something wrong, or is there a step missing in between? Any advice is appreciated, thanks!
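For what it's worth, the same check can also be run from the command line instead of the web UI. A minimal sketch, assuming the adapter directory produced by the training run above (flag names mirror those in the training command; adjust paths as needed):

```shell
# Hypothetical invocation: chat with the base model plus the LoRA adapter saved above.
llamafactory-cli chat \
    --model_name_or_path /home/jovyan/fast-data/models/Qwen1.5-7B-chat/qwen/Qwen1___5-7B-chat \
    --adapter_name_or_path saves/Qwen1.5-7B-Chat/lora/train_2024-06-26-07-12-22 \
    --template qwen \
    --finetuning_type lora
```

Loading the checkpoint through the web UI as described above should be equivalent; this is just an alternative way to rule out a checkpoint-path mistake.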

chenchun0629 commented 2 days ago

This use case feels like a better fit for RAG.

Only 2 minutes of training; isn't the amount of data too small?

From a fine-tuning perspective, the final train_loss = 3.2041 is a bit high. You could try raising lora_rank to 16 or 32 (lora_alpha does not need to be set specially), or train for a few more epochs. Overall, though, I think the dataset is too small for fine-tuning.
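To make that suggestion concrete, here is a minimal sketch of a re-run with a larger LoRA rank and more epochs. The epoch count and output directory are illustrative only, lora_alpha is deliberately left at its default as suggested, and the remaining flags are carried over from the original command:

```shell
# Illustrative re-run: higher lora_rank and more epochs; other values as in the original command.
# The epoch count (10) and the output directory name are made up for this example.
llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /home/jovyan/fast-data/models/Qwen1.5-7B-chat/qwen/Qwen1___5-7B-chat \
    --finetuning_type lora \
    --template qwen \
    --dataset_dir data \
    --dataset tcs_data \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 10.0 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --fp16 True \
    --plot_loss True \
    --lora_rank 16 \
    --lora_target all \
    --output_dir saves/Qwen1.5-7B-Chat/lora/train_retry
```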

Just for reference~