train_loss does not match the loss logged during training

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots.

{'loss': 1.63498573, 'acc': 0.58337445, 'grad_norm': 5.90351009, 'learning_rate': 1e-08, 'memory(GiB)': 20.14, 'train_speed(iter/s)': 2.450407, 'epoch': 1.0, 'global_step/max_steps': '11035/11043', 'percentage': '99.93%', 'elapsed_time': '1h 14m 53s', 'remaining_time': '3s'}
{'loss': 1.60251293, 'acc': 0.58529301, 'grad_norm': 5.59085035, 'learning_rate': 1e-08, 'memory(GiB)': 20.14, 'train_speed(iter/s)': 2.448038, 'epoch': 1.0, 'global_step/max_steps': '11040/11043', 'percentage': '99.97%', 'elapsed_time': '1h 15m 0s', 'remaining_time': '1s'}
Train: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 11043/11043 [1:15:03<00:00, 1.30s/it]
{'eval_loss': 1.61615264, 'eval_acc': 0.59993855, 'eval_runtime': 77.5104, 'eval_samples_per_second': 23.016, 'eval_steps_per_second': 2.877, 'epoch': 1.0, 'global_step/max_steps': '11043/11043', 'percentage': '100.00%', 'elapsed_time': '1h 16m 21s', 'remaining_time': '0s'}
Val: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 223/223 [01:16<00:00, 2.92it/s]
/home/ubuntu/anaconda3/envs/MiniCPM-V/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[INFO:swift] Saving model checkpoint to /home/ubuntu/disk2T_2/wzy/MiniCPM-V/output/minicpm-v-v2_6-chat/v49-20241113-182910/checkpoint-11043
{'train_runtime': 4588.4326, 'train_samples_per_second': 38.509, 'train_steps_per_second': 2.407, 'train_loss': 0.44095722, 'epoch': 1.0, 'global_step/max_steps': '11043/11043', 'percentage': '100.00%', 'elapsed_time': '1h 16m 28s', 'remaining_time': '0s'}
Train: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 11043/11043 [1:16:28<00:00, 2.41it/s]

During training the logged loss was 1.63498573, but the train_loss in the final summary is 0.44095722. Why are the two different?

Your hardware and system info
Write your system info here, such as CUDA version, operating system, GPU model, and torch version.

My training command is:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft --model_type minicpm-v-v2_6-chat --model_id_or_path OpenBMB/MiniCPM-V-2_6 --sft_type lora --quantization_bit 4 --target_regex "llm..*layers.\d+.self_attn.(q_proj|k_proj|v_proj|o_proj)" --dataset /home/ubuntu/disk2T_2/wzy/MiniCPM-V/data/open-vocabulary_relation_extraction_dataset/standard_train_no_history1760.jsonl --deepspeed zero2-offload --learning_rate 5e-5 --device_max_memory "23GB 23GB 23GB 23GB 23GB 23GB 23GB 23GB" --eval_steps 1000 --resume_from_checkpoint /home/ubuntu/disk2T_2/wzy/MiniCPM-V/output/minicpm-v-v2_6-chat/v47-20241113-135707/checkpoint-8000
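One possible explanation, sketched here as an assumption rather than a confirmed diagnosis: the run was resumed from checkpoint-8000, so only 11043 - 8000 = 3043 steps were executed in this session. If the final train_loss is computed as the loss accumulated since the resume divided by the total global step count (11043), the reported value would be scaled down by 3043/11043. The small check below inverts that scaling using only numbers from the log and the command; the function name and the formula are illustrative assumptions, not code taken from ms-swift.

```python
# Sketch under a stated assumption (not confirmed by the issue):
#   reported train_loss = (sum of step losses since resume) / global_step
# If that holds, the mean loss over the steps actually run is
#   reported_train_loss * global_step / (global_step - resumed_step)

def implied_mean_loss_since_resume(reported_train_loss, global_step, resumed_step):
    """Undo the hypothetical division by the full step count."""
    steps_run = global_step - resumed_step
    if steps_run <= 0:
        raise ValueError("global_step must be greater than resumed_step")
    return reported_train_loss * global_step / steps_run


reported_train_loss = 0.44095722  # 'train_loss' in the final summary line
global_step = 11043               # 'global_step/max_steps': '11043/11043'
resumed_step = 8000               # --resume_from_checkpoint .../checkpoint-8000

mean_loss = implied_mean_loss_since_resume(
    reported_train_loss, global_step, resumed_step
)
print(f"implied mean loss over the resumed steps: {mean_loss:.4f}")
```

The implied value comes out at roughly 1.60, which is in line with the per-step losses (1.63, 1.60) and the eval_loss (1.62) in the log. That agreement is suggestive only; confirming it would require checking how the trainer in use accumulates the loss after a resume.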

Additional context
Add any other context about the problem here.

modelscope / ms-swift

train_loss does not match the loss logged during training #2448