modelscope / ms-swift

Use PEFT or Full-parameter to finetune 350+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

grad_norm nan #2280

Open Echo0125 opened 1 week ago

Echo0125 commented 1 week ago

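Training qwen2-vl 7b on a single node with 8 GPUs runs normally, with no errors at any step. After switching to 2 nodes with 16 GPUs, the worker node first emits this warning: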

It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
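
For readers unfamiliar with this warning, here is a minimal sketch of what "rescaling already rescaled images" means numerically. It uses plain Python lists rather than a real image processor; the helper names `rescale` and `looks_already_rescaled` are hypothetical, and the divisor of 255 is an assumption based on 8-bit images. Whether this warning is related to the nan below is not established in this thread.

```python
# Sketch only: mimics a rescale step that maps 8-bit pixels into [0, 1].
# The divisor 255 is an assumption for 8-bit images.

def rescale(pixels, divisor=255.0):
    """Divide every pixel value by the divisor."""
    return [p / divisor for p in pixels]

def looks_already_rescaled(pixels):
    """Heuristic (assumption): all values in [0, 1] suggests a prior rescale."""
    return all(0.0 <= p <= 1.0 for p in pixels)

raw = [0, 128, 255]      # 8-bit pixel values
once = rescale(raw)      # values now lie in [0, 1]
twice = rescale(once)    # rescaled a second time: values collapse toward 0

print(looks_already_rescaled(raw))   # False
print(looks_already_rescaled(once))  # True
print(max(twice))                    # roughly 0.0039 instead of 1.0
```

If the inputs really are in [0, 1] before they reach the processor, the warning's own suggestion (`do_rescale=False`) avoids the second division.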

Then the loss suddenly spikes and grad_norm becomes nan:

{'loss': 1.80130756, 'acc': 0.56179339, 'grad_norm': 37.75608444, 'learning_rate': 0.0, 'memory(GiB)': 62.94, 'train_speed(iter/s)': 0.007755, 'epoch': 0.0, 'global_step/max_steps': '1/2169', 'percentage': '0.05%', 'elapsed_time': '1m 22s', 'remaining_time': '2d 1h 43m 27s'}
{'loss': 663.56591797, 'acc': 0.14326748, 'grad_norm': nan, 'learning_rate': 3.84e-06, 'memory(GiB)': 73.64, 'train_speed(iter/s)': 0.010765, 'epoch': 0.0, 'global_step/max_steps': '5/2169', 'percentage': '0.23%', 'elapsed_time': '6m 58s', 'remaining_time': '2d 2h 15m 22s'}
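
To pin down where the run first goes bad, the log lines can be scanned for a non-finite `loss` or `grad_norm`. The following is a rough sketch, not part of ms-swift: the helper names `parse` and `first_bad_step` are hypothetical, and the two log lines above are trimmed to the fields used. The lines are parsed with regular expressions because the bare `nan` token is not a valid Python literal.

```python
import math
import re

# The two log lines quoted above, trimmed to the fields used here.
LOG_LINES = [
    "{'loss': 1.80130756, 'acc': 0.56179339, 'grad_norm': 37.75608444, "
    "'learning_rate': 0.0, 'global_step/max_steps': '1/2169'}",
    "{'loss': 663.56591797, 'acc': 0.14326748, 'grad_norm': nan, "
    "'learning_rate': 3.84e-06, 'global_step/max_steps': '5/2169'}",
]

FIELD = re.compile(r"'(loss|grad_norm)': ([^,}]+)")
STEP = re.compile(r"'global_step/max_steps': '(\d+)/\d+'")

def parse(line):
    """Extract step, loss and grad_norm; float() accepts the bare 'nan' token."""
    values = {key: float(raw) for key, raw in FIELD.findall(line)}
    values["step"] = int(STEP.search(line).group(1))
    return values

def first_bad_step(lines):
    """Return the first logged step with a non-finite loss or grad_norm, else None."""
    for record in map(parse, lines):
        if not (math.isfinite(record["loss"]) and math.isfinite(record["grad_norm"])):
            return record["step"]
    return None

print(first_bad_step(LOG_LINES))  # 5
```

Step 5 is only the first logged step with a nan; since steps 2 to 4 are not shown in the log, the divergence may have started earlier.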

The machines are H800s, and the launch config is:

NUM_NODE=$1
RANK=$2
MASTER_ADDR_=$3

echo "MASTER_ADDR=$MASTER_ADDR_"

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=${NUM_NODE} \
NODE_RANK=${RANK} \
MASTER_ADDR=${MASTER_ADDR_} \
NPROC_PER_NODE=8 \
swift sft \
  --model_type qwen2-vl-7b-instruct \
  --model_id_or_path ./checkpoints/Qwen2-VL-7B-Instruct \
  --sft_type full \
  --dataset ./train.json \
  --ddp_backend nccl \
  --warmup_ratio 0.03 \
  --weight_decay 0.1 \
  --deepspeed default-zero2 \
  --batch_size 1 \
  --gradient_accumulation_steps 8 \
  --save_strategy epoch \
  --use_flash_attn True \
  --eval_steps 500 \
  --val_dataset ./val.json \
  --save_total_limit 1 \
  --num_train_epochs 1 \
  --learning_rate 2e-5 \
  --lazy_tokenize true \
  --save_only_model true \
  --tuner_backend swift \
  --dtype bf16 \
  --save_steps 5000
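
One thing that changes between the two setups is the global batch size per optimizer step. A quick arithmetic sketch, assuming `--batch_size` is the per-device batch size and that the single-node run used the same flags with only the node count changed (the function name `global_batch_size` is hypothetical):

```python
def global_batch_size(per_device_batch, grad_accum_steps, gpus_per_node, num_nodes):
    """Samples consumed per optimizer step under data parallelism."""
    return per_device_batch * grad_accum_steps * gpus_per_node * num_nodes

# Values taken from the launch script: batch_size=1, gradient_accumulation_steps=8,
# NPROC_PER_NODE=8. The node counts (1 vs 2) come from the description above.
single_node = global_batch_size(1, 8, 8, 1)
two_nodes = global_batch_size(1, 8, 8, 2)

print(single_node)  # 64
print(two_nodes)    # 128
```

Under these assumptions the learning rate of 2e-5 stays the same while the global batch doubles. That alone does not explain a nan, but it is worth keeping in mind when comparing the two runs.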

yufei1900 commented 5 hours ago

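I ran into a similar phenomenon: the loss still decreases, but grad_norm is very high and eventually becomes nan. I also hit this with internvl2-8b. (Attached screenshot: screenshot-20241025-170356)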