Open Echo0125 opened 1 week ago
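Training qwen2-vl 7b on a single node with 8 GPUs runs normally, with no errors at any step. After switching to 2 nodes with 16 GPUs, the worker node first reports this warning: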
It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
After that, the loss suddenly spikes and grad_norm becomes nan:
{'loss': 1.80130756, 'acc': 0.56179339, 'grad_norm': 37.75608444, 'learning_rate': 0.0, 'memory(GiB)': 62.94, 'train_speed(iter/s)': 0.007755, 'epoch': 0.0, 'global_step/max_steps': '1/2169', 'percentage': '0.05%', 'elapsed_time': '1m 22s', 'remaining_time': '2d 1h 43m 27s'}
{'loss': 663.56591797, 'acc': 0.14326748, 'grad_norm': nan, 'learning_rate': 3.84e-06, 'memory(GiB)': 73.64, 'train_speed(iter/s)': 0.010765, 'epoch': 0.0, 'global_step/max_steps': '5/2169', 'percentage': '0.23%', 'elapsed_time': '6m 58s', 'remaining_time': '2d 2h 15m 22s'}
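To pin down the first step where training goes wrong, a log scan along these lines may help. This is only a rough sketch: `parse_log_line`, `first_bad_step`, and the `loss_limit` threshold of 100 are hypothetical names and values chosen for illustration, not part of swift, and the sample lines are the two log entries above with most fields trimmed.

```python
import math
import re

# Pull the numeric loss / grad_norm values and the current step out of one log line.
_FIELD = re.compile(r"'(loss|grad_norm)': ([^,}]+)")
_STEP = re.compile(r"'global_step/max_steps': '(\d+)/(\d+)'")


def parse_log_line(line):
    """Return (step, loss, grad_norm); any missing value comes back as None."""
    fields = {name: float(value) for name, value in _FIELD.findall(line)}
    match = _STEP.search(line)
    step = int(match.group(1)) if match else None
    return step, fields.get("loss"), fields.get("grad_norm")


def first_bad_step(lines, loss_limit=100.0):
    """Return the first step whose loss or grad_norm is non-finite,
    or whose loss exceeds loss_limit (an arbitrary illustrative threshold)."""
    for line in lines:
        step, loss, grad_norm = parse_log_line(line)
        if loss is None or grad_norm is None:
            continue  # not a training log line
        if not math.isfinite(loss) or not math.isfinite(grad_norm) or loss > loss_limit:
            return step
    return None


# The two entries from the log above, with unrelated fields removed.
sample = [
    "{'loss': 1.80130756, 'acc': 0.56179339, 'grad_norm': 37.75608444, "
    "'global_step/max_steps': '1/2169'}",
    "{'loss': 663.56591797, 'acc': 0.14326748, 'grad_norm': nan, "
    "'global_step/max_steps': '5/2169'}",
]
print(first_bad_step(sample))
```

Note that the bare `nan` in the log is why a regex is used here instead of `ast.literal_eval`, which would reject the line.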
The machines are H800s, and the config is:
NUM_NODE=$1
RANK=$2
MASTER_ADDR_=$3
echo "MASTER_ADDR=$MASTER_ADDR_"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=${NUM_NODE} \
NODE_RANK=${RANK} \
MASTER_ADDR=${MASTER_ADDR_} \
NPROC_PER_NODE=8 \
swift sft \
    --model_type qwen2-vl-7b-instruct \
    --model_id_or_path ./checkpoints/Qwen2-VL-7B-Instruct \
    --sft_type full \
    --dataset ./train.json \
    --ddp_backend nccl \
    --warmup_ratio 0.03 \
    --weight_decay 0.1 \
    --deepspeed default-zero2 \
    --batch_size 1 \
    --gradient_accumulation_steps 8 \
    --save_strategy epoch \
    --use_flash_attn True \
    --eval_steps 500 \
    --val_dataset ./val.json \
    --save_total_limit 1 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --lazy_tokenize true \
    --save_only_model true \
    --tuner_backend swift \
    --dtype bf16 \
    --save_steps 5000
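One thing that changes between the two setups is the effective global batch size. The arithmetic below is a minimal sketch that assumes `--batch_size` is the per-device batch size and that all processes run plain data parallelism; `effective_batch_size` is a hypothetical helper, not a swift function. Whether this difference has anything to do with the nan is not established here.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, nodes, procs_per_node):
    """Samples contributing to one optimizer step under plain data parallelism.

    Assumes per_device_batch is the per-process batch size (an assumption
    about how --batch_size is interpreted).
    """
    return per_device_batch * grad_accum_steps * nodes * procs_per_node


# Values taken from the config above: --batch_size 1, --gradient_accumulation_steps 8,
# NPROC_PER_NODE=8, and either 1 or 2 nodes.
single_node = effective_batch_size(1, 8, nodes=1, procs_per_node=8)
two_nodes = effective_batch_size(1, 8, nodes=2, procs_per_node=8)
print(single_node, two_nodes)  # 64 128
```

Under that assumption, the 2-node run takes optimizer steps over twice as many samples as the single-node run while keeping the same `--learning_rate 2e-5`.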
I ran into something similar: the loss still decreases, but grad_norm is very high and eventually becomes nan. I also hit this with internvl2-8b.