rayleizhu / BiFormer

[CVPR 2023] Official code release of our paper "BiFormer: Vision Transformer with Bi-Level Routing Attention"
https://arxiv.org/abs/2303.08810
MIT License

Segmentation model loss becomes NaN #47

Closed shuli12318 closed 3 months ago

shuli12318 commented 3 months ago

Hello, I'm running into a NaN loss problem while reproducing the code, and I'd like to ask whether I have misconfigured something.

1. My task is semantic segmentation, using the sfpn.biformer_small.py model and loading your biformer_small_best.pth weights. Training runs on 4 × 24 GB 4090 GPUs with a batch size of 8 per GPU (you used 8 GPUs with a batch size of 4 each), so I kept the same lr and iters as yours.
2. I don't use slurm; instead I launch multi-GPU training with the pytorch launcher. My script is below:

```bash
#!/usr/bin/env bash

PARTITION=mediasuper   # leftover from the slurm script, unused with the pytorch launcher
NOW=$(date '+%m-%d-%H:%M:%S')

CONFIG_DIR=configs/ade20k
MODEL=sfpn.biformer_small
CKPT=pretrained/biformer_small_best.pth
CONFIG=${CONFIG_DIR}/${MODEL}.py
JOB_NAME=${MODEL}

OUTPUT_DIR=../outputs/seg
WORK_DIR=${OUTPUT_DIR}/${MODEL}/${NOW}
mkdir -p ${WORK_DIR}

export PYTHONPATH="$(dirname $0)/..":$PYTHONPATH
export CUDA_VISIBLE_DEVICES=0,1,2,3

torchrun \
    --nproc_per_node=4 \
    --master_port=29501 \
    train.py --config=${CONFIG} \
    --launcher="pytorch" \
    --work-dir=${WORK_DIR} \
    --options model.pretrained=${CKPT} \
    &> ${WORK_DIR}/train.${JOB_NAME}.log &
```

3. After training reached 32000 iters, the loss became NaN. Restarting training, or reloading the 32000-iter .pth checkpoint and resuming, reproduces the same problem at exactly the same point. Since the first epoch had already finished, the data itself should be fine. I searched around on CSDN; could this be related to the training precision?

```
2024-04-09 17:27:05,483 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
2024-04-09 17:27:28,017 - mmseg - INFO - Iter [32050/80000] lr: 1.307e-04, eta: 5 days, 16:27:25, time: 2.407, data_time: 1.969, memory: 15179, decode.loss_ce: 0.2974, decode.acc_seg: 88.5982, loss: 0.2974
2024-04-09 17:27:50,423 - mmseg - INFO - Iter [32100/80000] lr: 1.305e-04, eta: 2 days, 23:47:03, time: 0.448, data_time: 0.013, memory: 15179, decode.loss_ce: 0.3082, decode.acc_seg: 88.0161, loss: 0.3082
2024-04-09 17:28:12,847 - mmseg - INFO - Iter [32150/80000] lr: 1.303e-04, eta: 2 days, 1:56:17, time: 0.448, data_time: 0.012, memory: 15179, decode.loss_ce: 0.3059, decode.acc_seg: 88.2027, loss: 0.3059
2024-04-09 17:28:37,074 - mmseg - INFO - Iter [32200/80000] lr: 1.302e-04, eta: 1 day, 15:04:37, time: 0.485, data_time: 0.011, memory: 15179, decode.loss_ce: nan, decode.acc_seg: 49.3611, loss: nan
2024-04-09 17:29:03,193 - mmseg - INFO - Iter [32250/80000] lr: 1.300e-04, eta: 1 day, 8:38:24, time: 0.522, data_time: 0.010, memory: 15179, decode.loss_ce: nan, decode.acc_seg: 15.7751, loss: nan
2024-04-09 17:29:29,717 - mmseg - INFO - Iter [32300/80000] lr: 1.298e-04, eta: 1 day, 4:21:26, time: 0.530, data_time: 0.011, memory: 15179, decode.loss_ce: nan, decode.acc_seg: 15.8742, loss: nan
2024-04-09 17:29:55,593 - mmseg - INFO - Iter [32350/80000] lr: 1.296e-04, eta: 1 day, 1:16:04, time: 0.518, data_time: 0.012, memory: 15179, decode.loss_ce: nan, decode.acc_seg: 17.5384, loss: nan
```
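When a loss turns NaN at a fixed iteration like this, two mitigations commonly tried in mmseg-based training are gradient clipping and ruling out fp16 overflow. Below is a minimal sketch of what such overrides could look like, assuming the config follows standard mmseg 0.x / mmcv conventions; the keys and values are illustrative, not the settings shipped with this repo.

```python
# Illustrative overrides for an mmseg 0.x config such as configs/ade20k/sfpn.biformer_small.py.
# Values are assumptions for demonstration, not the repo's defaults.

# Clip gradients so a single bad batch cannot blow up the weights.
optimizer_config = dict(grad_clip=dict(max_norm=5.0, norm_type=2))

# If mixed precision is enabled, overflow is a common cause of NaN losses;
# either remove the fp16 setting to train in fp32, or use a dynamic loss scale:
# fp16 = dict(loss_scale='dynamic')
```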

shuli12318 commented 3 months ago

I solved it. I reduced the batch size to 4, set lr=0.0001 and iters=160000, and the NaN disappeared. Moreover, mIoU = 49.3, slightly better than the paper.
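For reference, a rough sketch of how those changed hyperparameters would be expressed as config overrides, assuming standard mmseg 0.x keys (everything not shown inherits from the base sfpn.biformer_small.py settings):

```python
# Sketch of the changed hyperparameters only; key names assume standard mmseg 0.x configs.
data = dict(samples_per_gpu=4)                            # batch size 4 per GPU
optimizer = dict(lr=0.0001)                               # merged into the base optimizer dict
runner = dict(type='IterBasedRunner', max_iters=160000)   # train for 160k iterations
```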