Closed: yuki-no-hana closed this issue 2 years ago
The nan in src.loss_imnet_feat_dist is a logging problem, not a training issue. The provided checkpoint for GTA->Cityscapes (https://drive.google.com/file/d/1pG3kDClZDGwp1vSTEXmTchkGHmnLQNdP/view?usp=sharing) shows the same behavior.
The nan loss value occurs when the thing-class mask is empty, which can happen for samples without sufficiently large thing-class regions. In that case, the mean of an empty tensor is computed, which is nan: https://github.com/lhoyer/DAFormer/blob/4447661ab92488fb5ece5dd69fb3f7eb62f930c4/mmseg/models/uda/dacs.py#L155 Since the logging aggregates the values over 50 steps, the logged value is nan as soon as a single training step within that window has a nan loss, which is quite likely.
However, if the masked tensor is empty, there is no connection in the compute graph to the network weights, so the network gradients are zero and no weight update happens in this case. You can also verify this experimentally using print_grad_magnitude.
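For illustration, here is a minimal stand-alone PyTorch snippet (not taken from the repository) that reproduces both effects: the mean over an empty boolean selection is nan, yet no gradient flows back to the tensor it was selected from:

```python
import torch

feat = torch.randn(8, requires_grad=True)   # stand-in for the per-pixel feature distances
mask = torch.zeros(8, dtype=torch.bool)     # empty thing-class mask (no pixels selected)

loss = feat[mask].mean()                    # mean of an empty tensor -> nan
loss.backward()

print(loss)       # tensor(nan, grad_fn=<MeanBackward0>)
print(feat.grad)  # tensor([0., 0., 0., 0., 0., 0., 0., 0.]) -> zero gradients, no weight update
```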
If you want to log the feature distance loss properly, you can set it to zero when pw_feat_dist is empty after masking, for example as sketched below.
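A minimal sketch of such a guard, assuming the per-pixel distances and the thing-class mask are available as pw_feat_dist and fdist_mask (the function name and surrounding structure are illustrative, not the actual dacs.py code):

```python
import torch

def masked_feat_dist(pw_feat_dist: torch.Tensor, fdist_mask: torch.Tensor) -> torch.Tensor:
    """Mean feature distance over the thing-class mask; zero if the mask is empty."""
    masked = pw_feat_dist[fdist_mask]
    if masked.numel() == 0:
        # No thing-class pixels in this crop: return a zero that is still attached
        # to the compute graph, so backward() works and the logged value stays finite.
        return pw_feat_dist.sum() * 0
    return masked.mean()
```

Returning pw_feat_dist.sum() * 0 instead of a fresh constant keeps the loss connected to the network outputs while contributing exactly zero gradient, which matches the behavior described above (no weight update) but logs 0 instead of nan.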
Thanks a lot.
Thanks a lot.
My decode.loss_seg is also nan, could you help me fix it?
This is a separate issue. Please continue the discussion in issue https://github.com/lhoyer/DAFormer/issues/22.
Thanks for your great work. During training, src.loss_imnet_feat_dist is nan from the beginning. Is this expected?
2022-02-19 08:15:52,120 - mmseg - INFO - Iter [50/40000] lr: 1.958e-06, eta: 1 day, 2:59:33, time: 2.432, data_time: 0.081, memory: 9808, decode.loss_seg: 2.6839, decode.acc_seg: 10.5387, src.loss_imnet_feat_dist: nan, mix.decode.loss_seg: 1.4047, mix.decode.acc_seg: 19.1339
2022-02-19 08:17:38,012 - mmseg - INFO - Iter [100/40000] lr: 3.950e-06, eta: 1 day, 1:12:58, time: 2.118, data_time: 0.033, memory: 9808, decode.loss_seg: 2.3862, decode.acc_seg: 47.4850, src.loss_imnet_feat_dist: nan, mix.decode.loss_seg: 1.3233, mix.decode.acc_seg: 41.3826
2022-02-19 08:19:28,666 - mmseg - INFO - Iter [150/40000] lr: 5.938e-06, eta: 1 day, 0:57:19, time: 2.213, data_time: 0.034, memory: 9808, decode.loss_seg: 2.0347, decode.acc_seg: 62.5967, src.loss_imnet_feat_dist: nan, mix.decode.loss_seg: 1.0585, mix.decode.acc_seg: 59.3449
2022-02-19 08:21:15,993 - mmseg - INFO - Iter [200/40000] lr: 7.920e-06, eta: 1 day, 0:37:33, time: 2.147, data_time: 0.033, memory: 9808, decode.loss_seg: 1.6078, decode.acc_seg: 68.1829, src.loss_imnet_feat_dist: nan, mix.decode.loss_seg: 0.7838, mix.decode.acc_seg: 68.8032
2022-02-19 08:23:03,135 - mmseg - INFO - Iter [250/40000] lr: 9.898e-06, eta: 1 day, 0:24:29, time: 2.143, data_time: 0.032, memory: 9808, decode.loss_seg: 1.3028, decode.acc_seg: 68.6837, src.loss_imnet_feat_dist: nan, mix.decode.loss_seg: 0.6529, mix.decode.acc_seg: 70.9704
2022-02-19 08:24:50,133 - mmseg - INFO - Iter [300/40000] lr: 1.187e-05, eta: 1 day, 0:14:51, time: 2.140, data_time: 0.034, memory: 9808, decode.loss_seg: 1.0986, decode.acc_seg: 70.4091, src.loss_imnet_feat_dist: nan, mix.decode.loss_seg: 0.5765, mix.decode.acc_seg: 72.7845
2022-02-19 08:26:36,420 - mmseg - INFO - Iter [350/40000] lr: 1.384e-05, eta: 1 day, 0:06:07, time: 2.126, data_time: 0.031, memory: 9808, decode.loss_seg: 0.9639, decode.acc_seg: 71.2223, src.loss_imnet_feat_dist: nan, mix.decode.loss_seg: 0.5049, mix.decode.acc_seg: 75.3486
Looking forward to your reply!