scu-zjz / IMDLBenCo

[NeurIPS'24 Spotlight] A comprehensive benchmark & codebase for Image manipulation detection/localization.
https://scu-zjz.github.io/IMDLBenCo-doc

Training Issues with TruFor #42

Open iamwangyabin opened 3 hours ago

iamwangyabin commented 3 hours ago

After installing the package, I attempted to train TruFor using the default config. However, I ran into significant issues when trying to train on more than 2 GPUs: the training process frequently breaks down without providing any error information.

When I finally managed to train the model on 2 H100 GPUs, I observed NaN losses occurring intermittently during training, even though GradScaler is supposed to skip steps with NaN values. Below is an example from the training log:


1384 [03:15:23.681523] Epoch: [2300/3613] eta: 0:33:55 lr: 0.000001 loss_ce: 0.3530 (nan) dice_loss: 0.3622 (0.3642) combined_loss: 0.3518 (nan) time: 1.5536 data: 0.0002 max mem: 55633
1385 [03:15:54.706125] Epoch: [2320/3613] eta: 0:33:24 lr: 0.000001 loss_ce: 0.3381 (nan) dice_loss: 0.2943 (0.3640) combined_loss: 0.2975 (nan) time: 1.5511 data: 0.0002 max mem: 55633
...
1392 [03:19:31.109580] Epoch: [2460/3613] eta: 0:29:47 lr: 0.000001 loss_ce: 0.3052 (nan) dice_loss: 0.2988 (0.3629) combined_loss: 0.3212 (nan) time: 1.5526 data: 0.0002 max mem: 55633
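
For context, here is a minimal sketch of the generic torch.cuda.amp pattern (not necessarily IMDLBenCo's exact training loop) under which GradScaler silently skips the optimizer step when the unscaled gradients contain inf/NaN. Note that the value in parentheses in the log appears to be the metric logger's running average, so a single NaN loss keeps it displayed as nan from then on, even if the corresponding optimizer steps were skipped.

```python
# Minimal AMP sketch (illustrative only, not IMDLBenCo's actual loop).
# scaler.step() checks the unscaled gradients and silently skips
# optimizer.step() when it finds inf/NaN; scaler.update() then lowers the scale.
import torch

model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    x = torch.randn(4, 8, device="cuda")
    y = torch.randn(4, 1, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped internally if gradients contain inf/NaN
    scaler.update()          # shrinks the loss scale after a skipped step
```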

I've also noticed that the reported Image-level Accuracy values are greater than 1, which should be impossible for accuracy metrics. Here's an example from the log:


2356 [09:08:44.806920] Test: [4]  [1220/1355]  eta: 0:00:58  pixel-level F1: 3.6400 (0.2459)  pixel-level Accuracy: 15.1577 (0.9424)  time: 0.4322  data: 0.0002  max mem: 55633
2357 [09:08:53.450127] Test: [4]  [1240/1355]  eta: 0:00:50  pixel-level F1: 3.2157 (0.2455)  pixel-level Accuracy: 15.1957 (0.9425)  time: 0.4321  data: 0.0002  max mem: 55633
2358 [09:09:02.094049] Test: [4]  [1260/1355]  eta: 0:00:41  pixel-level F1: 3.6974 (0.2455)  pixel-level Accuracy: 14.7980 (0.9423)  time: 0.4322  data: 0.0002  max mem: 55633
2359 [09:09:10.735877] Test: [4]  [1280/1355]  eta: 0:00:32  pixel-level F1: 4.0385 (0.2458)  pixel-level Accuracy: 14.9980 (0.9424)  time: 0.4321  data: 0.0001  max mem: 55633
2360 [09:09:19.373706] Test: [4]  [1300/1355]  eta: 0:00:23  pixel-level F1: 3.4870 (0.2458)  pixel-level Accuracy: 15.2521 (0.9424)  time: 0.4319  data: 0.0002  max mem: 55633
2361 [09:09:28.008097] Test: [4]  [1320/1355]  eta: 0:00:15  pixel-level F1: 3.5316 (0.2453)  pixel-level Accuracy: 15.1872 (0.9424)  time: 0.4317  data: 0.0001  max mem: 55633
2362 [09:09:36.643513] Test: [4]  [1340/1355]  eta: 0:00:06  pixel-level F1: 3.8572 (0.2452)  pixel-level Accuracy: 15.2518 (0.9426)  time: 0.4317  data: 0.0001  max mem: 55633
2363 [09:09:42.688089] Test: [4]  [1354/1355]  eta: 0:00:00  pixel-level F1: 3.8572 (0.2451)  pixel-level Accuracy: 15.1298 (0.9426)  time: 0.4317  data: 0.0001  max mem: 55633
2364 [09:09:42.862711] Test: [4] Total time: 0:09:51 (0.4362 s / it)
2365 [09:09:42.863425] ***************************************************************
2366 [09:09:42.863506] ****An extra tail dataset should exist for accracy metrics!****
2367 [09:09:42.863562] ***************************************************************
2368 [09:09:42.863615] **** Length of tail: 5 ****
2369 [09:09:43.297684] ====================
2370 [09:09:43.298023] A batch that is not fully loaded was detected at the end of the dataset. The actual batch size for this batch is 5: The default batch size is 16
2371 [09:09:43.298088] ====================
2372 [09:09:43.298470] Actual Batchsize/ world_size {'_n': 2.5}
2373 [09:09:43.298647] {'pixel-level F1': tensor(0., device='cuda:0', dtype=torch.float64)}
2374 [09:09:43.328352] Actual Batchsize/ world_size {'_n': 2.5}
2375 [09:09:43.328514] {'pixel-level Accuracy': tensor(2.4969, device='cuda:0', dtype=torch.float64)}
2376 [09:09:43.330867] Test <remaining>: [4]  [0/1]  eta: 0:00:00  pixel-level F1: 3.8376 (0.2451)  pixel-level Accuracy: 15.0991 (0.9426)  time: 0.4655  data: 0.3176  max mem: 55633
2377 [09:09:43.331108] Test <remaining>: [4] Total time: 0:00:00 (0.4661 s / it)
2378 [09:09:45.373779] ---syncronized---
2379 [09:09:45.374095] pixel-level F1 reduced_count 43365
2380 [09:09:45.374198] pixel-level F1 reduced_sum 10573.642782675543
2381 [09:09:45.374291] image-level F1 reduced_count 2
2382 [09:09:45.374378] image-level F1 reduced_sum 1.0861882453763827
2383 [09:09:45.374466] pixel-level Accuracy reduced_count 43365
2384 [09:09:45.374551] pixel-level Accuracy reduced_sum 40856.14878845215
2385 [09:09:45.374638] image-level Accuracy reduced_count 2
2386 [09:09:45.374729] image-level Accuracy reduced_sum 15.531676273922066
2387 [09:09:45.374817] ---syncronized done ---
2388 [09:09:49.053563] Averaged stats: pixel-level F1: 3.8376 (0.2438)  pixel-level Accuracy: 15.0991 (0.9421)  image-level F1: 0.5431 (0.5431)  image-level Accuracy: 7.7658 (7.7658)
2389 [09:09:49.059104] Best pixel-level F1 = 0.24382895843826918

The image-level Accuracy is reported as 7.7658, which is not possible for a standard accuracy metric, whose value should range from 0 to 1.

iamwangyabin commented 3 hours ago

I have tested the MVSS training config. MVSS does not have the above problems: I can train with more than 4 GPUs without error, and the metrics look good.

dddb11 commented 2 hours ago

Are you using the default training .sh script in IMDLBenCo, and do you encounter NaN in every training run? Based on my experience training TruFor, the dice loss can sometimes be unstable, so we use a smaller learning rate. Also, if you have larger GPU memory (and therefore a larger per-GPU batch size), you may need to adjust the learning rate accordingly.
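
On the dice-loss instability: as an illustration only (TruFor's actual implementation in IMDLBenCo may differ), a smoothed dice loss keeps the denominator away from zero, which is one common source of NaNs, especially under fp16 autocast:

```python
# Generic smoothed dice loss (illustrative sketch, not IMDLBenCo's code).
# The `smooth` term keeps the denominator strictly positive even for empty masks.
import torch

def dice_loss(logits: torch.Tensor, target: torch.Tensor, smooth: float = 1.0) -> torch.Tensor:
    """logits: (N, 1, H, W) raw predictions; target: (N, 1, H, W) binary mask."""
    prob = torch.sigmoid(logits).flatten(1)
    target = target.flatten(1).float()
    intersection = (prob * target).sum(dim=1)
    denominator = prob.sum(dim=1) + target.sum(dim=1)
    dice = (2.0 * intersection + smooth) / (denominator + smooth)
    return 1.0 - dice.mean()

# Example usage on a random batch of two 64x64 masks
loss = dice_loss(torch.randn(2, 1, 64, 64), torch.randint(0, 2, (2, 1, 64, 64)))
print(loss)
```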

iamwangyabin commented 2 hours ago

> Are you using the default training .sh script in IMDLBenCo, and do you encounter NaN in every training run? Based on my experience training TruFor, the dice loss can sometimes be unstable, so we use a smaller learning rate. Also, if you have larger GPU memory (and therefore a larger per-GPU batch size), you may need to adjust the learning rate accordingly.

Yes, I understand that, and it's not a significant issue, since GradScaler can skip these NaN losses during backpropagation.

iamwangyabin commented 19 minutes ago

I observed that the count for the image-level metrics (e.g., "image-level Accuracy") is sometimes only 1, which is why the reported accuracy ends up larger than 1. The total number of test images is 1000, and the pixel-level metrics are correct.

[12:39:12.063887] defaultdict(<class 'IMDLBenCo.training_scripts.utils.misc.SmoothedValue'>, {'pixel-level F1': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242c50>, 'pixel-level Accuracy': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242fe0>, 'image-level F1': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502218a30>, 'image-level Accuracy': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502218bb0>})
(Pdb) metric_logger.meters['pixel-level Accuracy'#]
[12:39:29.361305] *** SyntaxError: '[' was never closed
(Pdb) metric_logger.meters['pixel-level Accuracy']
[12:39:32.102573] <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242fe0>
(Pdb) metric_logger.meters['pixel-level Accuracy'].count
[12:39:44.795894] 1000
(Pdb) metric_logger.meters['pixel-level Accuracy'].total
[12:39:51.632652] 286.1601448059082
(Pdb) metric_logger.meters['image-level Accuracy'].count
[12:40:10.770748] 1
(Pdb) metric_logger.meters['image-level Accuracy'].total
[12:40:18.716873] 8.168

I think the problem is here: https://github.com/scu-zjz/IMDLBenCo/blob/2ef150e46dfc886d6e7dc393ddac85cf505e1f46/IMDLBenCo/training_scripts/utils/misc.py#L42
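
For what it's worth, here is a hypothetical sketch (not the actual SmoothedValue code) of how this kind of meter ends up reporting an accuracy above 1 when update() receives a batch sum while the sample count is left at the default n=1, which would be consistent with the count == 1 / total == 8.168 values from the pdb session above:

```python
# Hypothetical SmoothedValue-style meter (illustrative only, not IMDLBenCo's code).
class Meter:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # `value` is expected to be a per-sample average; `n` the number of samples.
        self.total += value * n
        self.count += n

    @property
    def global_avg(self):
        return self.total / self.count


batch_acc = [1.0, 1.0, 0.8, 0.9, 1.0, 0.9, 0.8, 0.768]  # 8 images, each within [0, 1]

# Correct usage: pass the batch mean together with n = batch size.
ok = Meter()
ok.update(sum(batch_acc) / len(batch_acc), n=len(batch_acc))
print(ok.global_avg)      # 0.896 -- a valid accuracy

# Buggy usage: pass the batch *sum* with the default n=1.
broken = Meter()
broken.update(sum(batch_acc))
print(broken.global_avg)  # 7.168 -- an "accuracy" greater than 1
```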