I have tested the MVSS training config. MVSS does not have the above problems; I can train with more than 4 GPUs without error, and the metrics look good.
Do you use the default training script (`.sh`) in IMDLBenCo and encounter NaN in every training run? Based on my experience training TruFor, dice loss can sometimes be unstable, so we use a smaller learning rate. Also, if you have larger GPU memory, you may need to adjust the learning rate accordingly.
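For reference, a minimal sketch of the kind of soft dice loss I mean (illustrative only, not IMDLBenCo's exact implementation; the `smooth` term and shapes are assumptions). When both the prediction and the target mask are near-empty, the denominator shrinks toward the smoothing term and the gradients can spike, which is why a smaller learning rate helps:

```python
import torch

def soft_dice_loss(pred, target, smooth=1e-6):
    # pred: sigmoid probabilities, target: binary mask, shape (N, H, W)
    pred = pred.flatten(1)
    target = target.flatten(1)
    intersection = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    # Near-empty masks make the denominator approach `smooth`,
    # so gradients can spike -- one source of the instability.
    dice = (2 * intersection + smooth) / (union + smooth)
    return 1 - dice.mean()
```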
Yes, I understand that, and it's not a significant issue, since GradScaler can skip these NaN losses during backpropagation.
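For context, this is the standard AMP pattern I'm relying on (a generic sketch; `model`, `criterion`, `optimizer`, and `loader` are placeholders, not IMDLBenCo's names). `scaler.step()` unscales the gradients, checks them for inf/NaN, and skips the optimizer update for that batch if any are found:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for images, masks in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(images), masks)
    scaler.scale(loss).backward()
    # step() skips the optimizer update if the unscaled grads contain inf/NaN.
    scaler.step(optimizer)
    scaler.update()  # lowers the loss scale after a skipped step
```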
I observed that the count for the image-level metrics (e.g., "image-level Accuracy") is sometimes only 1, which is why I got an accuracy larger than 1. The total number of test images is 1000, and the pixel-level metrics are correct.
```
[12:39:12.063887] defaultdict(<class 'IMDLBenCo.training_scripts.utils.misc.SmoothedValue'>, {'pixel-level F1': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242c50>, 'pixel-level Accuracy': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242fe0>, 'image-level F1': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502218a30>, 'image-level Accuracy': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502218bb0>})
(Pdb) metric_logger.meters['pixel-level Accuracy']
[12:39:32.102573] <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242fe0>
(Pdb) metric_logger.meters['pixel-level Accuracy'].count
[12:39:44.795894] 1000
(Pdb) metric_logger.meters['pixel-level Accuracy'].total
[12:39:51.632652] 286.1601448059082
(Pdb) metric_logger.meters['image-level Accuracy'].count
[12:40:10.770748] 1
(Pdb) metric_logger.meters['image-level Accuracy'].total
[12:40:18.716873] 8.168
```
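To make the mismatch concrete, dividing each meter's total by its count shows that the pixel-level average is sane while the image-level one is not:

```python
>>> 286.1601448059082 / 1000  # pixel-level: total / count, a valid mean
0.2861601448059082
>>> 8.168 / 1                 # image-level: count never grew past 1
8.168
```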
I think the problem is here: https://github.com/scu-zjz/IMDLBenCo/blob/2ef150e46dfc886d6e7dc393ddac85cf505e1f46/IMDLBenCo/training_scripts/utils/misc.py#L42
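My guess at the failure mode, as a simplified sketch (this mirrors the torchvision-style `SmoothedValue`, not necessarily IMDLBenCo's exact code): `update(value, n=1)` defaults `n` to 1, so if the image-level metric is updated once with a value summed over many images while `n` stays at its default, `global_avg = total / count` divides by 1 and can exceed 1:

```python
class SmoothedValue:  # simplified sketch of the misc.py pattern
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.count += n
        self.total += value * n

    @property
    def global_avg(self):
        return self.total / self.count

meter = SmoothedValue()
meter.update(8.168)      # suspected bug: a summed value with the default n=1
print(meter.global_avg)  # 8.168 -- the impossible "accuracy"
```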
After installing the package, I attempted to train TruFor using the default config. However, I encountered significant issues when training on more than 2 GPUs: the training process frequently breaks down without providing any error information.
When I finally managed to train the model on 2 H100 GPUs, I observed NaN losses occurring intermittently during training, even though GradScaler is supposed to skip NaN values. Below is an example from the training log:
I've also noticed that the reported Image-level Accuracy values are greater than 1, which should be impossible for accuracy metrics. Here's an example from the log:
The image-level Accuracy is reported as 7.7600, which is impossible for a standard accuracy metric that ranges from 0 to 1.
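For what it's worth, the fix I'd expect (a hypothetical call site, since I haven't traced the exact line): update the image-level meter with the per-batch mean and pass the batch size as `n`, so the count accumulates to the number of test images and the average stays in [0, 1].

```python
# Hypothetical call site inside the evaluation loop;
# `batch_mean_acc` and `batch_size` are placeholder names.
metric_logger.meters['image-level Accuracy'].update(batch_mean_acc, n=batch_size)
# With torchvision-style semantics (total += value * n, count += n),
# global_avg = (sum of per-image accuracies) / 1000, which lies in [0, 1].
```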