valosekj opened this issue 8 months ago
also tagging @Nilser3
What do you mean by collapsing to zero? Was the class imbalance so high that the model output zeros everywhere after 1000 epochs? If so, what is the behavior of the best checkpoints (as opposed to the final checkpoint)?
> What do you mean by collapsing to zero? Was the class imbalance so high that the model output zeros everywhere after 1000 epochs?
The model was crashing to zero after 100-250 epochs, depending on the fold. See the training progress in this comment.
@valosekj ok. nnunet struggles with your second class (which I'm guessing is the lesion class).
Have you tried opening an issue or discussion on the nnunet repo? Last time I checked, the main contributor was still pretty active. He might have some good insights into this phenomenon, because that behavior is a little bit weird.
> @valosekj ok. nnunet struggles with your second class (which I'm guessing is the lesion class).
Exactly!
> Have you tried opening an issue or discussion on the nnunet repo? Last time I checked, the main contributor was still pretty active. He might have some good insights into this phenomenon, because that behavior is a little bit weird.
We solved the collapsing to zero by using the `nnUNetTrainerDiceCELoss_noSmooth` trainer based on these two nnunet threads (1, 2), as I tried to describe in the first comment. If the first comment is unclear, please let me know, and I will rephrase it.
This discussion aims to figure out what the smoothing term is responsible for and why removing it helped model training.
Does this only happen with region-based training? We trained a model on very small objects (without collapse), although the class imbalance was maybe less pronounced than yours.
> Does this only happen with region-based training? We trained a model on very small objects (without collapse), although the class imbalance was maybe less pronounced than yours.
`nnUNetTrainerDiceCELoss_noSmooth` for the multi-channel model helped; see here.

Looking into MONAI DiceLoss:
```python
f: torch.Tensor = 1.0 - (2.0 * intersection + self.smooth_nr) / (denominator + self.smooth_dr)
```
where

- `smooth_nr`: a small constant added to the numerator to avoid zero.
- `smooth_dr`: a small constant added to the denominator to avoid nan.

with default values

```python
smooth_nr: float = 1e-5,
smooth_dr: float = 1e-5,
```
This indicates that both "smoothing" terms in the MONAI implementation are basically just small constants allowing the division. This is in contrast with the "smoothing" term equal to `1` as used in keras, ivadomed, and nnunetv2.
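For reference, here is a minimal sketch of how these constants can be set explicitly when instantiating MONAI's `DiceLoss` (the shapes and values below are purely illustrative):

```python
import torch
from monai.losses import DiceLoss

# MONAI's defaults are already tiny constants (1e-5), but they can also be set explicitly.
loss_fn = DiceLoss(sigmoid=True, smooth_nr=1e-5, smooth_dr=1e-5)

pred = torch.randn(2, 1, 8, 8, 8)                # raw logits, shape (B, C, D, H, W)
gt = (torch.rand(2, 1, 8, 8, 8) > 0.99).float()  # sparse "lesion" ground truth

print(loss_fn(pred, gt))
```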
"smoothing" term equal to 1 as used in keras, ivadomed, and nnunetv2.
I think this is the major issue with the dice loss implementations in those packages. Having a big term (i.e., `1`) interferes with the loss calculation (and consequently the gradient signals passed through the network) when learning to segment small, heavily class-imbalanced objects (i.e., lesions).
Very interesting. In this comment, Fabian reports a similar problem on the LIDC dataset, which is a lesion segmentation task like yours. From my understanding, the Dice loss can fail in 2 ways:

1. `intersection = 0`: in this case, we would get a Dice of 0, regardless of whether the GT/pred are empty (in which case we would like to have a value of 1 instead of 0, as Charley mentioned, hence the smoothing term in the numerator).
2. `addition = 0` (in the denominator): this would give us a Dice of NaN, but the smoothness term makes that impossible.

Based on this, we can safely say the problematic part is the intersection. I think your dataset and the LIDC dataset are problematic because of this intersection term. Because your masks are mostly empty, the intersection is very close to 0 (remember the Dice loss takes a softmax as input, not a binary mask, so the intersection CAN be in [0, 1]). The signal is too weak and, as @naga-karthik mentioned, the `smoothness=1` term overshadows the weak signal you have inside the intersection term.
Maybe something to try would be to hardcode a different smoothness term in the dice computation. I reckon a smaller value would not make the training collapse. If that is the case, we could report it back to the nnunet guys, as they didn't seem to know what was going on.
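To make the "overshadowing" argument concrete, here is a small, package-agnostic sketch using the keras/ivadomed form of the soft Dice loss quoted above (the patch and lesion sizes are arbitrary):

```python
import torch

def soft_dice_loss(pred: torch.Tensor, gt: torch.Tensor, smooth: float) -> float:
    """Soft Dice loss in the keras/ivadomed form: a single smoothing term, no extra epsilon."""
    intersection = (pred * gt).sum()
    return float(1.0 - (2.0 * intersection + smooth) / (pred.sum() + gt.sum() + smooth))

# 64^3 patch containing a 2-voxel "lesion" (heavily class-imbalanced ground truth)
gt = torch.zeros(64, 64, 64)
gt.view(-1)[:2] = 1.0

empty_pred = torch.zeros_like(gt)   # model collapsed to predicting background everywhere
perfect_pred = gt.clone()           # ideal prediction

for smooth in (1.0, 1e-5):
    print(f"smooth={smooth}: "
          f"loss(empty pred)={soft_dice_loss(empty_pred, gt, smooth):.3f}, "
          f"loss(perfect pred)={soft_dice_loss(perfect_pred, gt, smooth):.3f}")

# With smooth=1, the all-zero prediction already scores ~0.67 instead of ~1.0 on this
# 2-voxel lesion, i.e. a third of the attainable loss reduction is granted "for free";
# with smooth=1e-5 the gap between "predict nothing" and "predict the lesion" stays ~1.0.
```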
Thanks for your thoughts, @hermancollin! I think we can safely proceed with how MONAI has implemented DiceLoss (i.e., setting the smoothing to a small constant such as `1e-5`), which should be small enough to work with lesion segmentation problems and with others where the object to segment is large.
> I think we can safely proceed with how MONAI has implemented DiceLoss (i.e., setting the smoothing to a small constant such as `1e-5`)
FYI, in that comment Fabian explicitly mentioned that 1e-5 may not work.
> The 1e-8 should probably not be there and it should use clip instead. No idea why this causes a problem with the default smooth of 1e-5 and does not cause problems with smooth=0.
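As a side note on the clip-vs-add distinction in that quote: adding a tiny constant shifts every denominator slightly, whereas clipping only takes effect when the denominator would otherwise be (near) zero. A minimal illustration (float64 is used here only so the tiny offset stays visible):

```python
import torch

den = torch.tensor([0.0, 1e-9, 0.5], dtype=torch.float64)

print(den + 1e-8)                 # every value is shifted by the constant
print(torch.clip(den, min=1e-8))  # only the (near-)zero values are raised to 1e-8
```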
Also, nnUNet does not actually use a `smooth` value of 1, but 1e-5. The default value is indeed defined as `smooth=1` in `__init__` here:
```python
class SoftDiceLoss(nn.Module):
    def __init__(self, apply_nonlin: Callable = None, batch_dice: bool = False, do_bg: bool = True, smooth: float = 1.,
```
However, the value actually used is defined here in the `nnUNetTrainer`, which is `1e-5`:
```python
def _build_loss(self):
    if self.label_manager.has_regions:
        loss = DC_and_BCE_loss({},
                               {'batch_dice': self.configuration_manager.batch_dice,
                                'do_bg': True, 'smooth': 1e-5, 'ddp': self.is_ddp},
                               use_ignore_label=self.label_manager.ignore_label is not None,
                               dice_class=MemoryEfficientSoftDiceLoss)
    else:
        loss = DC_and_CE_loss({'batch_dice': self.configuration_manager.batch_dice,
                               'smooth': 1e-5, 'do_bg': False, 'ddp': self.is_ddp}, {}, weight_ce=1, weight_dice=1,
                              ignore_label=self.label_manager.ignore_label, dice_class=MemoryEfficientSoftDiceLoss)
```
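For completeness, here is a sketch of how a custom trainer could hardcode a different `smooth` value by overriding `_build_loss`, following @hermancollin's suggestion above. It mirrors the construction shown above (and what `nnUNetTrainerDiceCELoss_noSmooth` does with `'smooth': 0`); the class name and the `1e-7` value are hypothetical, the import paths are assumptions based on the files linked in this issue, and the deep-supervision wrapping done by the original `_build_loss` is omitted here and would need to be kept:

```python
from nnunetv2.training.loss.compound_losses import DC_and_BCE_loss, DC_and_CE_loss
from nnunetv2.training.loss.dice import MemoryEfficientSoftDiceLoss
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer


class nnUNetTrainerDiceCELoss_customSmooth(nnUNetTrainer):  # hypothetical trainer name
    SMOOTH = 1e-7  # value to experiment with, between the default 1e-5 and 0

    def _build_loss(self):
        # Same construction as the default _build_loss above, with only 'smooth' changed.
        if self.label_manager.has_regions:
            loss = DC_and_BCE_loss(
                {},
                {'batch_dice': self.configuration_manager.batch_dice,
                 'do_bg': True, 'smooth': self.SMOOTH, 'ddp': self.is_ddp},
                use_ignore_label=self.label_manager.ignore_label is not None,
                dice_class=MemoryEfficientSoftDiceLoss)
        else:
            loss = DC_and_CE_loss(
                {'batch_dice': self.configuration_manager.batch_dice,
                 'smooth': self.SMOOTH, 'do_bg': False, 'ddp': self.is_ddp},
                {}, weight_ce=1, weight_dice=1,
                ignore_label=self.label_manager.ignore_label,
                dice_class=MemoryEfficientSoftDiceLoss)
        return loss
```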
This issue discusses differences in the implementation of the Dice Loss with and without the smoothing term.
## Background: why opening this issue/discussion

**tl;dr**: the `nnUNetTrainerDiceCELoss_noSmooth` trainer (i.e., without the smoothing term of the Dice loss) kept the model from collapsing to zero during lesion model training.

**Details**
Since the default `nnUNetTrainer` trainer was collapsing to zero when training the DCM (degenerative cervical myelopathy) lesion segmentation model, we tried `nnUNetTrainerDiceCELoss_noSmooth` (i.e., without the smoothing term of the Dice loss). This trainer was discovered by @naga-karthik in these two nnunet threads ([1](https://github.com/MIC-DKFZ/nnUNet/issues/1395#issuecomment-1778621176), [2](https://github.com/MIC-DKFZ/nnUNet/issues/812)). The trainer indeed helped, and the model was no longer collapsing to zero; see details in [this issue](https://github.com/ivadomed/model-seg-dcm/issues/1#issuecomment-1930151543). Note that DCM lesion segmentation presents a high class imbalance (lesions are small objects).
## Comparison of the default and `nnUNetTrainerDiceCELoss_noSmooth` trainers

**tl;dr**:

- the `nnUNetTrainer` trainer uses `smooth: float = 1.`
- `nnUNetTrainerDiceCELoss_noSmooth` uses `'smooth': 0`
**Details**
### nnunetv2 default trainer

The nnunetv2 default trainer uses `MemoryEfficientSoftDiceLoss` (see [L352-L362](https://github.com/MIC-DKFZ/nnUNet/blob/997804c7510634dc8fd83f1194b434c60815a93e/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py#L352-L362) in [nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py](https://github.com/MIC-DKFZ/nnUNet/blob/997804c7510634dc8fd83f1194b434c60815a93e/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py)). This `MemoryEfficientSoftDiceLoss` (see [L58](https://github.com/MIC-DKFZ/nnUNet/blob/997804c7510634dc8fd83f1194b434c60815a93e/nnunetv2/training/loss/dice.py#L58) in [nnunetv2/training/loss/dice.py](https://github.com/MIC-DKFZ/nnUNet/blob/997804c7510634dc8fd83f1194b434c60815a93e/nnunetv2/training/loss/dice.py)) uses **_both_** smoothing term (`self.smooth`) and small constant (`1e-8`); see [L116](https://github.com/MIC-DKFZ/nnUNet/blob/997804c7510634dc8fd83f1194b434c60815a93e/nnunetv2/training/loss/dice.py#L116):

```python
dc = (2 * intersect + self.smooth) / (torch.clip(sum_gt + sum_pred + self.smooth, 1e-8))
```

---

### nnunetv2 `nnUNetTrainerDiceCELoss_noSmooth` trainer

The nnunetv2 `nnUNetTrainerDiceCELoss_noSmooth` trainer (see [L32](https://github.com/MIC-DKFZ/nnUNet/blob/997804c7510634dc8fd83f1194b434c60815a93e/nnunetv2/training/nnUNetTrainer/variants/loss/nnUNetTrainerDiceLoss.py#L32) in [nnunetv2/training/nnUNetTrainer/variants/loss/nnUNetTrainerDiceLoss.py](https://github.com/MIC-DKFZ/nnUNet/blob/997804c7510634dc8fd83f1194b434c60815a93e/nnunetv2/training/nnUNetTrainer/variants/loss/nnUNetTrainerDiceLoss.py)) sets `smooth` to `0`. The small constant (`1e-8`) is apparently untouched and kept.
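To see how the two settings differ numerically, the expression from `dice.py` quoted above can be evaluated directly on a few corner cases (pure arithmetic, no nnunet code involved):

```python
import torch

def dc(intersect: float, sum_gt: float, sum_pred: float, smooth: float) -> float:
    # Same expression as the dice.py line quoted above.
    return float((2 * intersect + smooth) /
                 torch.clip(torch.tensor(sum_gt + sum_pred + smooth), 1e-8))

cases = {
    "empty GT, empty prediction":         (0.0, 0.0, 0.0),
    "2-voxel lesion, empty prediction":   (0.0, 2.0, 0.0),
    "2-voxel lesion, perfect prediction": (2.0, 2.0, 2.0),
}

for name, (intersect, sum_gt, sum_pred) in cases.items():
    print(f"{name}: "
          f"dc(smooth=1) = {dc(intersect, sum_gt, sum_pred, 1.0):.3f}, "
          f"dc(smooth=0) = {dc(intersect, sum_gt, sum_pred, 0.0):.3f}")

# smooth=1 returns dc=1 for an empty GT with an empty prediction (and still ~0.33 for a
# missed 2-voxel lesion), whereas smooth=0 returns dc=0 in both of those cases.
```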
## What is the smoothing term used for?

**tl;dr**: hard to say convincingly. The `nnUNetTrainerDiceCELoss_noSmooth` trainer uses only the small constant (because the smoothing term is set to zero).

**Details**
Initially, I incorrectly thought that the nnunetv2 smoothing term was used to prevent division by zero. I got this sense based on [this comment](https://github.com/keras-team/keras/issues/3611#issuecomment-492294505). But, after a deeper look at the equation in this comment, I found out that the equation uses only the smoothing term but no small constant.

Further investigation led me to these two discussions ([1](https://stackoverflow.com/questions/51973856/how-is-the-smooth-dice-loss-differentiable), [2](https://gist.github.com/wassname/7793e2058c5c9dacb5212c0ac0b18a8a)) about the Dice implementation in keras. Both discussions again use only the smoothing term, but no small constant:

```python
score = (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
```

Checking the ivadomed Dice implementation, I found that it also uses only the smoothing term (see [L63](https://github.com/ivadomed/ivadomed/blob/881dc6804323c7ccfcd30968c8f8113cc86fcfb9/ivadomed/losses.py#L63) in [ivadomed/losses.py](https://github.com/ivadomed/ivadomed/blob/master/ivadomed/losses.py)):

```python
return - (2.0 * intersection + self.smooth) / (iflat.sum() + tflat.sum() + self.smooth)
```

I also found [this comment](https://github.com/ivadomed/ivadomed/issues/183) from Charley Gros providing the following explanation (note that this comment is related to the ivadomed Dice without the small constant):

> A very probable reason is the different way these two functions are dealing with empty GT and empty pred. --> Dice loss returns 1

Both the keras and ivadomed implementations are in contrast with the nnunet implementation, which uses **_both_** smoothing term (`self.smooth`) and small constant (`1e-8`); see [L116](https://github.com/MIC-DKFZ/nnUNet/blob/997804c7510634dc8fd83f1194b434c60815a93e/nnunetv2/training/loss/dice.py#L116):

```python
dc = (2 * intersect + self.smooth) / (torch.clip(sum_gt + sum_pred + self.smooth, 1e-8))
```

Prompting chatGPT to explain why removing the smoothing term from the nnunet Dice helped prevent the collapse to zero provides a relatively reasonable explanation (highlighted in italics):

> The smoothing term, as used here, serves a dual purpose:
>
> 1. Stabilization of the Dice Coefficient: By adding the smoothing term (self.smooth) to both the numerator (nominator) and the denominator, it ensures that even when there are no true positives (TP) (e.g., the model predicts no lesion where there is actually no lesion, or it fails to predict a lesion where there is one), the Dice coefficient does not become undefined or result in division by zero. The use of torch.clip with a minimum value of 1e-8 further ensures that the denominator cannot be zero, which prevents division by zero errors.
> 2. Influence on Gradient Behavior and Model Sensitivity: The smoothing term affects the gradient of the loss with respect to the model's predictions. When TP is very small or zero, the addition of the smoothing term to both the numerator and denominator ensures that the model still receives a gradient signal that encourages it to improve its predictions. It essentially softens the harshness of the Dice loss, making it more forgiving in the early stages of training or in cases where the model struggles to detect small or sparse lesions. This can lead to a more stable and gradual learning process.
> _In your specific case, removing the smoothing term with the nnUNetTrainerDiceCELoss_noSmooth trainer likely made the loss more directly reflective of the actual performance of the model in terms of the overlap between the predicted segmentation and the ground truth. For the task of lesion segmentation, where lesions can be small and the balance between lesion and non-lesion areas is crucial, the direct feedback without the smoothing term's moderation might have better aligned the loss with the task's objectives, thereby improving model performance._

Further investigation and experiments comparing the nnunet default `nnUNetTrainer` trainer and `nnUNetTrainerDiceCELoss_noSmooth` are in progress.

Tagging @naga-karthik and @plbenveniste, who both also work on lesion segmentation. If either of you has time to go through the investigation above and check that I didn't make any naive mistakes, that would be great.