naga-karthik opened this issue 1 month ago (status: Open)
Good points!
> Take the following example: if there are 3 images in the test set, 2 of them have no lesions and 1 has a lesion. We compare two models A and B. Say model A is not good and predicts 3 empty masks, while model B is better and predicts 1 empty mask and 2 non-empty masks. Because we set Dice=1.0 when both the GT and prediction lesion masks are empty, a higher overall Dice score for model A does not automatically imply that it segments better than model B. In other words, a bad model that always predicts empty masks would score a higher Dice than a model that has somewhat learned but gets a lower Dice.
Maybe Dice is not the best metric to report in this case. What about Average Precision (AP)? See Figs. 50 and 80 in the MetricsReloaded preprint.
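For reference, AP sidesteps the empty-mask special case because it is computed from ranked voxel scores rather than from a per-subject overlap ratio. A minimal pure-Python sketch of AP (precision averaged at each rank where a true positive occurs); this is an illustrative implementation, not the MetricsReloaded one, and it does not handle tied scores:

```python
def average_precision(y_true, y_score):
    """AP for binary labels given continuous scores: average the
    precision at every rank where a positive is retrieved."""
    if sum(y_true) == 0:
        return 0.0  # no positives at all; convention, not a standard
    # Rank samples by descending score.
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            tp += 1
            ap += tp / rank  # precision at this rank
    return ap / sum(y_true)

# Toy example: 4 voxels, 2 of them lesion, scored by a model.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```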
Currently, when both the GT mask and the prediction mask are empty, we consider this a special case and set, e.g., the Dice score to 1 (as here). This is reasonable to some extent, under the argument that if the prediction is also empty when the GT is empty, then the model might have learned correctly.
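The special case described above amounts to something like the following sketch (masks represented as sets of voxel coordinates; this is illustrative, not the repository's actual implementation):

```python
def dice_score(gt, pred, empty_score=1.0):
    """Dice coefficient between two binary masks given as sets of
    voxel coordinates. When both masks are empty, return
    `empty_score` (the special case discussed above)."""
    if not gt and not pred:
        return empty_score  # both empty: treated as perfect agreement
    intersection = len(gt & pred)
    return 2.0 * intersection / (len(gt) + len(pred))
```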
But this might not always be the case. Take the following example: if there are 3 images in the test set, 2 of them have no lesions and 1 has a lesion. We compare two models A and B. Say model A is not good and predicts 3 empty masks, while model B is better and predicts 1 empty mask and 2 non-empty masks. Because we set `Dice=1.0` when both the GT and prediction lesion masks are empty, a higher overall Dice score for model A does not automatically imply that it segments better than model B. In other words, a bad model that always predicts empty masks would score a higher Dice than a model that has somewhat learned but gets a lower Dice.

The issue is that there is no clear consensus on how to proceed with segmentation metrics when both GT and predictions are empty. The Anima toolbox is not helpful here because it skips the subjects with empty GT masks, i.e., when running the anima evaluation, the overall metrics are averaged only over the subjects with non-empty lesion masks. This is not correct either, as it biases the segmentation metrics upward by ignoring the effect of potential false positives.
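The A-vs-B paradox above can be reproduced numerically. A minimal pure-Python sketch, with made-up lesion sizes and overlaps chosen only for illustration (masks as sets of voxel coordinates):

```python
def dice(gt, pred):
    # Both masks empty: the special case, Dice set to 1.0.
    if not gt and not pred:
        return 1.0
    return 2 * len(gt & pred) / (len(gt) + len(pred))

# Hypothetical test set: two empty GTs and one GT with a 4-voxel lesion.
gts = [set(), set(), {(0, 0), (0, 1), (1, 0), (1, 1)}]

# Model A always predicts an empty mask.
preds_a = [set(), set(), set()]
# Model B: correct on one empty case, one small false positive,
# and a good partial overlap (3 of 4 voxels) on the lesion case.
preds_b = [set(), {(5, 5)}, {(0, 0), (0, 1), (1, 0)}]

mean_a = sum(dice(g, p) for g, p in zip(gts, preds_a)) / len(gts)
mean_b = sum(dice(g, p) for g, p in zip(gts, preds_b)) / len(gts)
print(mean_a, mean_b)  # model A averages higher despite segmenting nothing
```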
Opening this issue as a note/documentation that this is an open problem; users should be aware of it when evaluating their models. Tagging @valosekj, who is also aware of this issue.