openvinotoolkit / anomalib

An anomaly detection library comprising state-of-the-art algorithms and features such as experiment management, hyper-parameter optimization, and edge inference.
https://anomalib.readthedocs.io/en/latest/

question about AdaptiveThreshold, possible bug? #457

Closed · davidebaltieri31 closed this issue 2 years ago

davidebaltieri31 commented 2 years ago

I'm running anomalib on a custom dataset (around 20k samples) and I've noticed constant crashes after a few epochs/validation steps (DefaultCPUAllocator: can't allocate memory...) almost independently of the model I choose.

It seems to be related to AdaptiveThreshold. It uses PrecisionRecallCurve internally, and after every validation step new data is added to it, but old data is never deleted. Why? Shouldn't the data related to the previous validation steps be deleted?

ORippler commented 2 years ago

Since PrecisionRecallCurve works on all data (i.e. it is computed on the complete data by default in torchmetrics), you will have to hold all your datapoints in memory. For large datasets, you can look at the binned PR curve, which torchmetrics implements with a constant-memory requirement.
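
A minimal sketch of the binned variant (assuming a torchmetrics version that still ships BinnedPrecisionRecallCurve; newer releases fold this into PrecisionRecallCurve via a thresholds argument):

import torch
from torchmetrics import BinnedPrecisionRecallCurve

# A fixed grid of thresholds is evaluated on the fly, so memory stays
# constant regardless of dataset size (at the cost of threshold resolution).
pr_curve = BinnedPrecisionRecallCurve(num_classes=1, thresholds=100)

preds = torch.rand(16)                 # anomaly scores in [0, 1]
target = torch.randint(0, 2, (16,))    # ground-truth labels
pr_curve.update(preds, target)
precision, recall, thresholds = pr_curve.compute()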

davidebaltieri31 commented 2 years ago

I think my post wasn't clear and I used the term "step" incorrectly. I understand that PrecisionRecallCurve works on all data, but why does it keep data from previous epochs? I.e. I run training and validation after each epoch: epoch 1 -> validation 1 -> epoch 2 -> validation 2 -> etc., which I think are the default settings in the repo config files. My question is: why does AdaptiveThreshold use PrecisionRecallCurve in a way that keeps data from both validation 1 and validation 2? Shouldn't it use the data from validation 2 only (which is the validation of the latest trained model)?

ORippler commented 2 years ago

> I think my post wasn't clear and I used the term "step" incorrectly. I understand that PrecisionRecallCurve works on all data, but why does it keep data from previous epochs? I.e. I run training and validation after each epoch: epoch 1 -> validation 1 -> epoch 2 -> validation 2 -> etc., which I think are the default settings in the repo config files. My question is: why does AdaptiveThreshold use PrecisionRecallCurve in a way that keeps data from both validation 1 and validation 2? Shouldn't it use the data from validation 2 only (which is the validation of the latest trained model)?

Ahh I see. At a quick glance it seems as if the reset method is missing from AdaptiveThreshold.

Can you try to add:

def reset(self):
    self.precision_recall_curve.reset()

to the class and see if that fixes the bug/problem? The class can be found here. The overall class should look as follows:

import torch
from torchmetrics import Metric, PrecisionRecallCurve


class AdaptiveThreshold(Metric):
    """Optimal F1 Metric.

    Compute the optimal F1 score at the adaptive threshold, based on the F1 metric of the true labels and the
    predicted anomaly scores.
    """

    def __init__(self, default_value: float = 0.5, **kwargs):
        super().__init__(**kwargs)

        self.precision_recall_curve = PrecisionRecallCurve(num_classes=1)
        self.add_state("value", default=torch.tensor(default_value), persistent=True)  # pylint: disable=not-callable
        self.value = torch.tensor(default_value)  # pylint: disable=not-callable

    # pylint: disable=arguments-differ
    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:  # type: ignore
        """Update the precision-recall curve metric."""
        self.precision_recall_curve.update(preds, target)

    def compute(self) -> torch.Tensor:
        """Compute the threshold that yields the optimal F1 score.

        Compute the F1 scores while varying the threshold. Store the optimal
        threshold as attribute and return the maximum value of the F1 score.

        Returns:
            Value of the F1 score at the optimal threshold.
        """
        precision: torch.Tensor
        recall: torch.Tensor
        thresholds: torch.Tensor

        precision, recall, thresholds = self.precision_recall_curve.compute()
        f1_score = (2 * precision * recall) / (precision + recall + 1e-10)
        if thresholds.dim() == 0:
            # special case where recall is 1.0 even for the highest threshold.
            # In this case 'thresholds' will be scalar.
            self.value = thresholds
        else:
            self.value = thresholds[torch.argmax(f1_score)]
        return self.value

    # ADDED METHOD
    def reset(self) -> None:
        """Clear the accumulated precision-recall curve state.

        Note that only the curve state is cleared; the stored threshold
        value is kept.
        """
        self.precision_recall_curve.reset()

Note: the same change will be necessary for OptimalF1 as well, as sketched below.
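
A sketch of the analogous change, assuming OptimalF1 wraps its curve under the same precision_recall_curve attribute name as AdaptiveThreshold above:

class OptimalF1(Metric):
    # ... existing __init__ / update / compute unchanged ...

    def reset(self) -> None:
        self.precision_recall_curve.reset()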

davidebaltieri31 commented 2 years ago

Hi, I added the reset code to both, but it is never called. I think there are probably other changes that need to be made, maybe in anomaly_module or somewhere else in anomalib?

ORippler commented 2 years ago

> Hi, I added the reset code to both, but it is never called. I think there are probably other changes that need to be made, maybe in anomaly_module or somewhere else in anomalib?

Did you install anomalib with the -e flag, i.e. via pip install -e?

If so, please provide an MWE (minimal working example) so that the bug can be localized.
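
If it helps, a hypothetical reproduction sketch along these lines could confirm whether state still accumulates across epochs (the import path and the preds state attribute are assumptions that may differ across anomalib/torchmetrics versions):

import torch
from anomalib.utils.metrics import AdaptiveThreshold  # assumed import path

metric = AdaptiveThreshold(default_value=0.5)

for epoch in range(3):
    # simulate one validation epoch on random scores/labels
    metric.update(torch.rand(1000), torch.randint(0, 2, (1000,)))
    metric.compute()
    metric.reset()  # with the fix, the accumulated curve state is cleared here
    # PrecisionRecallCurve keeps raw predictions in a list state;
    # after a working reset() it should be empty again
    print(epoch, len(metric.precision_recall_curve.preds))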

jpcbertoldo commented 2 years ago

Shouldn't this one be closed?

djdameln commented 2 years ago

I believe that the original issue is not yet resolved. While the reset method has been added to the AdaptiveThreshold metric class, the method is never called. This means that for models that are trained for multiple epochs, the data from previous epochs is never cleared, which in turn may lead to out-of-memory errors. It could also lead to an inaccurate computation of the adaptive threshold. We will investigate this a bit further and work on a fix.
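
One possible direction (a sketch only, not the final fix): reset the threshold metrics at the start of every validation epoch from the Lightning side, e.g. via a callback. The image_threshold/pixel_threshold attribute names are assumptions about how the anomaly module exposes its metrics:

from pytorch_lightning import Callback

class ThresholdResetCallback(Callback):
    """Reset threshold metrics before each validation epoch so the adaptive
    threshold is computed on the current epoch's outputs only."""

    def on_validation_epoch_start(self, trainer, pl_module):
        # attribute names are assumed; the module may expose them differently
        for name in ("image_threshold", "pixel_threshold"):
            metric = getattr(pl_module, name, None)
            if metric is not None:
                metric.reset()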