openvinotoolkit / anomalib

An anomaly detection library comprising state-of-the-art algorithms and features such as experiment management, hyper-parameter optimization, and edge inference.
https://anomalib.readthedocs.io/en/latest/
Apache License 2.0

[Bug]: Training with PADiM does not care about "max_epochs" parameter #2134

Closed: haimat closed this issue 3 weeks ago

haimat commented 3 weeks ago

Describe the bug

I want to train a PADiM model with anomalib; however, training always stops after the first epoch, even though I pass max_epochs=100 when creating the Engine() object (see below).

Dataset

Other (please specify in the text field below)

Model

PADiM

Steps to reproduce the behavior

I use the following training script:

from anomalib.models import Padim
from anomalib.deploy import ExportType
from anomalib.engine import Engine
from anomalib.data import Folder
from anomalib import TaskType

import os

def train():
    task_type = TaskType.CLASSIFICATION
    input_size = (256, 256)
    root_folder = "/data/scratch/anomalib"

    # Create the datamodule
    datamodule = Folder(
        name="Test",
        root=os.path.join(root_folder, "images"),
        normal_dir="normal",
        abnormal_dir="abnormal",
        image_size=input_size,
        task=task_type,
        train_batch_size=1,
    )
    datamodule.prepare_data()
    datamodule.setup()

    # Create the model
    model = Padim()
    engine = Engine(
        max_epochs=100,
        task=task_type,
        accelerator="gpu",
        devices=-1,
        callbacks=[],
    )
    engine.fit(datamodule=datamodule, model=model)
    engine.test(datamodule=datamodule, model=model, ckpt_path=engine.trainer.checkpoint_callback.best_model_path)

    # Export the model
    engine.export(model=model, export_type=ExportType.ONNX, export_root=root_folder)

if __name__ == "__main__":
    train()

OS information

No response

Expected behavior

Since I pass max_epochs=100, I would expect the training not to stop after the first epoch with the message "max_epochs=1 reached".

Screenshots

No response

Pip/GitHub

pip

What version/branch did you use?

No response

Configuration YAML

?

Logs

Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch.
You are using a CUDA device ('NVIDIA GeForce RTX 4080 SUPER') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py:181: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer
┏━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name                  ┃ Type                     ┃ Params ┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ model                 │ PadimModel               │  2.8 M │
│ 1 │ _transform            │ Compose                  │      0 │
│ 2 │ normalization_metrics │ MinMax                   │      0 │
│ 3 │ image_threshold       │ F1AdaptiveThreshold      │      0 │
│ 4 │ pixel_threshold       │ F1AdaptiveThreshold      │      0 │
│ 5 │ image_metrics         │ AnomalibMetricCollection │      0 │
│ 6 │ pixel_metrics         │ AnomalibMetricCollection │      0 │
└───┴───────────────────────┴──────────────────────────┴────────┘
Trainable params: 2.8 M
Non-trainable params: 0
Total params: 2.8 M
Total estimated model params size (MB): 11
/home/sinntelligence/.local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py:132: `training_step` returned `None`. If this was on purpose, ignore this warning...
Epoch 0/0  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0:00:32 • 0:00:00 2.91it/s
`Trainer.fit` stopped: `max_epochs=1` reached.

Code of Conduct

samet-akcay commented 3 weeks ago

Hi @haimat, this is because PADiM requires only a single pass through the dataset: one epoch is enough for it to collect the feature statistics it models. Increasing the number of epochs would only repeat the same process and wouldn't improve performance. That's why we hard-code the number of epochs to 1.
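
For context on the mechanics: anomalib v1 models expose per-model trainer overrides through a trainer_arguments property, which the Engine applies on top of user-supplied arguments; that is how the user's max_epochs=100 ends up replaced. A quick way to inspect this, assuming the v1.x API (the exact keys and values may differ between versions):

from anomalib.models import Padim

model = Padim()
# Per-model trainer overrides that the Engine applies on top of user args;
# for Padim this pins max_epochs to 1. Contents may vary across versions.
print(model.trainer_arguments)
# e.g. {'max_epochs': 1, 'val_check_interval': 1.0, 'num_sanity_val_steps': 0}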
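
And for intuition on why one epoch suffices: PADiM does no gradient-based training. It embeds training patches with a frozen backbone, fits a per-patch multivariate Gaussian, and scores test patches by Mahalanobis distance. Below is a minimal, hypothetical sketch of that idea with random stand-in embeddings, not anomalib's actual implementation (the 0.01 covariance regularizer follows the PADiM paper):

import torch

# Stand-in patch embeddings: N training images, P patches, D embedding dims.
# (Random tensors here; PADiM takes these from a frozen CNN backbone.)
N, P, D = 87, 256, 100
emb = torch.randn(N, P, D)

# "Training" is one statistics pass: per-patch mean and covariance.
mean = emb.mean(dim=0)                                    # (P, D)
centered = emb - mean                                     # (N, P, D)
cov = torch.einsum("npi,npj->pij", centered, centered) / (N - 1)
cov = cov + 0.01 * torch.eye(D)                           # regularize, as in the paper
cov_inv = torch.linalg.inv(cov)                           # (P, D, D)

# Inference: Mahalanobis distance of each test patch to its Gaussian.
test = torch.randn(P, D)
delta = test - mean                                       # (P, D)
scores = torch.einsum("pi,pij,pj->p", delta, cov_inv, delta).sqrt()
print(scores.shape)  # torch.Size([256]), per-patch anomaly scores

A second pass over the same data would produce the same mean and covariance, so extra epochs cannot change the fitted model.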