openvinotoolkit / anomalib

An anomaly detection library comprising state-of-the-art algorithms and features such as experiment management, hyper-parameter optimization, and edge inference.
https://anomalib.readthedocs.io/en/latest/
Apache License 2.0

[Bug]: tiling with Padim crashes #2142

Open lzd-1230 opened 3 months ago

lzd-1230 commented 3 months ago

Describe the bug

Here is the code used for training:

from pathlib import Path

from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint

from anomalib.data import Folder
from anomalib.engine import Engine
from anomalib.models import Padim
from anomalib import TaskType
from anomalib.callbacks import TilerConfigurationCallback

dataset_root = Path.cwd() / "ad-half-data" / "up"
task = TaskType.SEGMENTATION

datamodule = Folder(
    root=dataset_root,
    name="phone-half",
    normal_dir="good-1024-s",
    abnormal_dir="flaw-1024",
    mask_dir="mask/flaw-1024",
    train_batch_size=1,
    eval_batch_size=1,
    num_workers=30,
    image_size=(1024, 1024),
    task=task,
)

model = Padim(backbone="wide_resnet50_2", pre_trained=True, n_features=550)

callbacks = [
    ModelCheckpoint(
        mode="max",
        monitor="pixel_F1Score",
    ),
    EarlyStopping(
        monitor="pixel_F1Score",
        mode="max",
        patience=3,
    ),
    TilerConfigurationCallback(enable=True, 
                               tile_size=256, 
                               stride=256)
]

engine = Engine(
    callbacks=callbacks,
    pixel_metrics=["F1Score", "AUROC"],
    accelerator="auto",  # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
    devices=1,
    logger=False,
)

engine.train(datamodule=datamodule, model=model)

I have 150 good images in good-1024-s for training. After I run this script, the SSH connection is lost and the machine appears to crash, with no error message or hint as to why.


Dataset

Folder

Model

PADiM

Steps to reproduce the behavior

Run the code above with the same 150 images at 1024x1024.

OS information


Expected behavior

Expected training to complete without crashing.

Screenshots

No response

Pip/GitHub

GitHub

What version/branch did you use?

No response

Configuration YAML

None

Logs


Code of Conduct

abc-125 commented 3 months ago

Can you try it with a lower resolution or fewer images? It could be an out-of-memory error.

lzd-1230 commented 3 months ago

Can you try it with a lower resolution or fewer images? It could be an out-of-memory error.

Yeah, I tried training with 30 images (1024*1024); the SSH connection no longer drops, and I get the following logs:

Traceback (most recent call last):
  File "/home/lzd/patchcore-inspection/anomalib/train-padim.py", line 56, in <module>
    engine.train(datamodule=datamodule, model=model)
  File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/engine/engine.py", line 863, in train
    self.trainer.fit(model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 141, in run
    self.on_advance_end(data_fetcher)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 295, in on_advance_end
    self.val_loop.run()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 114, in run
    self.on_run_start()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 244, in on_run_start
    self._on_evaluation_start()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 290, in _on_evaluation_start
    call._call_lightning_module_hook(trainer, hook_name, *args, **kwargs)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/components/base/memory_bank_module.py", line 37, in on_validation_start
    self.fit()
  File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/image/padim/lightning_model.py", line 86, in fit
    self.stats = self.model.gaussian.fit(embeddings)
  File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/components/stats/multi_variate_gaussian.py", line 136, in fit
    return self.forward(embedding)
  File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/components/stats/multi_variate_gaussian.py", line 117, in forward
    covariance = torch.zeros(size=(channel, channel, height * width), device=device)
RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 79298560000 bytes. Error code 12 (Cannot allocate memory)

I'm using tiling because the model doesn't perform well on high-resolution images, but tiling doesn't seem to be well supported for Padim. I can tile successfully with PatchCore.

abc-125 commented 3 months ago

Does Padim work if you use it without tiling? It could just be that Padim and PatchCore have different memory requirements.

blaz-r commented 2 months ago

I think this is indeed an out-of-memory issue, but it's rather unusual that PatchCore works and Padim doesn't.