openvinotoolkit / anomalib

An anomaly detection library comprising state-of-the-art algorithms and features such as experiment management, hyper-parameter optimization, and edge inference.
https://anomalib.readthedocs.io/en/latest/
Apache License 2.0
3.68k stars 654 forks

`Trainer.fit` stopped: `max_epochs=1` reached. Epoch 0/0 ------------------ 2/2 0:00:15 • 0:00:00 6.37it/s pixel_AUROC: 0.000 pixel_F1Score: 0.000 #2141

Closed yxl23 closed 2 months ago

yxl23 commented 3 months ago

Describe the bug

E:\Andconda3\envs\yolov10-main\python.exe D:/shenduxuexi/anomalib/xunlian.py
dict_keys(['image_path', 'label', 'image', 'mask'])
torch.Size([32, 3, 640, 640]) torch.Size([32, 640, 640])
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
You are using a CUDA device ('NVIDIA GeForce RTX 3080 Ti') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
E:\Andconda3\envs\yolov10-main\lib\site-packages\lightning\pytorch\core\optimizer.py:182: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer
┌───┬───────────────────────┬──────────────────────────┬────────┬───────┐
│   │ Name                  │ Type                     │ Params │ Mode  │
├───┼───────────────────────┼──────────────────────────┼────────┼───────┤
│ 0 │ model                 │ PadimModel               │  2.8 M │ train │
│ 1 │ _transform            │ Compose                  │      0 │ train │
│ 2 │ normalization_metrics │ MinMax                   │      0 │ train │
│ 3 │ image_threshold       │ F1AdaptiveThreshold      │      0 │ train │
│ 4 │ pixel_threshold       │ F1AdaptiveThreshold      │      0 │ train │
│ 5 │ image_metrics         │ AnomalibMetricCollection │      0 │ train │
│ 6 │ pixel_metrics         │ AnomalibMetricCollection │      0 │ train │
└───┴───────────────────────┴──────────────────────────┴────────┴───────┘
Trainable params: 2.8 M                                                        
Non-trainable params: 0                                                        
Total params: 2.8 M                                                            
Total estimated model params size (MB): 11                                     
E:\Andconda3\envs\yolov10-main\lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:419: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
E:\Andconda3\envs\yolov10-main\lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:419: Consider setting `persistent_workers=True` in 'val_dataloader' to speed up the dataloader worker initialization.
E:\Andconda3\envs\yolov10-main\lib\site-packages\lightning\pytorch\loops\optimization\automatic.py:132: `training_step` returned `None`. If this was on purpose, ignore this warning...
E:\Andconda3\envs\yolov10-main\lib\site-packages\anomalib\models\components\filters\blur.py:91: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ..\aten\src\ATen\native\cudnn\Conv_v8.cpp:919.)
  output = F.conv2d(input_tensor, self.kernel, groups=self.channels, padding=0, stride=1)
WARNING:root:The validation set does not contain any anomalous images. As a result, the adaptive threshold will take the value of the highest anomaly score observed in the normal validation images, which may lead to poor predictions. For a more reliable adaptive threshold computation, please add some anomalous images to the validation set.
E:\Andconda3\envs\yolov10-main\lib\site-packages\torchmetrics\utilities\prints.py:43: UserWarning: No positive samples found in target, recall is undefined. Setting recall to one for all thresholds.
  warnings.warn(*args, **kwargs)  # noqa: B028
E:\Andconda3\envs\yolov10-main\lib\site-packages\torchmetrics\utilities\prints.py:43: UserWarning: No positive samples in targets, true positive value should be meaningless. Returning zero tensor in true positive score
  warnings.warn(*args, **kwargs)  # noqa: B028
`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0/0  ------------------ 2/2 0:00:15 • 0:00:00 6.37it/s pixel_AUROC: 0.000
                                                             pixel_F1Score:    
                                                             0.000             
WARNING:anomalib.metrics.f1_score:F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
WARNING:anomalib.metrics.f1_score:F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
Restoring states from the checkpoint path at D:\shenduxuexi\anomalib\results\Padim\shuju\v8\weights\lightning\model.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at D:\shenduxuexi\anomalib\results\Padim\shuju\v8\weights\lightning\model.ckpt
E:\Andconda3\envs\yolov10-main\lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:419: Consider setting `persistent_workers=True` in 'test_dataloader' to speed up the dataloader worker initialization.
┌───────────────────────────┬───────────────────────────┐
│        Test metric        │       DataLoader 0        │
├───────────────────────────┼───────────────────────────┤
│        image_AUROC        │            1.0            │
│       image_F1Score       │    0.9433962106704712     │
│        pixel_AUROC        │            0.0            │
│       pixel_F1Score       │            0.0            │
└───────────────────────────┴───────────────────────────┘
Testing -------------------------------------- 2/2 0:00:10 • 0:00:00 0.10it/s 

Process finished with exit code 0

Dataset

N/A

Model

N/A

Steps to reproduce the behavior

from typing import Any

import numpy as np
from matplotlib import pyplot as plt
from PIL import Image
from torchvision.transforms import ToPILImage

from anomalib import TaskType
from anomalib.data import Folder
from anomalib.data.utils import read_image
from anomalib.deploy import OpenVINOInferencer, ExportType
from anomalib.engine import Engine
from anomalib.models import Padim


def show_image_and_mask(sample: dict[str, Any], index: int) -> Image.Image:
    """Show an image with a mask.

    Args:
        sample (dict[str, Any]): Sample from the dataset.
        index (int): Index of the sample.

    Returns:
        Image.Image: Output image with a mask.
    """
    # Load the image from the path
    image = Image.open(sample["image_path"][index])

    # Load the mask and convert it to RGB
    mask = ToPILImage()(sample["mask"][index]).convert("RGB")

    # Resize mask to match image size, if they differ
    if image.size != mask.size:
        mask = mask.resize(image.size)

    # Place the image and its mask side by side
    combined_image = Image.fromarray(np.hstack((np.array(image), np.array(mask))))
    return combined_image


if __name__ == '__main__':
    datamodule = Folder(
        num_workers=8,
        name='shuju',
        root='shuju',
        mask_dir='mask/ng',
        normal_dir='good',
        abnormal_dir='ng',
        task=TaskType.SEGMENTATION,
        image_size=[640, 640],
    )
    datamodule.prepare_data()  # Downloads the dataset if it's not in the specified root directory
    datamodule.setup()  # Create train/val/test/prediction sets.
    i, data = next(enumerate(datamodule.val_dataloader()))
    print(data.keys())
    print(data["image"].shape, data["mask"].shape)

    # Visualize an image with a mask
    image_with_mask = show_image_and_mask(data, index=0)

    # Display the image with matplotlib
    plt.imshow(image_with_mask)
    plt.axis('off')  # Hide the axes
    plt.show()

    model = Padim()
    engine = Engine(task=TaskType.SEGMENTATION)
    engine.fit(model=model, datamodule=datamodule)
    test_results = engine.test(
        model=model,
        datamodule=datamodule,
        ckpt_path=engine.trainer.checkpoint_callback.best_model_path,
    )

OS information

OS information:

Expected behavior

None

Screenshots

No response

Pip/GitHub

pip

What version/branch did you use?

Latest version

Configuration YAML

None

alexriedel1 commented 3 months ago

What do your image masks look like? What do the output images look like? If you can't show them, are you sure the ground-truth masks are correct?

yxl23 commented 3 months ago

My mask is correct, but it is very small because my industrial defects are very small.
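A quick way to check the point about the masks is to count the positive pixels in each mask of a batch after it has gone through the datamodule. The following is a minimal sketch using only NumPy; `mask_positive_stats` is a hypothetical helper, not part of anomalib, and it assumes the masks arrive as a `(batch, height, width)` array in which any nonzero value marks a defect.

```python
import numpy as np


def mask_positive_stats(masks: np.ndarray) -> list[dict]:
    """Count defect pixels per mask.

    Hypothetical helper (not an anomalib API).
    masks: array of shape (B, H, W); nonzero values are treated as defect pixels.
    """
    stats = []
    for mask in masks:
        positives = int(np.count_nonzero(mask))
        stats.append({
            "positive_pixels": positives,
            "fraction": positives / mask.size,
        })
    return stats


# Synthetic stand-in for a batch: one mask with a 4x4 defect, one empty mask.
batch = np.zeros((2, 640, 640), dtype=np.uint8)
batch[0, 100:104, 200:204] = 1

for i, entry in enumerate(mask_positive_stats(batch)):
    print(i, entry["positive_pixels"], f"{entry['fraction']:.6f}")
```

With the script from this issue, the same helper could presumably be called as `mask_positive_stats(data["mask"].numpy())`, assuming the batch tensor lives on the CPU. If every mask reports zero positive pixels, the pixel-level metrics have nothing to score against, which would be consistent with the "No positive samples" warnings in the log.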

krupeshp commented 2 months ago

> My mask is correct, but it is very small because my industrial defects are very small.

Same here. We can't even fine-tune or train the model for more than 1 epoch. Do you know if there is a solution?

samet-akcay commented 2 months ago

> > My mask is correct, but it is very small because my industrial defects are very small.
>
> Same here. We can't even fine-tune or train the model for more than 1 epoch. Do you know if there is a solution?

@krupeshp, if you are using the Padim model, you should not train it for more than 1 epoch. The model does not need any training or fine-tuning; it just needs 1 epoch to go over the dataset and extract the features.
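To illustrate why a single pass is enough, here is a simplified NumPy sketch of the idea behind PaDiM-style fitting: per-patch Gaussian statistics are estimated once from features of normal images, and test patches are scored by Mahalanobis distance. This is not anomalib's implementation. The function names, the `eps` regularizer, and the random arrays standing in for pretrained-backbone features are all assumptions made for illustration.

```python
import numpy as np


def fit_gaussians(features: np.ndarray, eps: float = 0.01):
    """Estimate one Gaussian per patch position in a single pass.

    features: (N, P, D) = N normal images, P patch positions, D feature dims.
    Returns the per-patch mean (P, D) and inverse covariance (P, D, D).
    """
    n, _, d = features.shape
    mean = features.mean(axis=0)
    centered = features - mean
    # Per-patch sample covariance, regularized so it is always invertible.
    cov = np.einsum("npi,npj->pij", centered, centered) / (n - 1)
    cov += eps * np.eye(d)
    return mean, np.linalg.inv(cov)


def anomaly_scores(feature: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each patch of one image, shape (P,)."""
    delta = feature - mean
    squared = np.einsum("pi,pij,pj->p", delta, inv_cov, delta)
    return np.sqrt(np.maximum(squared, 0.0))


# Random features stand in for the output of a pretrained feature extractor.
rng = np.random.default_rng(0)
train_features = rng.normal(size=(50, 4, 3))
mean, inv_cov = fit_gaussians(train_features)

normal_sample = rng.normal(size=(4, 3))
shifted_sample = normal_sample + 10.0  # far from the normal distribution

scores_normal = anomaly_scores(normal_sample, mean, inv_cov)
scores_shifted = anomaly_scores(shifted_sample, mean, inv_cov)
print(scores_normal.mean() < scores_shifted.mean())
```

Nothing in this sketch is optimized by gradient descent, which matches the `configure_optimizers` returned `None` message in the log: the statistics are complete after one pass over the data, so further epochs would only recompute them.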

ashwinvaidya17 commented 2 months ago

I am closing this as I wouldn't categorize this as a bug. The issue seems very specific to the dataset. The Padim model passes the internal regression tests, and from the logs it looks like only the pixel-level performance is poor. It might be related to the dataset. If it is still an issue, we can continue on the Discussions page. Also, a few examples from the dataset would help inform those discussions.
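As background for reading the pixel-level numbers, the sketch below computes AUROC directly from its pairwise definition: the fraction of (anomalous, normal) pixel pairs in which the anomalous pixel gets the higher score. `pairwise_auroc` is a hypothetical helper written for this illustration, not the torchmetrics implementation. It shows two different ways a value can look bad: a true 0.0 means the ranking is perfectly inverted, while a target with no positive pixels makes the metric undefined, which is the situation the "No positive samples" warnings in the log describe.

```python
import numpy as np


def pairwise_auroc(scores, labels):
    """AUROC from pairwise comparisons; returns None when it is undefined.

    Hypothetical helper for illustration only.
    scores: anomaly score per pixel; labels: 1 for anomalous, 0 for normal.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        return None  # no positive (or no negative) pixels: AUROC is undefined
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)


print(pairwise_auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # mostly correct ranking
print(pairwise_auroc([0.9, 0.8, 0.1, 0.2], [0, 0, 1, 1]))   # inverted ranking
print(pairwise_auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 0, 0]))  # no positives at all
```

Whether the 0.0 reported in this issue comes from an inverted ranking or from targets without positive pixels cannot be decided from the thread alone; inspecting the masks that reach the metric would distinguish the two.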