[Bug]: The same error occurs when using Anomalib's fastflow, Draem, and other models that need to be trained many epochs

Describe the bug

The same error occurs when using Anomalib's fastflow, Draem, and other models that need to be trained many epochs

Dataset

Other (please specify in the text field below)

Model

FastFlow

Steps to reproduce the behavior

1： Download the latest anomalib
2： Install pytorch 11.3+cu13 （My GPU: Nvidia RTX 2080Ti -- 13G）
3： pip install -e .
4： pip install -r requirement.txt

OS information

OS： win10
IDE： Pycharm
GPU： 2080Ti
pytorch： 1.12.1(GPU)
Anomalib version： latest（Until April 21, 2023）

Expected behavior

Many thanks to the authors involved for open sourcing this great library Anomalib. I think it is a milestone in the defect detection field, it is a great work and congratulations to them for the result.

I have successfully trained on my own dataset using Padim, patchCore, etc. and have achieved good results.

But unfortunately, I'm getting a lot of the same errors when training my own data with fastflow, a model that Draem needs to iterate through multiple updates.

Below I will use fastflow's related logs to illustrate

I made changes to fastflow's config.Yaml, but the changes I made were limited to the dataset section, as follows:

dataset:
  name: mydata
  format: folder
  path: ./Mydatasets/cubes
  normal_dir: normal # 独有 name of the folder containing normal images.
  abnormal_dir: abnormal # 独有 name of the folder containing abnormal images.
  task: classification
  normal_test_dir: null
  mask: null
  extensions: null
  image_size: 256 # dimensions to which images are resized (mandatory)
  center_crop: 224 # dimensions to which images are center-cropped after resizing (optional)
  normalization: imagenet # data distribution to which the images will be normalized: [none, imagenet]
  train_batch_size: 32
  test_batch_size: 32
  transform_config:
    train: null
    eval: null
  num_workers: 4
  test_split_mode: from_dir # options: [from_dir, synthetic]
  test_split_ratio: 0.2 # fraction of train images held out testing (usage depends on test_split_mode)
  val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
  val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)
  tiling:
    apply: false
    tile_size: null
    stride: null
    remove_border_count: 0
    use_random_tiling: False
    random_tile_count: 16

model:
  name: fastflow
  backbone: wide_resnet50_2 # options: [resnet18, wide_resnet50_2, cait_m48_448, deit_base_distilled_patch16_384]
  pre_trained: true
  flow_steps: 8 # options: [8, 8, 20, 20] - for each supported backbone
  hidden_ratio: 1.0 # options: [1.0, 1.0, 0.16, 0.16] - for each supported backbone
  conv3x3_only: True # options: [True, False, False, False] - for each supported backbone
  lr: 0.001
  weight_decay: 0.00001
  early_stopping:
    patience: 3
    metric: image_F1Score    #这里我对其进行了限制 'pixel_AUROC' `train_loss`, `train_loss_step`, `image_F1Score`, `image_AUROC`, `train_loss_epoch`
    mode: max
  normalization_method: min_max # options: [null, min_max, cdf]

metrics:
  image:
    - F1Score
    - AUROC
  pixel:
    - F1Score
    - AUROC
  threshold:
    method: adaptive #options: [adaptive, manual]
    manual_image: null
    manual_pixel: null

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: True # log images to the available loggers (if any)
  image_save_path: null # path to which images will be saved
  mode: full # options: ["full", "simple"]

project:
  seed: 42
  path: ./results

logging:
  logger: [] # options: [comet, tensorboard, wandb, csv] or combinations.
  log_graph: false # Logs the model graph to respective logger.

optimization:
  export_mode: null # options: torch, onnx, openvino

# PL Trainer Args. Don't add extra parameter here.
trainer:
  enable_checkpointing: true
  default_root_dir: null
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  num_nodes: 1
  devices: 1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1 # Don't validate before extracting features.
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 500
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  val_check_interval: 1.0 # Don't validate before extracting features.
  log_every_n_steps: 50
  accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  strategy: null
  sync_batchnorm: false
  precision: 32
  enable_model_summary: true
  num_sanity_val_steps: 0
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_n_epochs: 0
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  auto_scale_batch_size: false
  plugins: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle

Then I gleefully picked up my coffee and prepared to wait for it to run the 500 times I had set for training. But it runs for a while and then reports an error directly:

CB2E73B

So I followed the instructions and found the corresponding place in the Yaml file and made the changes （Changed pixel_AUROC to image_AUROC in matric）

 early_stoping:
          patience: 3
          metric:  image_AUROC # `train_loss`, `train_loss_step`, `image_F1Score`, `image_AUROC`, `train_loss_epoch`

But fastflow it only trained three rounds on the error, and his loss function is still very high (but DRAEM I so modified after training more than 40 epochs, the loss function down to 0.17 or so)

222222 333333

I think it is extremely unreasonable that his loss function still has 7.64e+04 at the end of the training

Screenshots

No response

Pip/GitHub

pip

What version/branch did you use?

No response

Configuration YAML

Logs

| Name                  | Type                     | Params
-------------------------------------------------------------------
0 | image_threshold       | AnomalyScoreThreshold    | 0     
1 | pixel_threshold       | AnomalyScoreThreshold    | 0     
2 | model                 | FastflowModel            | 9.5 M 
3 | loss                  | FastflowLoss             | 0     
4 | image_metrics         | AnomalibMetricCollection | 0     
5 | pixel_metrics         | AnomalibMetricCollection | 0     
6 | normalization_metrics | MinMax                   | 0     
-------------------------------------------------------------------
5.4 M     Trainable params
4.2 M     Non-trainable params
9.5 M     Total params
38.076    Total estimated model params size (MB)
Epoch 0:   0%|          | 0/6 [00:00<?, ?it/s] C:\ProgramData\anaconda3\envs\HC_Anomalib\lib\site-packages\pytorch_lightning\core\module.py:481: UserWarning: You called `self.log('train_loss', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
  rank_zero_warn(
Epoch 0:  50%|█████     | 3/6 [00:16<00:16,  5.45s/it, loss=1.55e+05, train_loss_step=1.33e+5]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/3 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/3 [00:00<?, ?it/s]
Epoch 0:  67%|██████▋   | 4/6 [00:31<00:15,  7.78s/it, loss=1.55e+05, train_loss_step=1.33e+5]
Epoch 0:  83%|████████▎ | 5/6 [00:31<00:06,  6.24s/it, loss=1.55e+05, train_loss_step=1.33e+5]
Epoch 0: 100%|██████████| 6/6 [00:31<00:00,  5.22s/it, loss=1.55e+05, train_loss_step=1.33e+5, image_F1Score=1.000, image_AUROC=1.000]
C:\ProgramData\anaconda3\envs\HC_Anomalib\lib\site-packages\torchmetrics\utilities\prints.py:36: DeprecationWarning: `torchmetrics.functional.auc` has been move to `torchmetrics.utilities.compute` in v0.10 and will be removed in v0.11.
  warnings.warn(*args, **kwargs)
Epoch 1:  50%|█████     | 3/6 [00:15<00:15,  5.22s/it, loss=1.27e+05, train_loss_step=8.21e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=1.56e+5]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/3 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/3 [00:00<?, ?it/s]
Epoch 1:  67%|██████▋   | 4/6 [00:30<00:15,  7.73s/it, loss=1.27e+05, train_loss_step=8.21e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=1.56e+5]
Epoch 1:  83%|████████▎ | 5/6 [00:30<00:06,  6.20s/it, loss=1.27e+05, train_loss_step=8.21e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=1.56e+5]
Epoch 1: 100%|██████████| 6/6 [00:31<00:00,  5.18s/it, loss=1.27e+05, train_loss_step=8.21e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=1.56e+5]
Epoch 2:  50%|█████     | 3/6 [00:15<00:15,  5.21s/it, loss=1.01e+05, train_loss_step=3.22e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=9.92e+4]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/3 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/3 [00:00<?, ?it/s]
Epoch 2:  67%|██████▋   | 4/6 [00:31<00:15,  7.77s/it, loss=1.01e+05, train_loss_step=3.22e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=9.92e+4]
Epoch 2:  83%|████████▎ | 5/6 [00:31<00:06,  6.23s/it, loss=1.01e+05, train_loss_step=3.22e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=9.92e+4]
Epoch 2: 100%|██████████| 6/6 [00:31<00:00,  5.20s/it, loss=1.01e+05, train_loss_step=3.22e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=9.92e+4]
Epoch 3:  50%|█████     | 3/6 [00:16<00:16,  5.36s/it, loss=7.64e+04, train_loss_step=-1.04e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=4.87e+4]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/3 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/3 [00:00<?, ?it/s]
Epoch 3:  67%|██████▋   | 4/6 [00:31<00:15,  7.93s/it, loss=7.64e+04, train_loss_step=-1.04e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=4.87e+4]
Epoch 3:  83%|████████▎ | 5/6 [00:31<00:06,  6.36s/it, loss=7.64e+04, train_loss_step=-1.04e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=4.87e+4]
Epoch 3: 100%|██████████| 6/6 [00:31<00:00,  5.31s/it, loss=7.64e+04, train_loss_step=-1.04e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=4.87e+4]
Epoch 3: 100%|██████████| 6/6 [00:31<00:00,  5.31s/it, loss=7.64e+04, train_loss_step=-1.04e+4, image_F1Score=1.000, image_AUROC=1.000, train_loss_epoch=3.62e+3]
2023-04-21 18:06:11,485 - anomalib.utils.callbacks.timer - INFO - Training took 128.13 seconds
2023-04-21 18:06:11,485 - anomalib - INFO - Loading the best model weights.
2023-04-21 18:06:11,485 - anomalib - INFO - Testing the model.
2023-04-21 18:06:11,501 - pytorch_lightning.utilities.rank_zero - INFO - The following callbacks returned in `LightningModule.configure_callbacks` will override existing callbacks passed to Trainer: EarlyStopping
2023-04-21 18:06:11,501 - anomalib.utils.callbacks.model_loader - INFO - Loading the model from D:\MyCode\4_HC\Time\202304\Anomalib_Seg\anomalib-main\tools\results\fastflow\mydata\run\weights\lightning\model-v11.ckpt
2023-04-21 18:06:11,594 - anomalib.utils.callbacks.metrics_configuration - WARNING - Cannot perform pixel-level evaluation when task type is classification. Ignoring the following pixel-level metrics: ['F1Score', 'AUROC']
2023-04-21 18:06:11,641 - pytorch_lightning.accelerators.cuda - INFO - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing DataLoader 0:   0%|          | 0/3 [00:00<?, ?it/s]D:\MyCode\4_HC\Time\202304\Anomalib_Seg\anomalib-main\src\anomalib\post_processing\visualizer.py:264: MatplotlibDeprecationWarning: Support for FigureCanvases without a required_interactive_framework attribute was deprecated in Matplotlib 3.6 and will be removed two minor releases later.
  self.figure, self.axis = plt.subplots(1, num_cols, figsize=figure_size)
Testing DataLoader 0: 100%|██████████| 3/3 [00:12<00:00,  4.12s/it]2023-04-21 18:06:38,785 - anomalib.utils.callbacks.timer - INFO - Testing took 27.11213994026184 seconds
Throughput (batch_size=32) : 3.5039653900179197 FPS
D:\MyCode\4_HC\Time\202304\Anomalib_Seg\anomalib-main\src\anomalib\utils\metrics\plotting_utils.py:48: MatplotlibDeprecationWarning: Support for FigureCanvases without a required_interactive_framework attribute was deprecated in Matplotlib 3.6 and will be removed two minor releases later.
  fig, axis = plt.subplots()
Testing DataLoader 0: 100%|██████████| 3/3 [00:12<00:00,  4.15s/it]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       image_AUROC                  1.0
      image_F1Score                 1.0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

进程已结束,退出代码0

Code of Conduct

[X] I agree to follow this project's Code of Conduct

openvinotoolkit / anomalib