@Jia-Baos, I cannot reproduce this issue. Here is what I get when I run patchcore
Could it be that, with your hardware configuration, validation simply takes a long time?
To double check this, you could change the model to
```yaml
model:
  name: patchcore
  backbone: resnet18
  pre_trained: true
  layers:
    - layer2
    - layer3
  coreset_sampling_ratio: 0.1
  num_neighbors: 9
  normalization_method: min_max # options: [null, min_max, cdf]
```
or
```yaml
model:
  name: patchcore
  backbone: resnet18
  pre_trained: true
  layers:
    - layer3
  coreset_sampling_ratio: 0.1
  num_neighbors: 9
  normalization_method: min_max # options: [null, min_max, cdf]
```
to make the model more lightweight.
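For reference, here is roughly how such a modified config can be used to launch a run. This is a minimal sketch based on the anomalib 0.x `tools/train.py` flow; the config path is a placeholder, and the exact imports may differ slightly in your installed version. Lowering `dataset.num_workers` also addresses the DataLoader warning about creating 8 workers on a 4-core machine.

```python
from pytorch_lightning import Trainer

from anomalib.config import get_configurable_parameters
from anomalib.data import get_datamodule
from anomalib.models import get_model
from anomalib.utils.callbacks import get_callbacks

# "my_patchcore.yaml" is a placeholder path for the lighter config above.
config = get_configurable_parameters(model_name="patchcore", config_path="my_patchcore.yaml")
config.dataset.num_workers = 4  # avoid the "8 worker processes" DataLoader warning

datamodule = get_datamodule(config)   # MVTec bottle, per the dataset section of the config
model = get_model(config)             # PatchCore with the resnet18 backbone
callbacks = get_callbacks(config)

trainer = Trainer(**config.trainer, callbacks=callbacks)
trainer.fit(model=model, datamodule=datamodule)
```

Alternatively, if your checkout supports it, running `python tools/train.py --config path/to/my_patchcore.yaml` should do the same thing.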
Thank you so much. I have adopted your recommendations and changed the model. You're right, validation just needs a long time to complete.
We have just merged PR #580, which partially addresses this. See #268 and #533.
I'll be converting this to a Q&A in Discussions. Feel free to continue from there. Cheers!
Describe the bug
When I use PatchCore to train on data (MVTec bottle), an error appears, like this: Validation: 0it [00:00, ?it/s], and the process cannot continue.
To Reproduce
Steps to reproduce the behavior:
nothing
Expected behavior
```
C:\Users\fx50j.conda\envs\anomalib_env\python.exe D:/PythonProject/anomalib/tools/MyTest.py
1.12.0+cpu None None False 0
Transform configs has not been provided. Images will be normalized using ImageNet statistics.
Transform configs has not been provided. Images will be normalized using ImageNet statistics.
C:\Users\fx50j.conda\envs\anomalib_env\lib\site-packages\torch\utils\data\dataloader.py:557: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4 (cpuset is not taken into account), which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
dict_keys(['image', 'image_path', 'label', 'mask_path', 'mask'])
torch.Size([1, 3, 224, 224]) torch.Size([1, 224, 224])
C:\Users\fx50j.conda\envs\anomalib_env\lib\site-packages\torchmetrics\utilities\prints.py:36: UserWarning: Metric `PrecisionRecallCurve` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
D:\PythonProject\anomalib\anomalib\utils\callbacks\__init__.py:133: UserWarning: Export option: None not found. Defaulting to no model export
  warnings.warn(f"Export option: {config.optimization.export_mode} not found. Defaulting to no model export")
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
`Trainer(limit_test_batches=1.0)` was configured so 100% of the batches will be used..
`Trainer(limit_predict_batches=1.0)` was configured so 100% of the batches will be used..
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
Missing logger folder: results\patchcore\mvtec\bottle\lightning_logs
C:\Users\fx50j.conda\envs\anomalib_env\lib\site-packages\torchmetrics\utilities\prints.py:36: UserWarning: Metric `ROC` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
C:\Users\fx50j.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\optimizer.py:183: UserWarning: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer
  rank_zero_warn(

  | Name                  | Type                     | Params
--------------------------------------------------------------
0 | image_threshold       | AdaptiveThreshold        | 0
1 | pixel_threshold       | AdaptiveThreshold        | 0
2 | model                 | PatchcoreModel           | 24.9 M
3 | image_metrics         | AnomalibMetricCollection | 0
4 | pixel_metrics         | AnomalibMetricCollection | 0
5 | normalization_metrics | MinMax                   | 0
--------------------------------------------------------------
24.9 M    Trainable params
0         Non-trainable params
24.9 M    Total params
99.450    Total estimated model params size (MB)

C:\Users\fx50j.conda\envs\anomalib_env\lib\site-packages\torch\utils\data\dataloader.py:557: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4 (cpuset is not taken into account), which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
C:\Users\fx50j.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py:1933: PossibleUserWarning: The number of training batches (7) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 0:   1%|          | 1/90 [01:07<1:40:34, 67.80s/it, loss=nan, v_num=0]
C:\Users\fx50j.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py:137: UserWarning: `training_step` returned `None`. If this was on purpose, ignore this warning...
  self.warning_cache.warn("`training_step` returned `None`. If this was on purpose, ignore this warning...")
Epoch 0:   8%|▊         | 7/90 [01:51<22:00, 15.91s/it, loss=nan, v_num=0]
Validation: 0it [00:00, ?it/s]
```
Screenshots
Hardware and Software Configuration
Additional context
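A note on why the run looks stalled: PatchCore is a memory-bank method, so there is nothing to optimize during the training loop. The sketch below (a simplified, hypothetical stand-in, not the actual anomalib source) illustrates why the log shows loss=nan, `training_step` returning `None`, and a fit with no optimizer, and why the heavy work (coreset subsampling and nearest-neighbour scoring) only starts around validation, which can take a very long time on CPU.

```python
class PatchcoreLikeModule:
    """Hypothetical, simplified stand-in for a PatchCore Lightning module."""

    def __init__(self, model):
        self.model = model      # CNN feature extractor + memory bank
        self.embeddings = []    # patch embeddings collected during "training"

    def configure_optimizers(self):
        return None             # no optimizer -> "this fit will run with no optimizer"

    def training_step(self, batch, batch_idx):
        embedding = self.model(batch["image"])  # just extract patch features
        self.embeddings.append(embedding)       # implicit None return -> loss=nan in the bar

    def on_validation_start(self):
        # Coreset subsampling over all collected embeddings builds the memory bank.
        # This, plus the per-patch nearest-neighbour search below, is the slow part on CPU.
        self.model.subsample_embedding(self.embeddings, sampling_ratio=0.1)

    def validation_step(self, batch, batch_idx):
        # Nearest-neighbour anomaly scoring against the memory bank.
        return self.model(batch["image"])
```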