openvinotoolkit / anomalib

An anomaly detection library comprising state-of-the-art algorithms and features such as experiment management, hyper-parameter optimization, and edge inference.
https://anomalib.readthedocs.io/en/latest/
Apache License 2.0
3.63k stars 646 forks

Tile with PatchCore OutOfMemoryError: CUDA out of memory. #2207

Closed onkarkris closed 1 month ago

onkarkris commented 1 month ago

I am trying to run PatchCore with the tiling operation, as in the code `input_tensor = self.tiler.tile(input_tensor)`. I define `self.tiler = Tiler(tile_size=448, stride=400)`. The input tensor has shape `torch.Size([1, 3, 1024, 1024])` before tiling and `torch.Size([9, 3, 448, 448])` after tiling.
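For reference, the tile count reported above can be reproduced with a small calculation. This is only a sketch: it assumes the tiler pads the image so that the last tile in each direction fits completely, which is consistent with the shapes in this thread but is not taken from anomalib's source. The function names are hypothetical.

```python
import math


def tiles_per_side(image_size: int, tile_size: int, stride: int) -> int:
    """Number of tiles along one axis, assuming the image is padded so the last tile fits."""
    if tile_size >= image_size:
        return 1
    return math.ceil((image_size - tile_size) / stride) + 1


def tiles_per_image(height: int, width: int, tile_size: int, stride: int) -> int:
    """Total number of square tiles cut from one image."""
    return tiles_per_side(height, tile_size, stride) * tiles_per_side(width, tile_size, stride)


# 1024x1024 image, tile_size=448, stride=400 gives a 3x3 grid
print(tiles_per_image(1024, 1024, tile_size=448, stride=400))  # 9
```

Each tile becomes its own entry in the batch dimension, so one 1024x1024 image turns into a batch of 9 tiles here.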

As soon as training starts, it gives this error:

Tile with PatchCore OutOfMemoryError: CUDA out of memory.

Config in my case:

    seed_everything: true
    trainer:
      accelerator: auto
      strategy: auto
      devices: auto
      num_nodes: 1
      precision: null
      logger: null
      callbacks: null
      fast_dev_run: false
      max_epochs: null
      min_epochs: null
      max_steps: -1
      min_steps: null
      max_time: null
      limit_train_batches: null
      limit_val_batches: null
      limit_test_batches: null
      limit_predict_batches: null
      overfit_batches: 0.0
      val_check_interval: null
      check_val_every_n_epoch: 1
      num_sanity_val_steps: null
      log_every_n_steps: null
      enable_checkpointing: null
      enable_progress_bar: null
      enable_model_summary: null
      accumulate_grad_batches: 1
      gradient_clip_val: null
      gradient_clip_algorithm: null
      deterministic: null
      benchmark: null
      inference_mode: true
      use_distributed_sampler: true
      profiler: null
      detect_anomaly: false
      barebones: false
      plugins: null
      sync_batchnorm: false
      reload_dataloaders_every_n_epochs: 0
    normalization:
      normalization_method: MIN_MAX
    task: SEGMENTATION
    metrics:
      image:

abc-125 commented 1 month ago

Hello, can you please try to run it on fewer images or at a lower resolution? The problem is probably due to the high memory requirements of PatchCore: during training, it processes the features extracted from the whole training dataset.

onkarkris commented 1 month ago

Thanks for your reply! In my case, the anomaly sizes are very small. Running on a lower resolution will reduce them to just a few pixels (10-20 pixels). I'm curious if anyone has successfully run PatchCore with tile operations.

abc-125 commented 1 month ago

Does it work for you without tiling? Again, this may be a PatchCore problem caused by the high memory requirements of the model.

alexriedel1 commented 1 month ago

You need more GPU memory, I guess, or a smaller image resolution, or fewer images. Many images + high resolution = lots of GPU memory.

JinYuannn commented 1 month ago

you can reduce coreset_sampling_ratio to 0.01 or 0.001, but this may lead to poor performance.

samet-akcay commented 1 month ago

I agree with the insights by @abc-125, @alexriedel1 and @JinYuannn. This is probably not a bug, but rather a consequence of how PatchCore works.

You could potentially try the tiled ensemble, but I'm not sure about the current state of the PR. @blaz-r, any insights? https://github.com/openvinotoolkit/anomalib/pull/1226

blaz-r commented 1 month ago

The tiled ensemble PR is still not 100% there. The training with v1 should work, but it's not all tested. I'm really busy right now but I'll try to get that sorted asap.

alexriedel1 commented 1 month ago

> you can reduce coreset_sampling_ratio to 0.01 or 0.001, but this may lead to poor performance.

This probably won't help, because all the image embeddings need to fit in GPU memory before the actual coreset sampling.
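A rough back-of-the-envelope estimate shows why the embeddings alone can fill a GPU before any sampling happens. The sketch below assumes a `wide_resnet50_2` backbone with `layer2` and `layer3` features (1536 channels after concatenation, one embedding per 8x8 pixel block) stored as float32. These values are assumptions, not numbers reported in this thread, and the function name is hypothetical; adjust the parameters to your model.

```python
def embedding_bank_bytes(
    num_images: int,
    tiles_per_image: int,
    tile_size: int,
    feature_stride: int = 8,    # assumed: layer2 feature map is 1/8 of the input size
    embedding_dim: int = 1536,  # assumed: 512 (layer2) + 1024 (layer3) channels
    bytes_per_value: int = 4,   # float32
) -> int:
    """Approximate size of all patch embeddings collected before coreset sampling."""
    patches_per_tile = (tile_size // feature_stride) ** 2
    num_patches = num_images * tiles_per_image * patches_per_tile
    return num_patches * embedding_dim * bytes_per_value


# 60 images of 1024x1024, cut into 64 tiles of 128x128 each
tiled = embedding_bank_bytes(num_images=60, tiles_per_image=64, tile_size=128)
# The same 60 images processed whole (one "tile" of 1024x1024)
untiled = embedding_bank_bytes(num_images=60, tiles_per_image=1, tile_size=1024)

print(f"tiled:   {tiled / 2**30:.3f} GiB")
print(f"untiled: {untiled / 2**30:.3f} GiB")
```

Under these assumptions, non-overlapping tiles produce the same number of patch embeddings as the untiled images, so tiling inside a single model does not shrink what has to be held before sampling, and `coreset_sampling_ratio` only affects what remains afterwards.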

onkarkris commented 1 month ago

@samet-akcay @alexriedel1 @blaz-r @abc-125

I think changing the hyperparameters can't solve the issue. By design, it's difficult to use tiling with PatchCore. I'm just wondering if anyone has successfully used tiling or ensemble tiling with PatchCore on 1024-sized images (50-60 images)?

Experiment details:

Experiment 1: Anomalib
Code used: https://github.com/openvinotoolkit/anomalib/blob/d1f824a5798262891dbbe583fb291e1cf9aa7d2a/src/anomalib/models/image/patchcore/torch_model.py, line 75: `input_tensor = self.tiler.tile(input_tensor)`

`self.tiler = Tiler(tile_size=128, stride=128)`, `coreset_sampling_ratio: 0.01`
Image before/after tiling: `torch.Size([1, 3, 1024, 1024])` ---> `torch.Size([64, 3, 128, 128])`
Error: image

`self.tiler = Tiler(tile_size=248, stride=248)`, `coreset_sampling_ratio: 0.001`
Image before/after tiling: `torch.Size([1, 3, 1024, 1024])` ---> `torch.Size([25, 3, 248, 248])`
Error: image

Experiment 2: Official PatchCore repository
Code used: tiling operation incorporated into the official PatchCore repository, https://github.com/amazon-science/patchcore-inspection

`tiler = Tiler(tile_size=128, stride=128)`, `coreset_sampling_ratio: 0.01`
Image before/after tiling: `torch.Size([1, 3, 1024, 1024])` ---> `torch.Size([64, 3, 128, 128])`
Error: image

Experiment 3: PatchCore ensemble
Code used: as mentioned in https://github.com/openvinotoolkit/anomalib/pull/1226/files/dac985fac8cba8644c56cf9c5dd58adaeb651afd#diff-21aa73a1cc1f2758f5b4d44477721b0b6fd00b78976dd91ea019dd9d95624e52

`tiler = Tiler(tile_size=128, stride=128)`
Error:

    File "/home/ubuntu/anomalib_tile/tools/tiled_ensemble/ensemble_functions.py", line 73, in __call__
      coll_batch["image"] = tiled_images[self.tile_index]
    IndexError: index 8 is out of bounds for dimension 1 with size 8

alexriedel1 commented 1 month ago

As said before, the problem is your limited GPU memory. 60 images of 1024x1024 become 3840 tiles of 128x128, and that's a lot. Try to find the maximum number of images your GPU memory can handle: start with, for example, 5 of the tiled 1024x1024 images and then increase from there.
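The "start small and increase" search can be sketched as a simple loop. This is only an illustration: `fits` is a hypothetical callable that you would implement yourself, for example by training on the first `n` images and returning `False` when `torch.cuda.OutOfMemoryError` is raised. The stand-in below only simulates that behaviour.

```python
from typing import Callable


def largest_fitting_subset(
    fits: Callable[[int], bool],
    start: int = 5,
    limit: int = 60,
    step: int = 5,
) -> int:
    """Return the largest tried image count for which fits(n) is True, or 0 if none fit."""
    best = 0
    n = start
    while n <= limit:
        if not fits(n):
            break  # first failure: stop and keep the last size that worked
        best = n
        n += step
    return best


# Stand-in for a real training attempt; pretends that 23 images is the limit.
def fake_fits(num_images: int) -> bool:
    return num_images <= 23


print(largest_fitting_subset(fake_fits))  # 20
```

A smaller `step` gives a tighter estimate at the cost of more training attempts.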

blaz-r commented 1 month ago

For such large images, the tiled ensemble solves the issue: I was able to train with 1024x1024 images and 256x256 tiles just fine. However, the current refactor for v1 is still not finished. I'll try to get that completed asap; in the meantime, you can use the pre-v1 version. I'm also still trying to reproduce the bug you reported in #2073.

onkarkris commented 1 month ago

Thanks! Using a tiled ensemble solved the issue. I can run PatchCore with all images at 1024x1024 after turning off the center crop, as in #2073.

blaz-r commented 1 month ago

Great 😃

samet-akcay commented 1 month ago

Great, closing this as it can be followed here https://github.com/openvinotoolkit/anomalib/issues/1727