openvinotoolkit / anomalib

An anomaly detection library comprising state-of-the-art algorithms and features such as experiment management, hyper-parameter optimization, and edge inference.
https://anomalib.readthedocs.io/en/latest/
Apache License 2.0

[Bug]: Patchcore exported ONNX file is not usable #967

Closed MhdKAT closed 1 year ago

MhdKAT commented 1 year ago

Describe the bug

Hi all! Has anyone tried running inference with the PatchCore ONNX model exported from anomalib, for example with onnxruntime? The model is apparently buggy, as it asks for an insane amount of memory (I haven't been able to run it on an 80 GB machine on CPU, for example). The error I keep getting is:

onnxruntime/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 288601669632.

The problematic layer is apparently a Sub node (screenshot of the graph omitted). Does anyone have a clue how to fix this, or is there a workaround?

Dataset

Folder

Model

PatchCore

Steps to reproduce the behavior

1. Install anomalib
2. Train a PatchCore model
3. Try to run inference with the exported ONNX model (see the sketch below)
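
A minimal sketch of step 3 with onnxruntime (the export path is hypothetical; substitute your run's output location):

import numpy as np
import onnxruntime as ort

# Load the exported PatchCore model (placeholder path) and run one dummy image.
sess = ort.InferenceSession("results/patchcore/folder/weights/onnx/model.onnx")
input_name = sess.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # match the training input size
outputs = sess.run(None, {input_name: x})  # fails here with the BFCArena allocation error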

OS information


Expected behavior

The ONNX model exported from PatchCore should be usable for inference (currently it is not).

Screenshots

No response

Pip/GitHub

pip

What version/branch did you use?

0.4.0

Configuration YAML

model:
  name: patchcore
  backbone: wide_resnet50_2
  pre_trained: true
  layers:
    - layer2
    - layer3
  coreset_sampling_ratio: 0.1
  num_neighbors: 9
  normalization_method: min_max # options: [null, min_max, cdf]

metrics:
  image:
    - F1Score
    - AUROC
  pixel:
    - F1Score
    - AUROC
  threshold:
    method: adaptive #options: [adaptive, manual]
    manual_image: null
    manual_pixel: null

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: True # log images to the available loggers (if any)
  image_save_path: null # path to which images will be saved
  mode: full # options: ["full", "simple"]

project:
  seed: 42
  path: ./results

logging:
  logger: [] # options: [comet, tensorboard, wandb, csv] or combinations.
  log_graph: false # Logs the model graph to respective logger.

optimization:
  export_mode: onnx #options: onnx, openvino

# PL Trainer Args. Don't add extra parameter here.
trainer:
  enable_checkpointing: true
  default_root_dir: null
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  num_nodes: 1
  devices: 1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1 # Don't validate before extracting features.
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 1
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  val_check_interval: 1.0 # Don't validate before extracting features.
  log_every_n_steps: 50
  accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  strategy: null
  sync_batchnorm: false
  precision: 32
  enable_model_summary: true
  num_sanity_val_steps: 0
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_n_epochs: 0
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  auto_scale_batch_size: false
  plugins: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle
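
For reference, this config is consumed by the standard 0.4.x training entrypoint; below is a condensed sketch of what tools/train.py does with it. The names are taken from that script, but treat the exact signatures as assumptions, and note that the dataset section of the config is not shown in this report.

from pytorch_lightning import Trainer

from anomalib.config import get_configurable_parameters
from anomalib.data import get_datamodule
from anomalib.models import get_model
from anomalib.utils.callbacks import get_callbacks

config = get_configurable_parameters(config_path="patchcore_config.yaml")  # the YAML above (placeholder path)
datamodule = get_datamodule(config)  # driven by the Folder dataset section
model = get_model(config)
callbacks = get_callbacks(config)  # includes the export callback when export_mode is set

trainer = Trainer(**config.trainer, callbacks=callbacks)
trainer.fit(model=model, datamodule=datamodule)  # ONNX is written after training completes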

Logs

onnxruntime/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 288601669632.


blaz-r commented 1 year ago

Hello, I'm not sure what the problem is, but I can reproduce this. It definitely isn't normal that the model requests 288 GB of memory, but I'm not entirely sure that this node is the only problem.

dka-lmis commented 1 year ago

I've encountered the same issue.

My current workaround is to either use the .ckpt file with the PyTorch Lightning interpreter, or to extract the memory_bank as a numpy array, define a custom PatchcoreModel wrapper, and manually load and store the memory_bank as a tensor (a sketch of the second option follows below).
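
A minimal sketch of the second option, one way to do it (the checkpoint path, constructor arguments, and state-dict key are assumptions based on the 0.4.x layout):

import torch
from anomalib.models.patchcore.torch_model import PatchcoreModel

# Pull the coreset out of the Lightning checkpoint (placeholder path).
ckpt = torch.load("results/patchcore/folder/weights/model.ckpt", map_location="cpu")
memory_bank = ckpt["state_dict"]["model.memory_bank"]  # assumed key per the 0.4.x module layout

model = PatchcoreModel(
    input_size=(224, 224),
    layers=["layer2", "layer3"],
    backbone="wide_resnet50_2",
    num_neighbors=9,
)
model.memory_bank = memory_bank  # restore the buffer by hand
model.eval()  # eval-mode forward returns (anomaly_map, anomaly_score) in 0.4.x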

I was only able to run the ONNX file on a server with 64 GB of RAM when using an input size of 224 and a sampling ratio of 0.01, which gives ~6.7k entries (shape 6700x1536) in the memory_bank.

Using different Docker images or runtimes (OpenVINO or ONNX Runtime) made no difference. I tried these with a 3090: https://github.com/microsoft/onnxruntime/blob/main/dockerfiles/Dockerfile.cuda and nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04.

hgaiser commented 1 year ago

I'm running into the same issue and trying to debug what is happening. I noticed that my memory bank is shaped [48000, 1536] (images are shaped [400, 400]) and that the amount of memory it tries to allocate is 737280000000 bytes. It probably isn't a coincidence that 48000 * 1536 = 73728000. I would expect it to allocate 48000 * 1536 * 4 bytes though, since the memory bank has dtype float32, not 48000 * 1536 * 10000 bytes. I'm trying to find out where this issue comes from. I'll let you know if I find anything, but wanted to share my findings in the meantime.
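
A quick arithmetic check of those numbers (the factor of 2500 turns out to be the number of query patches, as the next comment shows):

# Pure arithmetic, reproducing the numbers above.
bank_rows, bank_cols = 48_000, 1_536
expected = bank_rows * bank_cols * 4  # float32 memory bank: 294,912,000 bytes (~295 MB)
requested = 737_280_000_000  # what onnxruntime tried to allocate
print(requested // expected)  # 2500 -> one full float32 copy of the bank per query patch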

hgaiser commented 1 year ago

I've narrowed it down quite a bit to this line:

https://github.com/openvinotoolkit/anomalib/blob/main/src/anomalib/models/patchcore/torch_model.py#L191

It seems that in the ONNX representation (and apparently also in OpenVINO?), the input to cdist is shaped [1, 48000, 1536] (called %onnx::Sub_770 in ONNX), whereas according to PyTorch it is shaped [48000, 1536]. The other input (%/Reshape_output_0) is shaped [2500, 1536], but gets unsqueezed to [2500, 1, 1536] (presumably to match the other input). The computed output shape of cdist is then [2500, 48000, 1536], which is way too large: in float32 that is 2500 * 48000 * 1536 * 4 = 737280000000 bytes, exactly the failed allocation.

I believe the Sub_770 tensor should be squeezed so that it doesn't have three dimensions, but I'm not sure where this happens. I'm not entirely sure about the details, but at the moment I have this diff (thanks to https://github.com/openvinotoolkit/anomalib/issues/440#issuecomment-1191184221):

diff --git a/src/anomalib/models/patchcore/torch_model.py b/src/anomalib/models/patchcore/torch_model.py
index 7f4f11f5..00185d43 100644
--- a/src/anomalib/models/patchcore/torch_model.py
+++ b/src/anomalib/models/patchcore/torch_model.py
@@ -18,6 +18,14 @@ from anomalib.models.patchcore.anomaly_map import AnomalyMapGenerator
 from anomalib.pre_processing import Tiler

+def my_cdist(x1, x2):
+    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
+    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
+    res = torch.addmm(x2_norm.transpose(-2, -1), x1, x2.transpose(-2, -1), alpha=-2).add_(x1_norm)
+    res = res.clamp_min_(1e-30).sqrt_()
+    return res
+
+
 class PatchcoreModel(DynamicBufferModule, nn.Module):
     """Patchcore Module."""

@@ -153,7 +161,7 @@ class PatchcoreModel(DynamicBufferModule, nn.Module):
             Tensor: Patch scores.
             Tensor: Locations of the nearest neighbor(s).
         """
-        distances = torch.cdist(embedding, self.memory_bank, p=2.0)  # euclidean norm
+        distances = my_cdist(embedding, self.memory_bank)  # euclidean norm
         if n_neighbors == 1:
             # when n_neighbors is 1, speed up computation by using min instead of topk
             patch_scores, locations = distances.min(1)
@@ -188,7 +196,7 @@ class PatchcoreModel(DynamicBufferModule, nn.Module):
         # indices of N_b(m^*) in the paper
         _, support_samples = self.nearest_neighbors(nn_sample, n_neighbors=self.num_neighbors)
         # 4. Find the distance of the patch features to each of the support samples
-        distances = torch.cdist(max_patches_features.unsqueeze(1), self.memory_bank[support_samples], p=2.0)
+        distances = my_cdist(max_patches_features, self.memory_bank[support_samples].squeeze())
         # 5. Apply softmax to find the weights
         weights = (1 - F.softmax(distances.squeeze(1), 1))[..., 0]
         # 6. Apply the weight factor to the score

I haven't yet checked if the output of the ONNX model is correct, but at the very least it runs. I will check the output tomorrow.

blaz-r commented 1 year ago

@hgaiser thanks for this. This does seem to be the problem, and from what I see here I think cdist might even work, but the unsqueeze on dimension 1 needs to be changed and a squeeze added to the memory bank: basically what you did with squeezing, but keeping cdist. I'm not sure this would really work, as I'll need to test it and I'm not that familiar with the PatchCore code and cdist, but I'll test it and report back. It'd also be great to hear your findings with your function.

hgaiser commented 1 year ago

> @hgaiser thanks for this. This does seem to be the problem, and from what I see here I think cdist might even work, but the unsqueeze on dimension 1 needs to be changed and a squeeze added to the memory bank: basically what you did with squeezing, but keeping cdist. I'm not sure this would really work, as I'll need to test it and I'm not that familiar with the PatchCore code and cdist, but I'll test it and report back. It'd also be great to hear your findings with your function.

I thought the same and tried that, but it still allocated the wrong amount of memory. My model with the patch applied seems to behave the same as the PyTorch version, but I need to fix something in the training itself. The patch seems to resolve my issue, so my investigation stops here for now, since I have limited time to work on this. I hope someone can pick it up and turn it into a proper fix.

blaz-r commented 1 year ago

Good to hear that it works. We'll see where to go from here and try to implement and test this fix. Thanks for all the input :)

jasonvanzelm commented 1 year ago

So, in case anyone else is still struggling with this: the my_cdist above breaks during training because torch.addmm does not accept batched inputs. Replacing my_cdist with something like:

def euclidean_norm(x1, x2):
    # ||x1 - x2||^2 expanded as ||x1||^2 - 2*x1.x2 + ||x2||^2; matmul broadcasts over batch dims.
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = x1_norm - 2 * torch.matmul(x1, x2.transpose(-2, -1)) + x2_norm.transpose(-2, -1)
    res = res.clamp_min_(0).sqrt_()  # clamp guards against tiny negatives from rounding
    return res

makes everything work for me. As far as I can tell, torch implements cdist for p=2 in a similar way, but the ONNX export of torch.cdist seems to become something like:

def onnx_cdist(x1, x2):
    x1 = x1.unsqueeze(-2)
    return (x1 - x2).pow(2).sum(dim=-1).sqrt()
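
That broadcasted subtraction materializes an intermediate of shape (n1, n2, d), which for the numbers earlier in the thread is exactly the ~737 GB float32 tensor ONNX Runtime tried to allocate. As a quick sanity check that euclidean_norm matches torch.cdist, including over a batch dimension (a sketch with toy random tensors; the tolerance is a guess for float32 rounding):

import torch

# euclidean_norm as defined above.
x1 = torch.randn(4, 2500, 64)  # batched query patches (toy sizes)
x2 = torch.randn(4, 480, 64)  # toy memory bank, batched the same way

ref = torch.cdist(x1, x2, p=2.0)
ours = euclidean_norm(x1, x2)
print(torch.allclose(ours, ref, atol=1e-3))  # True up to float32 rounding
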
laogonggong847 commented 1 year ago

@hgaiser, @jasonvanzelm, @blaz-r, @alexriedel1: thank you for providing a solution to the problem with using PatchCore's ONNX model. I noticed in your comments that your programs worked correctly after modifying ./src/models/patchcore/torch_model.py, so I tried the same modifications, as follows:

1: I added the **my_cdist** function:

def my_cdist(x1, x2):
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = torch.addmm(x2_norm.transpose(-2, -1), x1, x2.transpose(-2, -1), alpha=-2).add_(x1_norm)
    res = res.clamp_min_(1e-30).sqrt_()
    return res

2: I made two changes in **./src/models/patchcore/torch_model.py** where torch.cdist is used:

        # distances = torch.cdist(embedding, self.memory_bank, p=2.0)  # commented out
        distances = my_cdist(embedding, self.memory_bank)  # replacement

        # distances = torch.cdist(max_patches_features.unsqueeze(1), self.memory_bank[support_samples], p=2.0)  # commented out
        distances = my_cdist(max_patches_features, self.memory_bank[support_samples].squeeze())  # replacement

I have made sure the modifications match what you described. My PyTorch is version 1.12.1 (CUDA 11.3). But when I run train.py with the modified code, the program runs for a while and then fails. The details are as follows:



The specific error message (shown in the screenshots) is:

my_cdist
    res = torch.addmm(x2_norm.transpose(-2, -1), x1, x2.transpose(-2, -1), alpha=-2).add_(x1_norm)
RuntimeError: mat2 must be a matrix, got 3-D tensor

Why do I have this problem? Am I missing something that still needs to be changed? I look forward to hearing from you, thank you!

jasonvanzelm commented 1 year ago

If you replace my_cdist with:

def my_cdist(x1, x2):
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = x1_norm - 2 * torch.matmul(x1,x2.transpose(-2,-1)) + x2_norm.transpose(-2, -1)
    res = res.clamp_min_(0).sqrt_()
    return res

it should work. (The issue is that torch.addmm does not accept batch dimensions, so replacing it with torch.matmul and a separate addition solves the problem.)
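
To make the failure mode concrete, here is a minimal repro of the difference (toy shapes; the exact error text may vary by torch version):

import torch

mat1 = torch.randn(5, 3)  # 2-D: fine for addmm
mat2 = torch.randn(2, 3, 4)  # 3-D batch, like memory_bank[support_samples] during training
bias = torch.randn(5, 4)

out = torch.matmul(mat1, mat2) + bias  # works: matmul broadcasts over the batch dim
print(out.shape)  # torch.Size([2, 5, 4])

try:
    torch.addmm(bias, mat1, mat2)  # addmm is strictly 2-D
except RuntimeError as err:
    print(err)  # e.g. "mat2 must be a matrix, got 3-D tensor"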

hgaiser commented 1 year ago

I tried exporting again last week and ran into the same issue. I believe I fixed it by only using the custom cdist function in compute_anomaly_score and not in nearest_neighbors.

However, if the above solution also works, it's probably a safer bet :).

laogonggong847 commented 1 year ago

> If you replace my_cdist with:
>
> def my_cdist(x1, x2):
>     x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
>     x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
>     res = x1_norm - 2 * torch.matmul(x1, x2.transpose(-2, -1)) + x2_norm.transpose(-2, -1)
>     res = res.clamp_min_(0).sqrt_()
>     return res
>
> it should work. (The issue is that torch.addmm does not accept batch dimensions, so replacing it with torch.matmul and a separate addition solves the problem.)

@jasonvanzelm Thank you very much for your patient reply. When I replace my_cdist with the latest version you provided, should I call it with the arguments from the official torch_model.py code, or with the arguments from @hgaiser's modification?

The official code computes the distances in two places in torch_model.py:

        1: distances = torch.cdist(embedding, self.memory_bank, p=2.0)

        2: distances = torch.cdist(max_patches_features.unsqueeze(1), self.memory_bank[support_samples], p=2.0)

The corresponding calls provided by @hgaiser:

        1: distances = my_cdist(embedding, self.memory_bank)

        2: distances = my_cdist(max_patches_features, self.memory_bank[support_samples].squeeze())

Perhaps I should have been more precise about my question:

1: After I update my my_cdist function to the one you provided, should the arguments I pass at the two call sites stay the same as the official ones, as follows?

        1: distances = my_cdist(embedding, self.memory_bank)

        2: distances = my_cdist(max_patches_features.unsqueeze(1), self.memory_bank[support_samples])

        # The arguments passed by the official code and by @hgaiser differ in call 2. With the updated my_cdist, the arguments should match the official ones, right?

2: Another question is whether this modification might affect PatchCore's performance.

@jasonvanzelm Thank you for your patience and explanation, thank you very much!

laogonggong847 commented 1 year ago

> I tried exporting again last week and ran into the same issue. I believe I fixed it by only using the custom cdist function in compute_anomaly_score and not in nearest_neighbors.
>
> However, if the above solution also works, it's probably a safer bet :).

Hello @hgaiser, thank you very much for your answer. @jasonvanzelm has just provided another solution idea, which you could also try. I am testing it now and hope the modification works on the first try.

Anomalib provides numerous methods, and PatchCore seems to perform well among the models that need only a single training pass. But when I checked the related material against the original paper, I found that PaDiM with a wide_resnet50_2 backbone also performs well. Strangely, though, the model cannot be trained correctly after I set PaDiM's backbone to wide_resnet50_2; you can refer to #1045 for more details.

I wonder if you can offer some ideas on this problem as well. Thank you very much!

bellenfanttyler commented 1 year ago

Hey everybody, I am also having the issue above when exporting to ONNX. I've tried:

> I tried exporting again last week and ran into the same issue. I believe I fixed it by only using the custom cdist function in compute_anomaly_score and not in nearest_neighbors.
>
> However, if the above solution also works, it's probably a safer bet :).

and I can get the model to train and run inference just fine using the PyTorch version. When using the exported ONNX file of the same model on an NVIDIA Triton inference server, I still get the memory error (onnxruntime/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 288601669632.). Have @hgaiser or @jasonvanzelm seen the same with your ONNX exports? I appreciate any help!

bellenfanttyler commented 1 year ago

@laogonggong847 did you get any of the above combinations to work? I've tried every permutation and the ONNX model still hits the memory issue at inference. I can get training to run to completion and the torch model runs through testing fine, but I still get the memory error (onnxruntime/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 288601669632.).

blaz-r commented 1 year ago

@bellenfanttyler one thing you also need to take care of when using the modified cdist above is to change the arguments at the call sites as well. Did you adjust that too?

bellenfanttyler commented 1 year ago

> @bellenfanttyler one thing you also need to take care of when using the modified cdist above is to change the arguments at the call sites as well. Did you adjust that too?

@blaz-r I appreciate the response! I did change the arguments when trying to get this to work, and I was able to get a successful ONNX export today without the memory error during inference. What ended up working was a combination of version changes and using the custom my_cdist function:

def my_cdist(x1, x2):
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = x1_norm - 2 * torch.matmul(x1,x2.transpose(-2,-1)) + x2_norm.transpose(-2, -1)
    res = res.clamp_min_(0).sqrt_()
    return res

only in nearest_neighbors and not in compute_anomaly_score. For nearest_neighbors I changed the arguments according to the diff @hgaiser provided above (distances = my_cdist(embedding, self.memory_bank)). I still have to validate the scores/outputs, but it appears to be working. Thanks to everyone above for putting in the legwork to help reach a solution!
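
For anyone landing here later, the combination above condensed into one sketch (against the 0.4.x src/anomalib/models/patchcore/torch_model.py; this is a sketch of what worked for me, not a vetted upstream patch):

import torch

def my_cdist(x1, x2):
    # Expansion-based Euclidean distance; exports to ONNX without the huge broadcast.
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = x1_norm - 2 * torch.matmul(x1, x2.transpose(-2, -1)) + x2_norm.transpose(-2, -1)
    return res.clamp_min_(0).sqrt_()

# In PatchcoreModel.nearest_neighbors, replace the torch.cdist call with:
#     distances = my_cdist(embedding, self.memory_bank)
# and leave compute_anomaly_score on torch.cdist as in upstream.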