Closed MhdKAT closed 1 year ago
Hello, I'm not sure what the problem is, but I can reproduce this. It definitely isn't normal for the model to request 288 GB of memory, but I'm not entirely sure whether that node is the only problem.
I've encountered the same issue. My current workaround is to either use the `.ckpt` file with the PyTorch Lightning interpreter, or extract the `memory_bank` as a NumPy array, define a custom PatchCoreModel wrapper, and manually load and store the `memory_bank` as a tensor.
I was only able to run the ONNX file on a server with 64 GB RAM when using an input size of 224 and a sampling rate of 0.01, so that I get ~6.7k entries (shape of 6700x1536) for the memory_bank.
Using different Docker images or interpreters (OpenVINO or ONNX Runtime) made no difference. I tried these with a 3090: https://github.com/microsoft/onnxruntime/blob/main/dockerfiles/Dockerfile.cuda and nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04
I'm running into the same issue and am trying to debug what is happening. I noticed that my memory bank is shaped `[48000, 1536]` (images are shaped `[400, 400]`) and that the amount of memory it tries to allocate is `737280000000` bytes. It probably isn't a coincidence that `48000 * 1536 = 73728000`. I would expect it to allocate `48000 * 1536 * 4` bytes though, as the memory bank has dtype `float32`, not `48000 * 1536 * 10000` bytes. I'm trying to find out where this issue comes from. I'll let you know if I find anything, but wanted to share my findings in the meantime.
I've narrowed it down quite a bit to this line:
It seems that in the ONNX representation (and apparently also OpenVINO?), the input for `cdist` is shaped `1, 48000, 1536` (called `%onnx::Sub_770` in ONNX), whereas it is shaped `48000, 1536` according to PyTorch. The other input (`%/Reshape_output_0`) is shaped `2500, 1536`, but seems to get unsqueezed to `2500, 1, 1536` (presumably to match the other input). The calculated output shape of `cdist` is then `2500, 48000, 1536`, which is way too large.
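The shapes reported above account exactly for the failed allocation: a `float32` tensor of shape `2500, 48000, 1536` occupies precisely the number of bytes the runtime asked for. A quick arithmetic check, using only the numbers quoted in this thread (the variable names are illustrative):

```python
# Shapes quoted in the comments above.
n_patches = 2500        # rows of the embedding (%/Reshape_output_0)
n_memory = 48000        # rows of the memory bank
n_features = 1536       # feature dimension
bytes_per_float32 = 4

# Size of the memory bank itself, shape (48000, 1536).
memory_bank_bytes = n_memory * n_features * bytes_per_float32

# Size of the broadcast intermediate, shape (2500, 48000, 1536).
broadcast_bytes = n_patches * n_memory * n_features * bytes_per_float32

# Size of the pairwise distance matrix that is actually needed, shape (2500, 48000).
distance_matrix_bytes = n_patches * n_memory * bytes_per_float32

print(memory_bank_bytes)      # 294912000
print(broadcast_bytes)        # 737280000000
print(distance_matrix_bytes)  # 480000000
```

The unexplained factor of 10000 noted earlier is therefore `2500 * 4`: the number of embedding rows times the size of a `float32`.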
I believe the `Sub_770` tensor should be squeezed so that it doesn't have three dimensions, but I'm not sure where this happens. I'm not entirely sure on the details, but at the moment I have this diff (thanks to https://github.com/openvinotoolkit/anomalib/issues/440#issuecomment-1191184221):
```diff
diff --git a/src/anomalib/models/patchcore/torch_model.py b/src/anomalib/models/patchcore/torch_model.py
index 7f4f11f5..00185d43 100644
--- a/src/anomalib/models/patchcore/torch_model.py
+++ b/src/anomalib/models/patchcore/torch_model.py
@@ -18,6 +18,14 @@ from anomalib.models.patchcore.anomaly_map import AnomalyMapGenerator
 from anomalib.pre_processing import Tiler
+def my_cdist(x1, x2):
+    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
+    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
+    res = torch.addmm(x2_norm.transpose(-2, -1), x1, x2.transpose(-2, -1), alpha=-2).add_(x1_norm)
+    res = res.clamp_min_(1e-30).sqrt_()
+    return res
+
+
 class PatchcoreModel(DynamicBufferModule, nn.Module):
     """Patchcore Module."""
@@ -153,7 +161,7 @@ class PatchcoreModel(DynamicBufferModule, nn.Module):
             Tensor: Patch scores.
             Tensor: Locations of the nearest neighbor(s).
         """
-        distances = torch.cdist(embedding, self.memory_bank, p=2.0)  # euclidean norm
+        distances = my_cdist(embedding, self.memory_bank)  # euclidean norm
         if n_neighbors == 1:
             # when n_neighbors is 1, speed up computation by using min instead of topk
             patch_scores, locations = distances.min(1)
@@ -188,7 +196,7 @@ class PatchcoreModel(DynamicBufferModule, nn.Module):
         # indices of N_b(m^*) in the paper
         _, support_samples = self.nearest_neighbors(nn_sample, n_neighbors=self.num_neighbors)
         # 4. Find the distance of the patch features to each of the support samples
-        distances = torch.cdist(max_patches_features.unsqueeze(1), self.memory_bank[support_samples], p=2.0)
+        distances = my_cdist(max_patches_features, self.memory_bank[support_samples].squeeze())
         # 5. Apply softmax to find the weights
         weights = (1 - F.softmax(distances.squeeze(1), 1))[..., 0]
         # 6. Apply the weight factor to the score
```
I haven't yet checked if the output of the ONNX model is correct, but at the very least it runs. I will check the output tomorrow.
@hgaiser thanks for this. This seems to be the problem, and from what I see here, I think cdist might even work, but that unsqueeze on dimension 1 needs to be changed as well as squeeze added to memory bank. Basically what you did with squeezing, but using cdist. Now I'm not sure if this would really work, as I'll need to test it and I'm not that familiar with patchcore code and cdist, but I'll test it and report back. It'd also be great to hear your findings with your function.
I thought the same and I tried that, but it still allocated the wrong size of memory. My model with the patch applied seems to work the same as the pytorch version, but I need to fix something in the training itself. The patch seems to resolve my issue, so my investigation stops here for now since I have limited time to work on this. I hope someone can pick it up and make a proper fix for it.
Good to hear that it works. We'll see what to do from here on and try to implement this fix and test it. Thanks for all the input :)
So, in case anyone else is still struggling with this: the above `my_cdist` breaks during training because `torch.addmm` does not take a batch dimension. Replacing `my_cdist` by something like:

```python
def euclidean_norm(x1, x2):
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = x1_norm - 2 * torch.matmul(x1, x2.transpose(-2, -1)) + x2_norm.transpose(-2, -1)
    res = res.clamp_min_(0).sqrt_()
    return res
```

makes everything work for me. As far as I can tell, torch implements `cdist` for `p=2` in a similar way, but the ONNX export of `torch.cdist` seems to become something like:

```python
def onnx_cdist(x1, x2):
    x1 = x1.unsqueeze(-2)
    return (x1 - x2).pow(2).sum(dim=-1).sqrt()
```
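To make the difference between the two formulations above concrete, here is a minimal sketch on small random tensors. The names `broadcast_cdist` and `matmul_cdist` and the tiny shapes are illustrative assumptions, not part of anomalib. Both functions should agree with `torch.cdist`, but the broadcasting version materialises an `(N, M, D)` intermediate, whereas the matmul version never needs anything larger than `(N, M)`:

```python
import torch


def broadcast_cdist(x1, x2):
    # Builds an (N, M, D) difference tensor before reducing over D.
    return (x1.unsqueeze(-2) - x2).pow(2).sum(dim=-1).sqrt()


def matmul_cdist(x1, x2):
    # Uses ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2; largest intermediate is (N, M).
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = x1_norm - 2 * torch.matmul(x1, x2.transpose(-2, -1)) + x2_norm.transpose(-2, -1)
    # Clamp guards against tiny negative values caused by rounding.
    return res.clamp_min(0).sqrt()


torch.manual_seed(0)
embedding = torch.randn(25, 16)    # small stand-in for the (2500, 1536) embedding
memory_bank = torch.randn(48, 16)  # small stand-in for the (48000, 1536) memory bank

reference = torch.cdist(embedding, memory_bank, p=2.0)
d_broadcast = broadcast_cdist(embedding, memory_bank)
d_matmul = matmul_cdist(embedding, memory_bank)

print(d_broadcast.shape, d_matmul.shape)
print(torch.allclose(d_broadcast, reference, atol=1e-3))
print(torch.allclose(d_matmul, reference, atol=1e-3))
```

At the sizes reported in this thread, the `(N, M, D)` intermediate is what pushes the allocation into the hundreds of gigabytes.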
@hgaiser, @jasonvanzelm, @blaz-r, @alexriedel1. Thank you for providing a solution to the problem with using PatchCore's ONNX model. I noticed that your comments mentioned that your program worked correctly after the modification to `./src/models/patchcore/torch_model.py`, so I tried the same modifications, as follows:

1: I added the __"my_cdist"__ function:

```python
def my_cdist(x1, x2):
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = torch.addmm(x2_norm.transpose(-2, -1), x1, x2.transpose(-2, -1), alpha=-2).add_(x1_norm)
    res = res.clamp_min_(1e-30).sqrt_()
    return res
```

2: I made two changes in __"./src/models/patchcore/torch_model.py"__ where `torch.cdist` is used:

```python
# distances = torch.cdist(embedding, self.memory_bank, p=2.0)  # commented out
distances = my_cdist(embedding, self.memory_bank)  # the replacement

# distances = torch.cdist(max_patches_features.unsqueeze(1), self.memory_bank[support_samples], p=2.0)  # commented out
distances = my_cdist(max_patches_features, self.memory_bank[support_samples].squeeze())  # the replacement
```
I have made sure that the modifications are the same as you described. My PyTorch version is 1.12.1 (CUDA 11.3). But when I run `train.py` with the modified code, the program runs for a while and then fails. The specific error message is:

```
my_cdist
res = torch.addmm(x2_norm.transpose(-2, -1), x1, x2.transpose(-2, -1), alpha=-2).add_(x1_norm)
RuntimeError: mat2 must be a matrix, got 3-D tensor
```

Why do I have this problem? Am I missing something that still needs to be changed? I look forward to hearing from you, thank you!
If you replace `my_cdist` by

```python
def my_cdist(x1, x2):
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = x1_norm - 2 * torch.matmul(x1, x2.transpose(-2, -1)) + x2_norm.transpose(-2, -1)
    res = res.clamp_min_(0).sqrt_()
    return res
```

it should work. (The issue is that `torch.addmm` does not take batch dimensions, so replacing it by `torch.matmul` and an additional addition solves this problem.)
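The batch-dimension point can be reproduced in isolation. The following is a small sketch with made-up shapes. They are assumptions that mirror the `(batch, 1, features)` and `(batch, neighbors, features)` layout of the original `torch.cdist` call in `compute_anomaly_score`, not values taken from anomalib. `torch.matmul` broadcasts over the leading batch dimension, while `torch.addmm` rejects anything that is not a 2-D matrix:

```python
import torch

batch, k, dim = 2, 9, 16
x1 = torch.randn(batch, 1, dim)  # e.g. max_patches_features.unsqueeze(1)
x2 = torch.randn(batch, k, dim)  # e.g. memory_bank[support_samples]

# torch.matmul treats the leading dimension as a batch dimension.
product = torch.matmul(x1, x2.transpose(-2, -1))
print(product.shape)  # torch.Size([2, 1, 9])

# torch.addmm only accepts 2-D matrices, so the same batched call fails.
try:
    torch.addmm(torch.zeros(1, k), x1, x2.transpose(-2, -1))
    addmm_failed = False
except RuntimeError as err:
    addmm_failed = True
    print(f"addmm raised: {err}")
```

This matches the `RuntimeError` reported above: the failure only appears once a tensor with a real batch dimension reaches `torch.addmm`.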
I tried exporting again last week and ran into the same issue. I believe I fixed it by only using the custom cdist function in `compute_anomaly_score` and not in `nearest_neighbors`.

However, if the above solution also works, it's probably a safer bet :).
@jasonvanzelm Thank you very much for your patient reply. When I replace `my_cdist` with the latest version you provided, should I call it with the official arguments for the two distance calculations in `torch_model.py`, or with the arguments modified by @hgaiser?

The official distance calculations in the two places in __torch_model.py__:

1: `distances = torch.cdist(embedding, self.memory_bank, p=2.0)`
2: `distances = torch.cdist(max_patches_features.unsqueeze(1), self.memory_bank[support_samples], p=2.0)`

The corresponding calls provided by @hgaiser:

1: `distances = my_cdist(embedding, self.memory_bank)`
2: `distances = my_cdist(max_patches_features, self.memory_bank[support_samples].squeeze())`

Perhaps I should be more precise about my question:

__1: After I update my my_cdist function to the one you provided, should the arguments I pass in the two places remain the same as the official ones, as follows?__

1: `distances = my_cdist(embedding, self.memory_bank)`
2: `distances = my_cdist(max_patches_features.unsqueeze(1), self.memory_bank[support_samples])`

The arguments passed by the official code and by @hgaiser differ in 2. The arguments passed to the updated `my_cdist` should be the same as the official ones, right?

2: Another question I have is whether this modification might have an impact on PatchCore's performance.

@jasonvanzelm Thank you very much for your patience and explanation!
Hello @hgaiser, thank you very much for your answer. @jasonvanzelm has just provided a new solution that you could also try. I am testing it now and hope the modification works the first time.

Anomalib provides numerous methods, and PatchCore seems to perform well among all the models that only need to be trained once. However, I checked the related material against the original paper and found that Padim using `Wide_ResNet 50` also performs well. Strangely, the model cannot be trained correctly after I set the backbone of Padim to wideResNet 50; you can refer to #1045 for more details.

I wonder if you can provide some ideas for this problem. Thank you very much!
Hey everybody, I am also having the issue above when exporting to ONNX. I've tried

> I tried exporting again last week and ran into the same issue. I believe I fixed it by only using the custom cdist function in `compute_anomaly_score` and not in `nearest_neighbors`. However, if the above solution also works, it's probably a safer bet :).

and can get the model to train and run inference using the PyTorch version of the model just fine. When using the exported ONNX file of the same model on an Nvidia Triton inference server, I still receive the memory error (onnxruntime/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 288601669632.). @hgaiser, @jasonvanzelm, have you found the same with your ONNX exports? Appreciate any help!
@laogonggong847 did you get any of the above combinations to work? I've tried every permutation and the ONNX model still has the memory issue at inference. I can get training to run to completion and the torch model runs through testing fine, but still have issues with the memory error (onnxruntime/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 288601669632.)
@bellenfanttyler one thing you also need to take care when using above modified cdist is to also change arguments when calling the function. Did you adjust that as well?
@blaz-r Appreciate the response! I did change the arguments when trying to get this to work. I was able to get a successful ONNX export today without the memory error during inference. What ended up working was a combination of version changes and using the custom `my_cdist` function:

```python
def my_cdist(x1, x2):
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = x1_norm - 2 * torch.matmul(x1, x2.transpose(-2, -1)) + x2_norm.transpose(-2, -1)
    res = res.clamp_min_(0).sqrt_()
    return res
```

only in `nearest_neighbors` and not in `compute_anomaly_score`. For `nearest_neighbors` I changed the arguments according to what @hgaiser provided in the diff above (`distances = my_cdist(embedding, self.memory_bank)`). I still have to validate the scores/outputs, but it appears to be working. Thanks to everyone above for putting in the legwork to help get a solution!
Describe the bug
Hi all! Has anyone tried to run inference with the PatchCore ONNX model exported from anomalib, for example with onnxruntime? The model is apparently buggy, as it asks for an insane amount of memory (I haven't been able to run it on an 80 GB machine on CPU, for example). The error I keep getting is:
onnxruntime/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 288601669632.
The problematic layer is apparently this sub node. Does anyone have any clue how to fix it, or is there any workaround?
Dataset
Folder
Model
PatchCore
Steps to reproduce the behavior
1 - Install Anomalib
2 - Train a Patchcore model
3 - Try to infer with the ONNX
OS information
OS information:
Expected behavior
exported ONNX model from patchcore is not working.
Screenshots
No response
Pip/GitHub
pip
What version/branch did you use?
0.4.0