microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai

Non-zero status code returned while running TopK node. (ssdlite320_mobilenet_v3_large) #12669

Open goksinan opened 2 years ago

goksinan commented 2 years ago

Describe the bug I fine-tuned an SSDlite320 MobileNetV3-Large model using the code provided by torchvision. The PyTorch model trained nicely and worked as expected. I also successfully exported the PyTorch model to .onnx. However, I keep getting the following error when I try to run the .onnx model using ORT (ort_session.run(None, {input_name: img})):

[E:onnxruntime:, sequential_executor.cc:368 onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running TopK node. Name:'TopK_1254' Status Message: k argument [4] should not be greater than specified axis dim value [3]

Urgency Needs to be fixed ASAP to be able to run the model using onnxruntime.

System information

To Reproduce This is how I exported the model to .onnx:

import torch
import torchvision
from torchvision.models import MobileNet_V3_Large_Weights

model_name = 'checkpoint.pth'
num_classes = 3
model = torchvision.models.detection.ssdlite320_mobilenet_v3_large(weights=None,
                                                                   weights_backbone=MobileNet_V3_Large_Weights.IMAGENET1K_V1,
                                                                   num_classes=num_classes)
checkpoint = torch.load(model_name, map_location="cpu")
model.load_state_dict(checkpoint["model"])

device = torch.device('cpu')

img = torch.randn((1, 3, 224, 224), device=device)
img.requires_grad = False

with torch.no_grad():
    img = img.to(device)
    model.to(device)
    model.eval()
    torch.onnx.export(model,
                      img,
                      'checkpoint.onnx',
                      verbose=False,
                      do_constant_folding=True,
                      opset_version=12,
                      input_names=['images'],
                      output_names=['boxes', 'labels', 'scores'],
                      )
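
This is roughly how the exported model is then run; the file name and input shape below are assumed from the export code above, and the final line is the call from the description that raises the TopK error:

import numpy as np
import onnxruntime as ort

# Load the exported model and look up the graph's input name.
ort_session = ort.InferenceSession('checkpoint.onnx')
input_name = ort_session.get_inputs()[0].name

# Same shape as the export dummy input; ORT expects a NumPy float32 array.
img = np.random.randn(1, 3, 224, 224).astype(np.float32)

# This call is where the TopK error is raised.
outputs = ort_session.run(None, {input_name: img})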

Additional context I know that detection models sometimes need an actual input image (rather than random input) to be serialized properly. However, the behavior did not change regardless of the input I used during export.
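
Concretely, exporting with a real image looks roughly like this (sample.jpg is a placeholder path; the model loaded above is reused, and torchvision detection models take 0-1 RGB tensors), and as noted above it made no difference:

from PIL import Image
import torch
from torchvision import transforms

# Load a real photo instead of random noise; the path is just a placeholder.
image = Image.open('sample.jpg').convert('RGB')
img = transforms.ToTensor()(image).unsqueeze(0)  # (1, 3, H, W), values in [0, 1]

with torch.no_grad():
    model.eval()
    torch.onnx.export(model,
                      img,
                      'checkpoint.onnx',
                      do_constant_folding=True,
                      opset_version=12,
                      input_names=['images'],
                      output_names=['boxes', 'labels', 'scores'])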

RandySheriffH commented 2 years ago

One possible fix would be to change the line here so that when K > shape[dim], shape[dim] is used as the default.

skottmckay commented 2 years ago

The ONNX spec doesn't say that's allowed though. It specifies the output shape as having 'k' for the selected axis.

Is it possible that the '3' in the error is due to num_classes being 3 and the model is assuming that there will be 4 or more classes? The default value of num_classes in that pytorch model is 91.

goksinan commented 2 years ago

Those numbers in the error message are not fixed. If I use another checkpoint (let's say model_epoch20 rather than model_epoch50), the error message remains, but numbers may change, as in:

Name:'TopK_1246' Status Message: k argument [15] should not be greater than specified axis dim value [5]

Another observation that might be useful is that if I just create the model with 3 classes but don't load my weights, the error disappears. So, I get the error when I try to use my own saved model.
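
One way to see where those numbers come from is to inspect the TopK nodes in the exported graph and check whether their k input was baked in as a constant during tracing. A rough sketch with the onnx Python API (model file name assumed):

import onnx
from onnx import numpy_helper

model = onnx.load('checkpoint.onnx')

# Map initializer names to their values so constant k inputs can be printed.
initializers = {init.name: numpy_helper.to_array(init) for init in model.graph.initializer}

for node in model.graph.node:
    if node.op_type == 'TopK':
        # Since opset 10, k is the node's second input rather than an attribute.
        k_name = node.input[1] if len(node.input) > 1 else None
        print(node.name, 'k input:', k_name, 'constant value:', initializers.get(k_name))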

mohamin8 commented 2 years ago

Did you solve it? I have the same problem

SaverioFrancesco commented 1 year ago

I'm also interested in this. I have a similar error regarding this exact model (ssdlite320_mobilenet_v3_large):

RuntimeException [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running TopK node. Name:'TopK_6101' Status Message: /onnxruntime_src/onnxruntime/core/providers/cuda/math/topk.cc:64 onnxruntime::common::Status onnxruntime::cuda::TopK::ComputeInternal(onnxruntime::OpKernelContext*) const [with bool inputk = true] K_ >= 0 && K_ <= tensor_X->Shape().GetDims()[axis] was false.

Is this related?

gloomyfish1998 commented 1 year ago

I got the same issue as well.

gloomyfish1998 commented 1 year ago

RUNTIME_EXCEPTION : Non-zero status code returned while running TopK node. Name:'TopK_1683'

skottmckay commented 1 year ago

Possibly try a newer version of torchvision. There was a fix checked into torchvision to address this and export the ONNX model correctly.

https://github.com/pytorch/vision/pull/5310
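
For anyone stuck on an older torchvision: the gist of that fix is to clamp k against the axis size with tensor ops, so the clamp is recorded in the traced graph instead of being baked in as a constant. A rough, untested sketch of the idea (helper names here are mine, not torchvision's; the actual PR also covers scripting and other call sites):

import torch
from torch import Tensor

@torch.jit.unused
def _tensor_as_int(v: Tensor) -> int:
    # Lets topk accept the traced tensor as its k argument during ONNX export
    # (the same trick the torchvision fix relies on).
    return v  # type: ignore[return-value]

def topk_min(x: Tensor, orig_k: int, axis: int) -> int:
    # In eager mode a plain Python min is enough.
    if not torch.jit.is_scripting() and not torch.jit.is_tracing():
        return min(orig_k, x.size(axis))
    # While tracing for export, compute the min with tensor ops so it becomes
    # part of the graph rather than a fixed constant.
    axis_dim = torch._shape_as_tensor(x)[axis].unsqueeze(0)
    k_t = torch.tensor([orig_k], dtype=axis_dim.dtype)
    return _tensor_as_int(torch.min(torch.cat((k_t, axis_dim), 0)))

The detection head's postprocessing would then call scores.topk(topk_min(scores, k, -1)) instead of scores.topk(k).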

Zalways commented 9 months ago

Have you solved this problem? I met the same issue: Non-zero status code returned while running TopK node. Name:'/model/TopK' Status Message: CUDA error cudaErrorInvalidConfiguration: invalid configuration argument

I'd appreciate it if you could help me with my problem! @skottmckay @RandySheriffH