Missing torchvision::nms error in the C++ CUDA TorchVision API

🐛 Describe the bug

I'm unable to load my trained MaskRCNN model (the one from the torchvision Python module) in LibTorch. I convert it to TorchScript with torch.jit.script, save it as a .pt file, and finally load it with torch::jit::load from LibTorch:

torch::NoGradGuard no_grad;
model = torch::jit::load(model_path);

Nothing fancy, but I'm getting the following error:

terminate called after throwing an instance of 'torch::jit::ErrorReport'
  what():  
Unknown builtin op: torchvision::nms.
Could not find any similar ops to torchvision::nms. This op may not exist or may not be currently supported in TorchScript.
:
  File "/home/aurelien/Documents/Projects/autotrain-env/lib/python3.8/site-packages/torchvision/ops/boxes.py", line 40
        _log_api_usage_once(nms)
    _assert_has_ops()
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
           ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
Serialized   File "code/__torch__/torchvision/ops/boxes.py", line 154
  _64 = __torch__.torchvision.extension._assert_has_ops
  _65 = _64()
  _66 = ops.torchvision.nms(boxes, scores, iou_threshold)
        ~~~~~~~~~~~~~~~~~~~ <--- HERE
  return _66
'nms' is being compiled since it was called from '_batched_nms_vanilla'
  File "/home/aurelien/Documents/Projects/autotrain-env/lib/python3.8/site-packages/torchvision/ops/boxes.py", line 108
    for class_id in torch.unique(idxs):
        curr_indices = torch.where(idxs == class_id)[0]
        curr_keep_indices = nms(boxes[curr_indices], scores[curr_indices], iou_threshold)
                            ~~~ <--- HERE
        keep_mask[curr_indices[curr_keep_indices]] = True
    keep_indices = torch.where(keep_mask)[0]
Serialized   File "code/__torch__/torchvision/ops/boxes.py", line 83
    _31 = torch.index(boxes, _30)
    _32 = annotate(List[Optional[Tensor]], [curr_indices])
    curr_keep_indices = __torch__.torchvision.ops.boxes.nms(_31, torch.index(scores, _32), iou_threshold, )
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _33 = annotate(List[Optional[Tensor]], [curr_keep_indices])
    _34 = torch.index(curr_indices, _33)
'_batched_nms_vanilla' is being compiled since it was called from 'batched_nms'
Serialized   File "code/__torch__/torchvision/ops/boxes.py", line 35
    idxs: Tensor,
    iou_threshold: float) -> Tensor:
  _9 = __torch__.torchvision.ops.boxes._batched_nms_vanilla
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  _10 = __torch__.torchvision.ops.boxes._batched_nms_coordinate_trick
  _11 = torch.numel(boxes)
'batched_nms' is being compiled since it was called from 'RegionProposalNetwork.filter_proposals'
Serialized   File "code/__torch__/torchvision/models/detection/rpn.py", line 72
    _11 = __torch__.torchvision.ops.boxes.clip_boxes_to_image
    _12 = __torch__.torchvision.ops.boxes.remove_small_boxes
    _13 = __torch__.torchvision.ops.boxes.batched_nms
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    num_images = (torch.size(proposals))[0]
    device = ops.prim.device(proposals)
'RegionProposalNetwork.filter_proposals' is being compiled since it was called from 'RegionProposalNetwork.forward'
  File "/home/aurelien/Documents/Projects/autotrain-env/lib/python3.8/site-packages/torchvision/models/detection/rpn.py", line 353
        proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
        proposals = proposals.view(num_images, -1, 4)
        boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
                        ~~~~~~~~~~~~~~~~~~~~~ <--- HERE

        losses = {}
Serialized   File "code/__torch__/torchvision/models/detection/rpn.py", line 43
    proposals0 = torch.view(proposals, [num_images, -1, 4])
    image_sizes = images.image_sizes
    _8 = (self).filter_proposals(proposals0, objectness0, image_sizes, num_anchors_per_level, )
                                                                       ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    boxes, scores, = _8
    losses = annotate(Dict[str, Tensor], {})

Aborted (core dumped)

I also tried running this project. It works on CPU but not on GPU, where I get this output:

Loading model
Model loaded
[W faster_rcnn.py:107] Warning: RCNN always returns a (Losses, Detections) tuple in scripting (function )
ok
output({}, [{boxes: [ CPUFloatType{0,4} ], labels: [ CPULongType{0} ], scores: [ CPUFloatType{0} ]}, {boxes: [ CPUFloatType{0,4} ], labels: [ CPULongType{0} ], scores: [ CPUFloatType{0} ]}])
terminate called after throwing an instance of 'c10::NotImplementedError'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torchvision/models/detection/rpn.py", line 122, in forward
      lvl1 = torch.index(lvl0, _28)
      nms_thresh = self.nms_thresh
      keep1 = _13(boxes2, scores1, lvl1, nms_thresh, )
              ~~~ <--- HERE
      keep2 = torch.slice(keep1, 0, None, (self).post_nms_top_n())
      _29 = annotate(List[Optional[Tensor]], [keep2])
  File "code/__torch__/torchvision/ops/boxes.py", line 52, in batched_nms
    _16 = _17
  else:
    _18 = _10(boxes, scores, idxs, iou_threshold, )
          ~~~ <--- HERE
    _16 = _18
  return _16
  File "code/__torch__/torchvision/ops/boxes.py", line 109, in _batched_nms_coordinate_trick
    _47 = torch.unsqueeze(torch.slice(offsets), 1)
    boxes_for_nms = torch.add(boxes, _47)
    keep = __torch__.torchvision.ops.boxes.nms(boxes_for_nms, scores, iou_threshold, )
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _42 = keep
  return _42
  File "code/__torch__/torchvision/ops/boxes.py", line 154, in nms
  _64 = __torch__.torchvision.extension._assert_has_ops
  _65 = _64()
  _66 = ops.torchvision.nms(boxes, scores, iou_threshold)
        ~~~~~~~~~~~~~~~~~~~ <--- HERE
  return _66

Traceback of TorchScript, original code (most recent call last):
  File "/home/aurelien/.local/lib/python3.8/site-packages/torchvision/models/detection/rpn.py", line 266, in forward

            # non-maximum suppression, independently done per level
            keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)
                   ~~~~~~~~~~~~~~~~~~~ <--- HERE

            # keep only topk scoring predictions
  File "/home/aurelien/.local/lib/python3.8/site-packages/torchvision/ops/boxes.py", line 74, in batched_nms
        return _batched_nms_vanilla(boxes, scores, idxs, iou_threshold)
    else:
        return _batched_nms_coordinate_trick(boxes, scores, idxs, iou_threshold)
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  File "/home/aurelien/.local/lib/python3.8/site-packages/torchvision/ops/boxes.py", line 40, in nms
        _log_api_usage_once(nms)
    _assert_has_ops()
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
           ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torchvision::nms' is only available for these backends: [CPU, BackendSelect, Python, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, Tracer, AutocastCPU, Autocast, Batched, VmapMode, Functionalize].

CPU: registered at /home/aurelien/Downloads/vision/torchvision/csrc/ops/cpu/nms_kernel.cpp:112 [kernel]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:47 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:18 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:35 [backend fallback]
AutogradCPU: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:39 [backend fallback]
AutogradCUDA: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:47 [backend fallback]
AutogradXLA: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:51 [backend fallback]
AutogradLazy: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:55 [backend fallback]
AutogradXPU: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:43 [backend fallback]
AutogradMLC: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:59 [backend fallback]
AutogradHPU: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:68 [backend fallback]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:293 [backend fallback]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:461 [backend fallback]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:305 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1059 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:52 [backend fallback]

Exception raised from reportError at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:434 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f21dcc040eb in /opt/libtorch/lib/libc10.so)
frame #1: c10::impl::OperatorEntry::reportError(c10::DispatchKey) const + 0xa48 (0x7f215c2e2d98 in /opt/libtorch/lib/libtorch_cpu.so)
frame #2: c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const + 0x44e (0x7f215e9dc1ae in /opt/libtorch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x3527b72 (0x7f215e641b72 in /opt/libtorch/lib/libtorch_cpu.so)
frame #4: torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 0x52 (0x7f215e62fd02 in /opt/libtorch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x3508f4d (0x7f215e622f4d in /opt/libtorch/lib/libtorch_cpu.so)
frame #6: torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) const + 0x194 (0x7f215e2d2da4 in /opt/libtorch/lib/libtorch_cpu.so)
frame #7: torch::jit::Module::forward(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) + 0xc3 (0x555faadc7cb5 in ./test_frcnn_tracing)
frame #8: main + 0x592 (0x555faadc31b0 in ./test_frcnn_tracing)
frame #9: __libc_start_main + 0xf3 (0x7f21222460b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #10: _start + 0x2e (0x555faadc28ee in ./test_frcnn_tracing)

Aborted (core dumped)
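
For reference, the two failures quoted above look different in kind. The first is thrown while loading, with "Unknown builtin op", so no op named torchvision::nms is known to the process at all. The second is thrown while running the model: the op is known, but the listed backends include CPU and not CUDA. Below is a minimal standard-library sketch of telling the two apart from the exception text. The names OpFailure, classify_op_failure, available_backends and has_backend are hypothetical, and the matching relies only on the message wording quoted in this report, which may differ between LibTorch versions.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical labels for the two failures quoted in this report.
enum class OpFailure {
    NotRegistered,   // "Unknown builtin op": the op is unknown when the model is loaded
    BackendMissing,  // "Could not run '...' ... backend": op known, no kernel for that backend
    Other
};

// Classify an exception message by the wording seen in the logs above.
OpFailure classify_op_failure(const std::string& what) {
    if (what.find("Unknown builtin op") != std::string::npos) {
        return OpFailure::NotRegistered;
    }
    if (what.find("Could not run '") != std::string::npos &&
        what.find("backend") != std::string::npos) {
        return OpFailure::BackendMissing;
    }
    return OpFailure::Other;
}

// Extract the names listed in "only available for these backends: [A, B, ...]".
// Returns an empty vector when the message has no such list.
std::vector<std::string> available_backends(const std::string& what) {
    static const std::string marker = "only available for these backends: [";
    std::vector<std::string> names;
    const auto start = what.find(marker);
    if (start == std::string::npos) return names;
    const auto begin = start + marker.size();
    const auto end = what.find(']', begin);
    if (end == std::string::npos) return names;

    std::istringstream list(what.substr(begin, end - begin));
    std::string name;
    while (std::getline(list, name, ',')) {
        // Trim the spaces that follow each comma.
        const auto first = name.find_first_not_of(' ');
        if (first == std::string::npos) continue;
        const auto last = name.find_last_not_of(' ');
        names.push_back(name.substr(first, last - first + 1));
    }
    return names;
}

// Exact-name lookup, so "AutogradCUDA" does not count as "CUDA".
bool has_backend(const std::vector<std::string>& names, const std::string& wanted) {
    return std::find(names.begin(), names.end(), wanted) != names.end();
}
```

In a LibTorch program the input string would typically come from what() on the caught exception. For the second log above, has_backend(available_backends(msg), "CUDA") would be false while the same call with "CPU" would be true, which matches the single kernel registration line reported for nms_kernel.cpp.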

I'm not sure what I'm missing! I've seen other people having this issue, but I have not been able to resolve it.

Versions

LibTorch version 1.11.0, downloaded from the site.
Tested with TorchVision 0.11 and 0.12, built from their branch with cmake -DWITH_CUDA=on -DCMAKE_PREFIX_PATH=/opt/libtorch/share/cmake/Torch ..
Pop-OS 20.04
CUDA 11.3

pytorch/vision issue #5697
