pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

[JIT] Not supported for maskrcnn_resnet50_fpn #1002

Closed rbrigden closed 4 years ago

rbrigden commented 5 years ago

I am trying to accelerate the maskrcnn_resnet50_fpn pretrained model using JIT tracing provided by pytorch. It appears that some operations present in this model are not supported by pytorch JIT.

Are these models supposed to have JIT support officially? If not, would you be able to provide advice for a workaround?

To replicate, running:

import torch
import torchvision
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()
traced_net = torch.jit.trace(model, torch.rand(1, 3, 800, 800))

produces

RuntimeError: log2_vml_cpu not implemented for 'Long'

Thank you.

soumith commented 5 years ago

this actually looks like a bug in `scale = 2 ** torch.tensor(approx_scale).log2().round().item()` in torchvision/ops/poolers.py.

If approx_scale here is an exact integer, the tensor will be a LongTensor, which is unexpected.

That should be changed to `torch.tensor(approx_scale, dtype=torch.float32)`.
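
For illustration, a minimal sketch of the failure mode and the suggested fix (the value of approx_scale here is made up):

import torch

approx_scale = 4  # an exact integer: torch.tensor() infers a LongTensor
# torch.tensor(approx_scale).log2()  # raised "log2_vml_cpu not implemented for 'Long'" at the time

# forcing a float dtype sidesteps the inference
scale = 2 ** torch.tensor(approx_scale, dtype=torch.float32).log2().round().item()
print(scale)  # 4.0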

fmassa commented 5 years ago

@rbrigden as mentioned in the release notes, the detection models do not yet support JIT, in particular because we use custom ops that are not registered with TorchScript.

We plan to add full JIT support for the detection models in follow-up releases.

fmassa commented 5 years ago

And @soumith, good catch on the location of the error. But this looks like a tracing problem, because in https://github.com/pytorch/vision/blob/aa32c9376c46eb284f2b091f3eb98aec4fd64b03/torchvision/ops/poolers.py#L100 we force approx_scale to be a float, so the JIT should take that into account. A workaround could be to explicitly force a dtype in torch.tensor, as you mentioned.

lzp0916 commented 5 years ago

@fmassa Dear fmassa, when will the detection models support JIT? Thank you.

fmassa commented 5 years ago

@lzp0916 A first PyTorch PR that would enable us to start making the model TorchScript friendly has just been sent to PyTorch https://github.com/pytorch/pytorch/pull/22582

But I'd say it will still take a few months to get the detection models to support TorchScript.

cc @fbbradheintz

XushengLee commented 5 years ago

@soumith @fmassa I changed the code to torch.tensor(approx_scale, dtype=torch.float32) in torchvision/ops/poolers.py as soumith suggested. That fixed the error, but another one came up; I think TorchScript does not support Mask R-CNN's output format. Here is the log:

RuntimeError: Only tensors or tuples of tensors can be output from traced functions (getNestedOutputTrace at /opt/conda/conda-bld/pytorch_1556653099582/work/torch/csrc/jit/tracer.cpp:200)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f7bb5b1adc5 in /home/lxs/anaconda3/envs/torchscript/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: torch::jit::tracer::getNestedOutputTrace(std::shared_ptr const&, c10::IValue const&) + 0x23e (0x7f7bb39d5cee in /home/lxs/anaconda3/envs/torchscript/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #2: torch::jit::tracer::exit(std::vector<c10::IValue, std::allocator > const&) + 0x2f (0x7f7bb39d5dbf in /home/lxs/anaconda3/envs/torchscript/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #3: + 0x447ab3 (0x7f7be4e3eab3 in /home/lxs/anaconda3/envs/torchscript/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x45a8b4 (0x7f7be4e518b4 in /home/lxs/anaconda3/envs/torchscript/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x12ce4a (0x7f7be4b23e4a in /home/lxs/anaconda3/envs/torchscript/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #20: __libc_start_main + 0xe7 (0x7f7bf41f0b97 in /lib/x86_64-linux-gnu/libc.so.6)

It seems too hard for me to work around. torchvision.models.detection is such great work, it makes my code a lot easier. I hope this problem can be fixed soon :)

fmassa commented 5 years ago

@XushengLee adding support for TorchScript for all models in torchvision is in the plans, but it will still take a few months before we are there.

rmzr7 commented 5 years ago

@XushengLee you can fix the second error by changing how the outputs of inference are returned: instead of putting the tensors into a dictionary, pass them directly.
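
A rough sketch of that idea (this wrapper is hypothetical, not code from the thread, and tracing may still fail elsewhere inside the model):

import torch

class TraceFriendlyMaskRCNN(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, images):
        # in eval mode the model returns a list with one dict per image;
        # unpack the first dict into a plain tuple of tensors
        out = self.model(images)[0]
        return out['boxes'], out['labels'], out['scores'], out['masks']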

XushengLee commented 5 years ago

@remzr7 Thank you for your help. I tried that, and it solved the output problem, but there is another error, and the logging is not as clear as before. I think it concerns the input format: the Mask R-CNN in torchvision.models.detection takes a list of channel-first image tensors, at least during evaluation, not a typical 4-D batch tensor.

# this snippet is from engine.py of torchvision.models.detection
for images, targets in metric_logger.log_every(data_loader, print_freq, header):
    images = list(image.to(device) for image in images)
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
    loss_dict = model(images, targets)
    losses = sum(loss for loss in loss_dict.values())

rmzr7 commented 5 years ago

Oh yes, I think you can also disable the GeneralizedRCNNTransform that the underlying GeneralizedRCNN class applies, and instead perform the transformations (e.g. resize/to_tensor) yourself before you call model.forward()
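
As a rough sketch, the external preprocessing could look like this (min_size, mean and std below mirror the GeneralizedRCNNTransform defaults, but verify them for your version; the function name is ours):

import torch
import torch.nn.functional as F

def preprocess(img, min_size=800,
               mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    # img: float tensor of shape [C, H, W] with values in [0, 1]
    img = (img - torch.tensor(mean)[:, None, None]) / torch.tensor(std)[:, None, None]
    # resize so the smaller side becomes min_size, keeping the aspect ratio
    scale = min_size / min(img.shape[-2:])
    return F.interpolate(img[None], scale_factor=scale, mode='bilinear',
                         align_corners=False)[0]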


cted18 commented 5 years ago

@remzr7 It doesn't seem that simple. I took the transforms in GeneralizedRCNN outside and changed the output of GeneralizedRCNN to tuples instead of a dict.

Now, it seems that I would have to change the outputs of all modules recursively, i.e.,

  1. IntermediateLayerGetter(..) returns an OrderedDict
  2. FeaturePyramidNetwork(..) returns an OrderedDict
  3. BackboneWithFPN(..) returns an OrderedDict, and so on.

I changed the outputs of all of them to tuples of tensors, except for IntermediateLayerGetter(..). I have not been able to get around IntermediateLayerGetter(..) by changing the OrderedDict structure it uses, because torchscript at this point cannot deal with OrderedDict outputs.

@soumith @fmassa since OrderedDict outputs are used everywhere in detection, maybe it would be easier to add torchscript support for returning OrderedDicts? Is there a quick workaround for this problem?
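
For reference, the kind of per-module shim being described might look like this (a hypothetical helper, named here for illustration):

import torch

class DictToTuple(torch.nn.Module):
    # wrap an OrderedDict-returning module (e.g. the FPN) so it
    # emits a plain tuple of tensors, which tracing can handle
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, x):
        return tuple(self.module(x).values())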

fmassa commented 5 years ago

@cted18 yes, OrderedDict support in torchscript is something that should be added.

And we are starting to work on making maskrcnn_resnet50_fpn torchscriptable / traceable; a first PR in this series has been sent in https://github.com/pytorch/vision/pull/1267

cc @eellison for OrderedDict support in torchscript

eellison commented 5 years ago

@cted18 Yes, I'll be working on adding OrderedDict support for fcn_resnet101. I think that, together with the op support added in https://github.com/pytorch/vision/pull/1267, it shouldn't be too hard to support in script.

lzp0916 commented 5 years ago

@fmassa Dear fmassa, when using torch.jit.trace I encounter the following error:

"RuntimeError: Tried to trace <torch.torchvision.ops.misc.FrozenBatchNorm2d object at 0000029EB0B365E0> but it is not part of the active trace. Modules that are called during a trace must be registered as submodules of the thing being traced."

How can I solve this problem?

OS: Windows
pytorch: 1.3.0.dev20190920
torchvision: 0.5.0.dev20190924
model: fasterrcnn_resnet50_fpn

fmassa commented 5 years ago

@lzp0916 this error will be solved when https://github.com/pytorch/vision/pull/1329 is merged

hhbyyh commented 4 years ago

The issue is critical for putting the model into a production system. Thanks for working on this.

creotiv commented 4 years ago

2 ** torch.tensor(approx_scale).log2().round()

can someone explain why, if approx_scale < 1, it doesn't get rounded to an integer here? Is this some hack or normal behavior?

fmassa commented 4 years ago

@creotiv it's an approximation that avoids us having to manually specify the downscaling for layer n.
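
Concretely, the expression snaps a measured ratio to the nearest power of two, e.g.:

import torch

approx_scale = 0.23  # illustrative feature-map/input size ratio
scale = 2 ** torch.tensor(approx_scale, dtype=torch.float32).log2().round().item()
print(scale)  # 0.25, the nearest power of two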

creotiv commented 4 years ago

@fmassa No, I understand that. I mean, why doesn't round() round 0.123, for example, to zero (only after the log function)? I don't see anything like that in the docs https://pytorch.org/docs/stable/torch.html?highlight=round#torch.round, and it looks like a bug.

creotiv commented 4 years ago

And also torch.log2(2**torch.tensor(0.123, dtype=torch.float64)).round() returns 0.

fmassa commented 4 years ago

@creotiv FYI this is unrelated to the issue (which is that maskrcnn_resnet50_fpn is not yet scriptable), but I don't understand your point.

Can you open a new issue describing with an example what you think is the problem?

creotiv commented 4 years ago

@fmassa Already opened: https://github.com/pytorch/pytorch/issues/28284

gemmit commented 4 years ago

RuntimeError: Only tensors or tuples of tensors can be output from traced functions

@XushengLee how did you get rid of the error "RuntimeError: Only tensors or tuples of tensors can be output from traced functions"? I am currently having the same issue when trying to trace the Mask R-CNN model from torchvision with the following script:

import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()
test_data = torch.rand(1, 3, 480, 640)
traced_model = torch.jit.trace(model, test_data)

fmassa commented 4 years ago

@gemmit support for tracing / scripting maskrcnn is coming soon, check https://github.com/pytorch/vision/pull/1407 and https://github.com/pytorch/vision/pull/1461

gemmit commented 4 years ago

@fmassa okay, thanks for the info. Will check the links

fmassa commented 4 years ago

@gemmit ~tracing should already be supported for maskrcnn~. Using torch.jit.script will be supported in the coming weeks

@lara-hdr I've just tried tracing maskrcnn, and I got an error

import torch, torchvision
m = torchvision.models.detection.maskrcnn_resnet50_fpn()
m.eval()

traced_model = torch.jit.trace(m, [[torch.rand(3, 300, 300)]])

I get the following error

RuntimeError: Only tensors or tuples of tensors can be output from traced functions (getOutput at /Users/distiller/project/conda/conda-bld/pytorch_1572429967983/work/torch/csrc/jit/tracer.cpp:211)
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 135 (0x112b608b7 in libc10.dylib)
frame #1: torch::jit::tracer::TracingState::getOutput(c10::IValue const&) + 1593 (0x11b1d8549 in libtorch.dylib)
frame #2: torch::jit::tracer::trace(std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >, std::__1::function<std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> > (std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >)> const&, std::__1::function<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > (torch::autograd::Variable const&)>, bool, torch::jit::script::Module*) + 1792 (0x11b1d90b0 in libtorch.dylib)
frame #3: torch::jit::tracer::createGraphByTracing(pybind11::function const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue> >, pybind11::function const&, bool, torch::jit::script::Module*) + 361 (0x1121829b9 in libtorch_python.dylib)
frame #4: void pybind11::cpp_function::initialize<torch::jit::script::initJitScriptBindings(_object*)::$_16, void, torch::jit::script::Module&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, pybind11::function, pybind11::tuple, pybind11::function, bool, pybind11::name, pybind11::is_method, pybind11::sibling>(torch::jit::script::initJitScriptBindings(_object*)::$_16&&, void (*)(torch::jit::script::Module&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, pybind11::function, pybind11::tuple, pybind11::function, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(pybind11::detail::function_call&)::__invoke(pybind11::detail::function_call&) + 319 (0x1121bd20f in libtorch_python.dylib)
frame #5: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 3324 (0x111c9f3fc in libtorch_python.dylib)
<omitting python frames>
frame #61: start + 1 (0x7fff6fa6d3d5 in libdyld.dylib)
frame #62: 0x0 + 2 (0x2 in ???)

I just now realized that ONNX export does not call into torch.jit.trace, but torch.jit.get_trace_graph. Hmm, this is unfortunate :-/

stereomatchingkiss commented 4 years ago

@XushengLee adding support for TorchScript for all models in torchvision is in the plans, but it will still take a few months before we are there.

Any progress on supporting the detection models with JIT? Thanks

fmassa commented 4 years ago

@stereomatchingkiss Yes, it's almost ready, just need to fix some unrelated ONNX issues and it will be merged this week

stereomatchingkiss commented 4 years ago

@stereomatchingkiss Yes, it's almost ready, just need to fix some unrelated ONNX issues and it will be merged this week

Thanks, glad to hear that. Could we convert the model to ONNX format after this is merged?

fmassa commented 4 years ago

@stereomatchingkiss ONNX and JIT support for Mask R-CNN in torchvision has been merged into master, and is available if you compile from source.

cted18 commented 4 years ago

I still cannot trace the Maskrcnn model from the latest branch.

I get this error out of the box:

scale = 2 ** float(torch.tensor(approx_scale).log2().round())
RuntimeError: log2_vml_cpu not implemented for 'Long'

Then I made the changes suggested by @soumith:

this actually looks like a bug in scale = 2 ** torch.tensor(approx_scale).log2().round().item() in torchvision/ops/poolers.py.

If approx_scale here is an exact integer, the tensor will be a LongTensor, which is unexpected.

That should be changed to torch.tensor(approx_scale, dtype=torch.float32)

Now I have this:

File "/../python3.6/site-packages/torchvision-0.5.0a0+5b1716a-py3.6-linux-x86_64.egg/torchvision/ops/poolers.py", line 164, in setup_scales self.map_levels = initLevelMapper(int(lvl_min), int(lvl_max)) OverflowError: cannot convert float infinity to integer

fmassa commented 4 years ago

@cted18 can you print torchvision.__version__? I suspect you are on an old version.

cted18 commented 4 years ago

Sure.

torchvision.__version__
'0.5.0a0+5b1716a'

I just built it from the master.

fmassa commented 4 years ago

@cted18 can you share a script that reproduces the error you have?

cted18 commented 4 years ago


Yes. It is the exact same script as from @rbrigden

Ubuntu 16.04
Python 3.6.7
torch.__version__ '1.3.0a0+de394b6'
torchvision.__version__ '0.5.0a0+cec7ea7'

fmassa commented 4 years ago

@cted18 this should be fixed when https://github.com/pytorch/vision/pull/1639 gets merged.

stereomatchingkiss commented 4 years ago

Still cannot convert fasterrcnn_resnet50_fpn

Version (print(torchvision.__version__)):

0.5.0.dev20191206

Codes:

import torch
import torchvision

print(torchvision.__version__)

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(True)
model.eval()
example = torch.rand(1, 3, 300, 400)
traced_script_module = torch.jit.trace(model, example)

Error messages:

RuntimeWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  'incorrect results).', category=RuntimeWarning)
Traceback (most recent call last):
  File "pytorch_conversion.py", line 14, in <module>
    traced_script_module = torch.jit.trace(model, example)
  File "C:\Users\yyyy\Anaconda3\envs\pytorch_preview\lib\site-packages\torch\jit\__init__.py", line 877, in trace
    check_tolerance, _force_outplace, _module_class)
  File "C:\Users\yyyy\Anaconda3\envs\pytorch_preview\lib\site-packages\torch\jit\__init__.py", line 1029, in trace_module
    module._c._create_method_from_trace(method_name, func, example_inputs, var_lookup_fn, _force_outplace)
RuntimeError: Only tensors or tuples of tensors can be output from traced functions (getOutput at ..\torch\csrc\jit\tracer.cpp:212)
(no backtrace available)

OS: Windows 10 64-bit, installed via anaconda:

conda create --name pytorch_n python=3.7
conda activate pytorch_n
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch-nightly -c defaults -c conda-forge

Models I need:

keypointrcnn_resnet50_fpn, fasterrcnn_resnet50_fpn

fmassa commented 4 years ago

@stereomatchingkiss use torch.jit.script instead of torch.jit.trace, and it should work.

model = torch.jit.script(model)
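
For example (assuming a torchvision build recent enough that the detection models are scriptable), the scripted module can then be saved for later use, including from the C++ API:

import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
scripted = torch.jit.script(model)
scripted.save("fasterrcnn.pt")  # load later with torch.jit.load / torch::jit::load
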
stereomatchingkiss commented 4 years ago

@stereomatchingkiss use torch.jit.script instead of torch.jit.trace, and it should work.

model = torch.jit.script(model)

Thanks, this works, but it fails to load the fasterrcnn_resnet50_fpn model with the C++ API.

OS: Ubuntu 18.04.3 LTS 64-bit
libtorch: nightly (2019/12/07)

main.cpp

#include <torch/script.h>

#include <iostream>
#include <memory>

int main(int argc, const char* argv[])
{
    if(argc != 2){
        std::cerr << "usage: example-app <path-to-exported-script-module>\n";
        return -1;
    }

    torch::jit::script::Module module;
    try {
        // Deserialize the ScriptModule from a file using torch::jit::load().
        module = torch::jit::load(argv[1]);
    }
    catch (const c10::Error& e) {
        std::cerr << "error loading the model\n";
        return -1;
    }

    std::cout << "ok\n";
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.5)

project(pytorch_test LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

find_package(Torch REQUIRED)

add_executable(pytorch_test main.cpp)
target_link_libraries(pytorch_test "${TORCH_LIBRARIES}")
set_property(TARGET pytorch_test PROPERTY CXX_STANDARD 14)

Error message:

terminate called after throwing an instance of 'torch::jit::script::ErrorReport'
  what():  
Unknown builtin op: torchvision::_new_empty_tensor_op.
Could not find any similar ops to torchvision::_new_empty_tensor_op. This op may not exist or may not be currently supported in TorchScript.
:
  File "C:\Users\yyyy\Anaconda3\envs\pytorch_preview\lib\site-packages\torchvision\ops\new_empty_tensor.py", line 16
        output (Tensor)
    """
    return torch.ops.torchvision._new_empty_tensor_op(x, shape)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
Serialized   File "code/__torch__/torchvision/ops/new_empty_tensor.py", line 4
def _new_empty_tensor(x: Tensor,
    shape: List[int]) -> Tensor:
  _0 = ops.torchvision._new_empty_tensor_op(x, shape)
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  return _0
'_new_empty_tensor' is being compiled since it was called from 'interpolate'
Serialized   File "code/__torch__/torchvision/ops/misc.py", line 25
    align_corners: Optional[bool]=None) -> Tensor:
  _1 = __torch__.torchvision.ops.misc._output_size
  _2 = __torch__.torchvision.ops.new_empty_tensor._new_empty_tensor
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  _3 = uninitialized(Tensor)
  if torch.gt(torch.numel(input), 0):
'interpolate' is being compiled since it was called from 'GeneralizedRCNNTransform.resize'
Serialized   File "code/__torch__/torchvision/models/detection/transform.py", line 79
    target: Optional[Dict[str, Tensor]]) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
    _18 = __torch__.torchvision.models.detection.transform.resize_boxes
    _19 = __torch__.torchvision.ops.misc.interpolate
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _20 = __torch__.torchvision.models.detection.transform.resize_keypoints
    _21 = uninitialized(Tuple[Tensor, Optional[Dict[str, Tensor]]])
'GeneralizedRCNNTransform.resize' is being compiled since it was called from 'GeneralizedRCNNTransform.forward'
  File "C:\Users\yyyy\Anaconda3\envs\pytorch_preview\lib\site-packages\torchvision\models\detection\transform.py", line 47
                                 "of shape [C, H, W], got {}".format(image.shape))
            image = self.normalize(image)
            image, target_index = self.resize(image, target_index)
                                  ~~~~~~~~~~~ <--- HERE
            images[i] = image
            if targets is not None and target_index is not None:
Serialized   File "code/__torch__/torchvision/models/detection/transform.py", line 29
        pass
      image0 = (self).normalize(image, )
      _2 = (self).resize(image0, target_index, )
                                 ~~~~~~~~~~~~ <--- HERE
      image1, target_index0, = _2
      _3 = torch._set_item(images0, i, image1)

Aborted (core dumped)

Edit: I downloaded the cpp package (CPU only) about one hour ago.

stereomatchingkiss commented 4 years ago


I found a solution in issue #1407, but I have another question: how could I know which ops I need to register? Or should I not worry about this part because in the future end users won't need to register these ops themselves? Thanks

static auto registry =
        torch::RegisterOperators()
                .op("torchvision::nms", &nms)
                .op("torchvision::roi_align(Tensor input, Tensor rois, float spatial_scale, int pooled_height, int pooled_width, int sampling_ratio) -> Tensor",
                    &roi_align)
                .op("torchvision::roi_pool", &roi_pool)
                .op("torchvision::_new_empty_tensor_op", &new_empty_tensor)
                .op("torchvision::ps_roi_align", &ps_roi_align)
                .op("torchvision::ps_roi_pool", &ps_roi_pool);
fmassa commented 4 years ago

@stereomatchingkiss

how could I know which ops I need to register?

that's a good question. I don't yet have a good answer for that; I'll discuss with @eellison to see if we can find a good solution to it.

stereomatchingkiss commented 4 years ago

@stereomatchingkiss

how could I know which ops I need to register?

that's a good question. I don't yet have a good answer for that, I'll discuss with @eellison to see if we can find a good solution to it

When I copied the code, I ran into another question: where can I find the following headers?

#include "torchvision/PSROIAlign.h"
#include "torchvision/PSROIPool.h"
#include "torchvision/ROIAlign.h"
#include "torchvision/ROIPool.h"
#include "torchvision/empty_tensor_op.h"
#include "torchvision/nms.h"

Are they generated when I compile from source?

stereomatchingkiss commented 4 years ago


I checked issue #1407 again; it looks like I need to change the makefile and compile it myself in order to generate the files. Any good news on using the models through the C++ API?

fmassa commented 4 years ago

@stereomatchingkiss

Any good news on using the models through the C++ API?

We will be improving the experience of using the torchvision models with the C++ API over time. We have just enabled Mask R-CNN models to be torchscripted, and will keep refining the C++ export.

cted18 commented 4 years ago

@fmassa I can script Mask R-CNN parts and load them in C++ using this:

model = models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()
backbone_script = torch.jit.script(model.backbone)

but when I add a wrapper around the attributes (e.g. the backbone) and load it in C++, it cannot find the torchvision operators. Why might this happen?

class BackboneWrapper(torch.nn.Module):
    def __init__(self, model):
        super(BackboneWrapper, self).__init__()
        self.transform = model.transform
        self.backbone = model.backbone

    def forward(self, images, targets=None):
        # type: (List[Tensor], Optional[List[Dict[str, Tensor]]]) -> Dict[str, Dict[str, Tensor]]
        images, _ = self.transform(images, targets)
        features = self.backbone(images.tensors)
        return {'features': features}

Error:

Unknown builtin op: torchvision::_new_empty_tensor_op.
Could not find any similar ops to torchvision::_new_empty_tensor_op. This op may not exist or may not be currently supported in TorchScript.
: torchvision-0.4.2-py3.6-linux-x86_64.egg/torchvision/ops/new_empty_tensor.py", line 16
        output (Tensor)
    """
    return torch.ops.torchvision._new_empty_tensor_op(x, shape)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
Serialized   File "code/__torch__/torchvision/ops/new_empty_tensor.py", line 4

fmassa commented 4 years ago

@cted18 I believe the solution you are looking for can be found in https://github.com/pytorch/vision/issues/1002#issuecomment-562915463 and https://github.com/pytorch/vision/pull/1407#issuecomment-563048240

If you are still facing issues, can you open a new issue with a full reproducible example of the problem?

cted18 commented 4 years ago

@fmassa I used the comments from #1407, but the problem still exists. I opened a new issue: #1730. Thanks

WaterKnight1998 commented 4 years ago

@cted18 I believe the solution you are looking for can be found in #1002 (comment) and #1407 (comment)

If you are still facing issues, can you open a new issue with a full reproducible example of the problem?

@fmassa How could I get a TorchScript version of torchvision.models.detection.maskrcnn_resnet50_fpn?

torch.jit.script and torch.jit.trace are not working with this model.

With torch.jit.script

model = torch.load(modelname+"-best.pth")
model=model.cuda()
model.eval()
print(img)
with torch.no_grad():
    print(model(img))
    traced_cell = torch.jit.script(model, (img))
torch.jit.save(traced_cell, modelname+"-torchscript.pth")

loaded_trace = torch.jit.load(modelname+"-torchscript.pth")
loaded_trace.eval()
with torch.no_grad():
    print(loaded_trace(img))

TensorMask(torch.argmax(loaded_trace(img),1)).show()

Output:

TensorImage([[[[0.8961, 0.9132, 0.8789,  ..., 0.2453, 0.1939, 0.2282],
          [0.8276, 0.9132, 0.8618,  ..., 0.2282, 0.1939, 0.2282],
          [0.8961, 0.9132, 0.8789,  ..., 0.2282, 0.2282, 0.2453],
          ...,
          [0.8961, 0.8618, 0.9132,  ..., 0.4508, 0.4166, 0.3994],
          [0.9303, 0.9132, 0.9474,  ..., 0.4166, 0.4166, 0.4508],
          [0.9646, 0.8789, 0.9303,  ..., 0.3994, 0.3994, 0.3994]],

         [[1.0455, 1.0630, 1.0280,  ..., 0.3803, 0.3277, 0.3627],
          [0.9755, 1.0630, 1.0105,  ..., 0.3627, 0.3277, 0.3627],
          [1.0455, 1.0630, 1.0280,  ..., 0.3627, 0.3627, 0.3803],
          ...,
          [1.0455, 1.0105, 1.0630,  ..., 0.5903, 0.5553, 0.5378],
          [1.0805, 1.0630, 1.0980,  ..., 0.5553, 0.5553, 0.5903],
          [1.1155, 1.0280, 1.0805,  ..., 0.5378, 0.5378, 0.5378]],

         [[1.2631, 1.2805, 1.2457,  ..., 0.6008, 0.5485, 0.5834],
          [1.1934, 1.2805, 1.2282,  ..., 0.5834, 0.5485, 0.5834],
          [1.2631, 1.2805, 1.2457,  ..., 0.5834, 0.5834, 0.6008],
          ...,
          [1.2631, 1.2282, 1.2805,  ..., 0.8099, 0.7751, 0.7576],
          [1.2980, 1.2805, 1.3154,  ..., 0.7751, 0.7751, 0.8099],
          [1.3328, 1.2457, 1.2980,  ..., 0.7576, 0.7576, 0.7576]]]],
       device='cuda:0')
[{'boxes': tensor([[412.5222, 492.3208, 619.7662, 620.9233]], device='cuda:0'), 'labels': tensor([1], device='cuda:0'), 'scores': tensor([0.1527], device='cuda:0'), 'masks': tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]]], device='cuda:0')}]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-23-7216a0dac5a0> in <module>
     12 loaded_trace.eval()
     13 with torch.no_grad():
---> 14     print(loaded_trace(img))
     15 
     16 TensorMask(torch.argmax(loaded_trace(img),1)).show()

~/anaconda3/envs/pro1/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    556             result = self._slow_forward(*input, **kwargs)
    557         else:
--> 558             result = self.forward(*input, **kwargs)
    559         for hook in self._forward_hooks.values():
    560             hook_result = hook(self, input, result)

RuntimeError: forward() Expected a value of type 'List[Tensor]' for argument 'images' but instead found type 'TensorImage'.
Position: 1
Value: TensorImage([[[[0.8961, 0.9132, 0.8789,  ..., 0.2453, 0.1939, 0.2282],
          [0.8276, 0.9132, 0.8618,  ..., 0.2282, 0.1939, 0.2282],
          [0.8961, 0.9132, 0.8789,  ..., 0.2282, 0.2282, 0.2453],
          ...,
          [0.8961, 0.8618, 0.9132,  ..., 0.4508, 0.4166, 0.3994],
          [0.9303, 0.9132, 0.9474,  ..., 0.4166, 0.4166, 0.4508],
          [0.9646, 0.8789, 0.9303,  ..., 0.3994, 0.3994, 0.3994]],

         [[1.0455, 1.0630, 1.0280,  ..., 0.3803, 0.3277, 0.3627],
          [0.9755, 1.0630, 1.0105,  ..., 0.3627, 0.3277, 0.3627],
          [1.0455, 1.0630, 1.0280,  ..., 0.3627, 0.3627, 0.3803],
          ...,
          [1.0455, 1.0105, 1.0630,  ..., 0.5903, 0.5553, 0.5378],
          [1.0805, 1.0630, 1.0980,  ..., 0.5553, 0.5553, 0.5903],
          [1.1155, 1.0280, 1.0805,  ..., 0.5378, 0.5378, 0.5378]],

         [[1.2631, 1.2805, 1.2457,  ..., 0.6008, 0.5485, 0.5834],
          [1.1934, 1.2805, 1.2282,  ..., 0.5834, 0.5485, 0.5834],
          [1.2631, 1.2805, 1.2457,  ..., 0.5834, 0.5834, 0.6008],
          ...,
          [1.2631, 1.2282, 1.2805,  ..., 0.8099, 0.7751, 0.7576],
          [1.2980, 1.2805, 1.3154,  ..., 0.7751, 0.7751, 0.8099],
          [1.3328, 1.2457, 1.2980,  ..., 0.7576, 0.7576, 0.7576]]]],
       device='cuda:0')
Declaration: forward(__torch__.torchvision.models.detection.mask_rcnn.___torch_mangle_1723.MaskRCNN self, Tensor[] images, Dict(str, Tensor)[]? targets=None) -> ((Dict(str, Tensor), Dict(str, Tensor)[]))
Cast error details: Unable to cast Python instance to C++ type (compile in debug mode for details)

With torch.jit.trace

modelname="maskrcnn"
model = torch.load(modelname+"-best.pth")
model=model.cuda()
model.eval()
print(img)
with torch.no_grad():
    print(model(img))
    traced_cell = torch.jit.trace(model, (img))
torch.jit.save(traced_cell, modelname+"-torchscript.pth")

loaded_trace = torch.jit.load(modelname+"-torchscript.pth")
loaded_trace.eval()
with torch.no_grad():
    print(loaded_trace(img))

TensorMask(torch.argmax(loaded_trace(img),1)).show()

Output

TensorImage([[[[0.8961, 0.9132, 0.8789,  ..., 0.2453, 0.1939, 0.2282],
          [0.8276, 0.9132, 0.8618,  ..., 0.2282, 0.1939, 0.2282],
          [0.8961, 0.9132, 0.8789,  ..., 0.2282, 0.2282, 0.2453],
          ...,
          [0.8961, 0.8618, 0.9132,  ..., 0.4508, 0.4166, 0.3994],
          [0.9303, 0.9132, 0.9474,  ..., 0.4166, 0.4166, 0.4508],
          [0.9646, 0.8789, 0.9303,  ..., 0.3994, 0.3994, 0.3994]],

         [[1.0455, 1.0630, 1.0280,  ..., 0.3803, 0.3277, 0.3627],
          [0.9755, 1.0630, 1.0105,  ..., 0.3627, 0.3277, 0.3627],
          [1.0455, 1.0630, 1.0280,  ..., 0.3627, 0.3627, 0.3803],
          ...,
          [1.0455, 1.0105, 1.0630,  ..., 0.5903, 0.5553, 0.5378],
          [1.0805, 1.0630, 1.0980,  ..., 0.5553, 0.5553, 0.5903],
          [1.1155, 1.0280, 1.0805,  ..., 0.5378, 0.5378, 0.5378]],

         [[1.2631, 1.2805, 1.2457,  ..., 0.6008, 0.5485, 0.5834],
          [1.1934, 1.2805, 1.2282,  ..., 0.5834, 0.5485, 0.5834],
          [1.2631, 1.2805, 1.2457,  ..., 0.5834, 0.5834, 0.6008],
          ...,
          [1.2631, 1.2282, 1.2805,  ..., 0.8099, 0.7751, 0.7576],
          [1.2980, 1.2805, 1.3154,  ..., 0.7751, 0.7751, 0.8099],
          [1.3328, 1.2457, 1.2980,  ..., 0.7576, 0.7576, 0.7576]]]],
       device='cuda:0')
[{'boxes': tensor([[412.5222, 492.3208, 619.7662, 620.9233]], device='cuda:0'), 'labels': tensor([1], device='cuda:0'), 'scores': tensor([0.1527], device='cuda:0'), 'masks': tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]]], device='cuda:0')}]
/opt/conda/conda-bld/pytorch_1587452831668/work/torch/csrc/utils/python_arg_parser.cpp:760: UserWarning: This overload of nonzero is deprecated:
    nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
    nonzero(Tensor input, *, bool as_tuple)
/home/david/anaconda3/envs/proy/lib/python3.7/site-packages/torch/tensor.py:467: RuntimeWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
  'incorrect results).', category=RuntimeWarning)
/home/david/anaconda3/envs/proy/lib/python3.7/site-packages/fastai2/torch_core.py:272: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  res = getattr(super(TensorBase, self), fn)(*args, **kwargs)
/opt/conda/conda-bld/pytorch_1587452831668/work/aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
/home/david/anaconda3/envs/proy/lib/python3.7/site-packages/torchvision/models/detection/rpn.py:164: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  torch.tensor(image_size[1] / g[1], dtype=torch.int64, device=device)] for g in grid_sizes]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-44b7a9360e87> in <module>
      6 with torch.no_grad():
      7     print(model(img))
----> 8     traced_cell = torch.jit.trace(model, (img))
      9 torch.jit.save(traced_cell, modelname+"-torchscript.pth")
     10 

~/anaconda3/envs/proy/lib/python3.7/site-packages/torch/jit/__init__.py in trace(func, example_inputs, optimize, check_trace, check_inputs, check_tolerance, strict, _force_outplace, _module_class, _compilation_unit)
    881         return trace_module(func, {'forward': example_inputs}, None,
    882                             check_trace, wrap_check_inputs(check_inputs),
--> 883                             check_tolerance, strict, _force_outplace, _module_class)
    884 
    885     if (hasattr(func, '__self__') and isinstance(func.__self__, torch.nn.Module) and

~/anaconda3/envs/proy/lib/python3.7/site-packages/torch/jit/__init__.py in trace_module(mod, inputs, optimize, check_trace, check_inputs, check_tolerance, strict, _force_outplace, _module_class, _compilation_unit)
   1035             func = mod if method_name == "forward" else getattr(mod, method_name)
   1036             example_inputs = make_tuple(example_inputs)
-> 1037             module._c._create_method_from_trace(method_name, func, example_inputs, var_lookup_fn, strict, _force_outplace)
   1038             check_trace_method = module._c._get_method(method_name)
   1039 

~/anaconda3/envs/proy/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    554                 input = result
    555         if torch._C._get_tracing_state():
--> 556             result = self._slow_forward(*input, **kwargs)
    557         else:
    558             result = self.forward(*input, **kwargs)

~/anaconda3/envs/proy/lib/python3.7/site-packages/torch/nn/modules/module.py in _slow_forward(self, *input, **kwargs)
    540                 recording_scopes = False
    541         try:
--> 542             result = self.forward(*input, **kwargs)
    543         finally:
    544             if recording_scopes:

~/anaconda3/envs/proy/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py in forward(self, images, targets)
     68         if isinstance(features, torch.Tensor):
     69             features = OrderedDict([('0', features)])
---> 70         proposals, proposal_losses = self.rpn(images, features, targets)
     71         detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
     72         detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

~/anaconda3/envs/proy/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    554                 input = result
    555         if torch._C._get_tracing_state():
--> 556             result = self._slow_forward(*input, **kwargs)
    557         else:
    558             result = self.forward(*input, **kwargs)

~/anaconda3/envs/proy/lib/python3.7/site-packages/torch/nn/modules/module.py in _slow_forward(self, *input, **kwargs)
    540                 recording_scopes = False
    541         try:
--> 542             result = self.forward(*input, **kwargs)
    543         finally:
    544             if recording_scopes:

~/anaconda3/envs/proy/lib/python3.7/site-packages/torchvision/models/detection/rpn.py in forward(self, images, features, targets)
    486         proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
    487         proposals = proposals.view(num_images, -1, 4)
--> 488         boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
    489 
    490         losses = {}

~/anaconda3/envs/proy/lib/python3.7/site-packages/torchvision/models/detection/rpn.py in filter_proposals(self, proposals, objectness, image_shapes, num_anchors_per_level)
    392 
    393         # select top_n boxes independently per level before applying nms
--> 394         top_n_idx = self._get_top_n_idx(objectness, num_anchors_per_level)
    395 
    396         image_range = torch.arange(num_images, device=device)

~/anaconda3/envs/proy/lib/python3.7/site-packages/torchvision/models/detection/rpn.py in _get_top_n_idx(self, objectness, num_anchors_per_level)
    372                 pre_nms_top_n = min(self.pre_nms_top_n(), num_anchors)
    373             _, top_n_idx = ob.topk(pre_nms_top_n, dim=1)
--> 374             r.append(top_n_idx + offset)
    375             offset += num_anchors
    376         return torch.cat(r, dim=1)

RuntimeError: expected device cuda:0 but got device cpu
ptrblck commented 4 years ago

@WaterKnight1998's issue is also tracked here, with a potential solution.

fmassa commented 4 years ago

@WaterKnight1998 to complement @ptrblck's comment, it seems that your input is a TensorImage (which is not something we provide in torchvision, I believe). If you instead pass a list of 3-D tensors, it should work.
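
A minimal sketch of that fix, reusing img and loaded_trace from the snippet above (as_subclass strips the fastai TensorImage wrapper; this is our adaptation, not code from the thread):

import torch

plain = img.as_subclass(torch.Tensor)  # back to a plain Tensor
imgs = [t for t in plain]              # 4-D batch -> list of [C, H, W] tensors
with torch.no_grad():
    losses, detections = loaded_trace(imgs)  # per the declaration above, a 2-tuple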