onnx / onnx-caffe2

Caffe2 implementation of Open Neural Network Exchange (ONNX)

ONNX Tutorial: filter.dim32(i + 2) == kernel_[i] #12

Open Pavel-Akapian opened 7 years ago

Pavel-Akapian commented 7 years ago

Hello! We're trying to replicate the PyTorch ONNX Super-Resolution Tutorial. The conversion seems to work OK, but when deploying the model to iOS an error occurs (on predictor->run):

[MC] Reading from public effective user settings. 
libc++abi.dylib: terminating with uncaught exception of type caffe2::EnforceNotMet: [enforce fail at conv_op_impl.h:37] filter.dim32(i + 2) == kernel_[i].  Error from operator:  
input: "9" input: "1" output: "11" name: "" type: "Conv" arg { name: "kernels" ints: 5 ints: 5 } arg { name: "strides" ints: 1 ints: 1 } arg { name: "pads" ints: 2 ints: 2 ints: 2 ints: 2 } arg { name: "dilations" ints: 1 ints: 1 } arg { name: "group" i: 1 }

We can run original Caffe2 models on the device. When we compared manually written Caffe2 models with the model produced by the conversion tool, we noticed that the conversion tool adds the following (maybe it helps to pinpoint the issue?):

device_option {
  device_type: 0
  cuda_gpu_id: 0
}

This problem also reproduces on simpler examples.
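For reference, this is roughly how we export the model on the server side, following the tutorial, before converting it to Caffe2 pb files with onnx-caffe2 (a sketch from memory; the file name is ours, and torch_model is the tutorial's SuperResolutionNet instance):

import torch
from torch.autograd import Variable

# torch_model is the tutorial's super-resolution model with weights loaded.
torch_model.train(False)

# The small tutorial model takes a single-channel (Y) 224x224 input.
x = Variable(torch.randn(1, 1, 224, 224), requires_grad=True)

# Export to ONNX; this step completes without errors for us.
torch.onnx.export(torch_model, x, "super_resolution.onnx", export_params=True)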

prigoyal commented 7 years ago

@Pavel-Akapian thank you for trying out the tutorial. To be able to help further, I need some information on how you are running the model.

Can you describe which super-resolution model version you are using? The tutorial highlights 1) a small model that is also available in the PyTorch examples, and 2) the SRResNet model.

What image processing did you use, if any? What are the input dimensions of the image you feed to the model?

The error seems to indicate that the input data is not what the model is expecting. Have you been able to successfully run the tutorial up to the mobile-execution part, stepping through with pdb? That would be the first thing to get right.
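Concretely, the server-side check I mean looks roughly like this (a sketch, not the tutorial code verbatim; the file name is a placeholder):

import numpy as np
import onnx
import onnx_caffe2.backend

# Load the exported ONNX model and prepare the Caffe2 backend for it.
model = onnx.load("super_resolution.onnx")
prepared_backend = onnx_caffe2.backend.prepare(model)

# The small tutorial model expects a single-channel 1x1x224x224 input.
x = np.random.rand(1, 1, 224, 224).astype(np.float32)
W = {model.graph.input[0].name: x}
c2_out = prepared_backend.run(W)[0]
print(c2_out.shape)  # (1, 1, 672, 672) for upscale_factor=3

If that works in Python, the problem is more likely on the mobile side than in the exported model itself.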

Pavel-Akapian commented 7 years ago

@prigoyal thank you for the quick reply. I'm experimenting with the model below:

import torch.nn as nn
import torch.nn.init as init

class SuperResolutionNet(nn.Module):
    def __init__(self, upscale_factor, inplace=False):
        super(SuperResolutionNet, self).__init__()

        self.relu = nn.ReLU(inplace=inplace)
        self.conv1 = nn.Conv2d(1, 64, (5, 5), (1, 1), (2, 2))
        self.conv2 = nn.Conv2d(64, 64, (3, 3), (1, 1), (1, 1))
        self.conv3 = nn.Conv2d(64, 32, (3, 3), (1, 1), (1, 1))
        self.conv4 = nn.Conv2d(32, upscale_factor ** 2, (3, 3), (1, 1), (1, 1))
        self.pixel_shuffle = nn.PixelShuffle(upscale_factor)

        self._initialize_weights()

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.relu(self.conv3(x))
        x = self.pixel_shuffle(self.conv4(x))
        return x

    def _initialize_weights(self):
        init.orthogonal(self.conv1.weight, init.calculate_gain('relu'))
        init.orthogonal(self.conv2.weight, init.calculate_gain('relu'))
        init.orthogonal(self.conv3.weight, init.calculate_gain('relu'))
        init.orthogonal(self.conv4.weight)

# Create the super-resolution model by using the above model definition.
torch_model = SuperResolutionNet(upscale_factor=3)

I don't quite understand what you mean by two versions (I don't see the SRResNet one). I can run the tutorial from the very start down to the lines

# Save the image, we will compare this with the output image from mobile device
final_img.save("./_static/img/cat_superres.jpg")

with no errors, and 'cat_superres.jpg' is successfully created on the server machine. I also tried a 'dummy' model with only one conv layer with default initialization (no training) and got the same error. The input shape is 1x3x224x224, with no pre-processing.

const int predHeight = 224;
const int predWidth = 224;
const int crops = 1;
const int channels = 3;

input.Resize(std::vector<int>({crops, channels, predHeight, predWidth}));

I'm going to upload the pb files later; I expect that will help a lot.

Pavel-Akapian commented 7 years ago

@prigoyal My colleagues and I compared the pb files and found how to resolve this problem. The end of predict_net is:

external_input: "1"
external_input: "2"
external_input: "3"
external_input: "4"
external_input: "5"
external_input: "6"
external_input: "7"
external_input: "8"
external_input: "9"
external_output: "27"
external_output: "_onnx_dummy1"
external_output: "_onnx_dummy2"

We need to move the last external_input: "9" before external_input: "1":

external_input: "9"
external_input: "1"
external_input: "2"
external_input: "3"
external_input: "4"
external_input: "5"
external_input: "6"
external_input: "7"
external_input: "8"
external_output: "27"
external_output: "_onnx_dummy1"
external_output: "_onnx_dummy2"

In fact, the image is loaded into the first external_input.
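For anyone else hitting this, the same fix can be done programmatically instead of editing the text proto by hand (a sketch; the blob name "9" is specific to this particular export):

from caffe2.proto import caffe2_pb2

predict_net = caffe2_pb2.NetDef()
with open("predict_net.pb", "rb") as f:
    predict_net.ParseFromString(f.read())

# Move the data blob to the front, since on mobile the image is read from
# the first external_input.
inputs = list(predict_net.external_input)
inputs.remove("9")
inputs.insert(0, "9")
del predict_net.external_input[:]
predict_net.external_input.extend(inputs)

with open("predict_net_reordered.pb", "wb") as f:
    f.write(predict_net.SerializeToString())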

ezyang commented 7 years ago

OK, the fact that the PyTorch exporter places the actual inputs at the end of the inputs list (rather than the beginning) is a known wart. onnx-caffe2 is able to handle this if you don't use the protobuf manually, but we plan on fixing it. EDIT: This doesn't seem to be the actual problem here.

prigoyal commented 7 years ago

@Pavel-Akapian the issue is actually quite simple, and there is no need to modify the pb here. This version of the super-resolution model requires an input image of dim 1x1x224x224, and the reason for that is mentioned in the tutorial. The error you were getting also indicates that the filter dim is not right. Can you please try it out by passing the correct input, without modifying the pb manually?
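A quick way to confirm the expected input shape on the PyTorch side (just a sketch, using the model definition you posted above):

import torch
from torch.autograd import Variable

# Using the SuperResolutionNet definition posted above.
model = SuperResolutionNet(upscale_factor=3)

# conv1 is nn.Conv2d(1, 64, ...), so the net expects a single-channel input.
out = model(Variable(torch.randn(1, 1, 224, 224)))
print(out.size())  # torch.Size([1, 1, 672, 672])

# A 3-channel 1x3x224x224 input fails with a channel/filter mismatch, which is
# the same kind of mismatch the Caffe2 Conv enforce in your log is reporting.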

Pavel-Akapian commented 7 years ago

@prigoyal this also happens with 1x1x224x224.

prigoyal commented 7 years ago

@Pavel-Akapian it's actually slightly weird that you were able to execute the nets until

# Save the image, we will compare this with the output image from mobile device
final_img.save("./_static/img/cat_superres.jpg")

as you mentioned, and that didn't require any manual tampering with the pb, but executing on iOS apparently does. Can you create a simple repro of the error so we can look into it further? Tampering with the pb is not the right solution; the real cause should be figured out. I am not able to repro this issue with the tutorial yet. Also, were you able to deploy on an Android device instead, following the adb instructions in the tutorial?

Pavel-Akapian commented 7 years ago

Here are the pb files generated by an exact run of the tutorial; they run on the server but not on iOS. (The '.txt' extension is fake so GitHub will accept the upload.) init_net.pb.txt predict_net.pb.txt @prigoyal Unfortunately, we aren't developing an Android app at the moment, so it would be difficult to compare.

bwasti commented 7 years ago

Hey @Pavel-Akapian how are you running the network?

There are a couple of ways to do it, but I am guessing you are using the Predictor API? This requires external_input[0] (the first one) to be the input data. As you correctly determined, this is not what the PyTorch exporter produced.

You can verify this by running the network via workspace.RunNet(predict_net) instead and populating the blob "9" yourself.
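Roughly along these lines (a sketch; the blob names "9" and "27" come from the predict_net you posted, and the pb file names are the ones you attached):

import numpy as np
from caffe2.python import workspace
from caffe2.proto import caffe2_pb2

init_net = caffe2_pb2.NetDef()
with open("init_net.pb", "rb") as f:
    init_net.ParseFromString(f.read())
predict_net = caffe2_pb2.NetDef()
with open("predict_net.pb", "rb") as f:
    predict_net.ParseFromString(f.read())

# Materialize the weights, then feed the data blob by name instead of relying
# on the Predictor's external_input[0] convention.
workspace.RunNetOnce(init_net)
workspace.FeedBlob("9", np.random.rand(1, 1, 224, 224).astype(np.float32))

# RunNetOnce avoids having to register/name the net first.
workspace.RunNetOnce(predict_net)
print(workspace.FetchBlob("27").shape)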

jerryzh168 commented 7 years ago

@bwasti we should update predictor to be more flexible, similar to this: https://github.com/onnx/onnx-caffe2/blob/master/onnx_caffe2/backend.py#L318-L321

bwasti commented 7 years ago

@jerryzh168 what do you mean by more flexible?

One thing that might be nice is a TensorMap that takes string->Tensor for inputs, so we can use blob names instead of the rather annoying ordering of external_input.

This is the culprit code btw: https://github.com/caffe2/caffe2/blob/master/caffe2/core/predictor.cc#L48

jerryzh168 commented 7 years ago

@bwasti Yeah, by flexible I mean the predictor shouldn't depend on the ordering of external_input. We can figure out which blobs are missing by calling workspace.HasBlob. An extension that takes a string->Tensor map as input would be nice to have too.
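Roughly this kind of thing on the Python side (a sketch of the idea, not the actual predictor code):

import numpy as np
from caffe2.python import workspace

def run_with_named_inputs(init_net, predict_net, inputs):
    # inputs is a dict of blob name -> numpy array (the string->Tensor map).
    workspace.RunNetOnce(init_net)

    # Whatever init_net did not create must be provided by the caller,
    # regardless of where it sits in external_input.
    missing = [b for b in predict_net.external_input if not workspace.HasBlob(b)]
    for name in missing:
        workspace.FeedBlob(name, inputs[name])

    workspace.RunNetOnce(predict_net)
    return [workspace.FetchBlob(b) for b in predict_net.external_output]

# e.g. run_with_named_inputs(init_net, predict_net,
#                            {"9": np.random.rand(1, 1, 224, 224).astype(np.float32)})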