hollance / coreml-survival-guide

Source code for the book Core ML Survival Guide
MIT License

Dimension 1 of input confidence (-1880735433) is not consistent with the number of classes (80) #2

Closed: glenn-jocher closed this issue 5 years ago

glenn-jocher commented 5 years ago

I have a yolov3.onnx model that I've converted to Core ML, pipelined with an NMS model, and dropped into the example ObjectDetection app. The process seemed smooth enough, but when running on my iPhone XS (iOS 12.1) I get the error below. I decode natively in yolov3, and I send boxes and scores of dimension [4, 507, 1] and [80, 507, 1] from my yolov3.mlmodel to my nms.mlmodel; the pipeline compiles fine.

Your blog post seems to imply that your decoder model sends these to the NMS in the opposite permutation, though, which confuses me:

Now we can permute this from (4, 1917, 1) to (1, 1917, 4) and write the results to the second output of the decoder model, "raw_coordinates":

I have only two models in the pipeline, 0 (yolov3) and 1 (nms), so according to the error the NMS model is apparently not getting the shape it wants. Is there a way to load the pipeline and debug the input/output shapes directly? The Netron viewer isn't very informative on the pipeline model: https://storage.googleapis.com/ultralytics/yolov3_tiny_pipelined.mlmodel (15MB)

Failed to perform Vision request: Error Domain=com.apple.vis Code=3 "The VNCoreMLTransform request failed" UserInfo={NSLocalizedDescription=The VNCoreMLTransform request failed, NSUnderlyingError=0x283b1dc20 {Error Domain=com.apple.CoreML Code=0 "Failed to evaluatue model 1 in pipeline" UserInfo={NSLocalizedDescription=Failed to evaluatue model 1 in pipeline, NSUnderlyingError=0x283b1d860 {Error Domain=com.apple.CoreML Code=0 "Dimension 1 of input confidence (-1880735433) is not consistent with the number of classes (80)" UserInfo={NSLocalizedDescription=Dimension 1 of input confidence (-1880735433) is not consistent with the number of classes (80)}}}}}

hollance commented 5 years ago

Indeed, the first dimension must be the number of boxes, so you'll need to add a permute layer at the end of the model that swaps the dimensions.
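For example, a minimal spec-surgery sketch with coremltools (the filename and the blob names raw_confidence / raw_confidence_permuted are placeholders; adjust them to your decoder's actual outputs):

import coremltools

spec = coremltools.utils.load_spec("yolov3.mlmodel")  # hypothetical filename
nn = spec.neuralNetwork

# Append a permute layer that swaps the class and box axes,
# e.g. (80, 507, 1) as (C, H, W) becomes (507, 80, 1).
permute = nn.layers.add()
permute.name = "permute_confidence"
permute.input.append("raw_confidence")            # hypothetical blob name
permute.output.append("raw_confidence_permuted")
permute.permute.axis.extend([0, 2, 1, 3])         # axes are (S, C, H, W); swap C and H

coremltools.utils.save_spec(spec, "yolov3_permuted.mlmodel")

You'd also need to point the model's output at the new blob (or rename the blobs so the permute layer produces the original output name) before rebuilding the pipeline.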

It's true that Netron isn't super clear on how it displays the pipelines, so I prefer to look at the model directly using Python and coremltools.

You can load the pipeline with coremltools, then look at each individual model in the pipeline and its input/output shapes, and even add the permute layer there if you want to. All of these things are explained in the book. :-)
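For reference, a quick inspection sketch (the filename is from your link; the rest is standard coremltools protobuf access):

import coremltools

spec = coremltools.utils.load_spec("yolov3_tiny_pipelined.mlmodel")

# A pipeline spec stores its submodels in spec.pipeline.models.
for i, model in enumerate(spec.pipeline.models):
    print("Model", i)
    for inp in model.description.input:
        print("  input: ", inp.name, inp.type.WhichOneof("Type"))
    for out in model.description.output:
        print("  output:", out.name, out.type.WhichOneof("Type"))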

I'm not sure where the strange number -1880735433 comes from, though. That's probably Core ML getting confused.

glenn-jocher commented 5 years ago

Ah, I see. Yes, you're right, the NMS model wants the inputs permuted. Unfortunately I'm still getting the same error message after doing that. Is there a way to actually run the Core ML model using coremltools? If I could run an example image through it and put breakpoints in the process to observe the shapes of the variables, that would be super useful.

hollance commented 5 years ago

Yes, you can: model.predict({"input": image}), where model refers to the pipeline model or one of the models inside it, and image is a PIL image object. To see the shapes of the intermediate layers, you can look at the output from the Core ML compiler. If it prints ? for a shape, you'll need to remove layers from the end of the mlmodel in order to look at the intermediate shapes.
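Something like this (the filename and input name are assumptions, so check the printed description for the real ones; predict only runs on macOS):

import coremltools
from PIL import Image

model = coremltools.models.MLModel("yolov3.mlmodel")
print(model.get_spec().description)   # shows the input/output names and types

img = Image.open("zidane.jpg").resize((416, 416))  # whatever size the model expects
out = model.predict({"image": img}, useCPUOnly=True)
for name, value in out.items():
    print(name, getattr(value, "shape", value))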

glenn-jocher commented 5 years ago

Thanks! After trying this, it appears the Core ML model expects a 416x416 letterboxed image rather than the 1280x720 image I tried to feed it. I was always confused about where this letterboxing/resizing takes place in the pipeline, since I didn't see it explicitly in the blog post. Could this be something tfcoreml does as a preprocessing step that I'm missing, since I'm using a PyTorch to onnx_coreml workflow?

import coremltools
from PIL import Image

yolov3_model = coremltools.models.MLModel('yolov3.mlmodel')  # the decoder model (filename assumed)
img = Image.open('zidane.jpg')
out = yolov3_model.predict({'0': img})
... 
Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydevd_bundle/pydevd_exec.py", line 3, in Exec
    exec exp in global_vars, local_vars
  File "<string>", line 3, in <module>
  File "/Users/glennjocher/PycharmProjects/onnx-coreml/venv/lib/python2.7/site-packages/coremltools/models/model.py", line 327, in predict
    return self.__proxy__.predict(data,useCPUOnly)
RuntimeError: {
    NSLocalizedDescription = "Input image feature 0 does not match model description";
    NSUnderlyingError = "Error Domain=com.apple.CoreML Code=0 \"Image size 1280 x 720 not in allowed set of image sizes\" UserInfo={NSLocalizedDescription=Image size 1280 x 720 not in allowed set of image sizes}";
}

If I letterbox to 416 pixels (see the PIL sketch after the output below), everything works... and it shows me I somehow lost a class (79 where there should be 80) and an anchor (506 where there should be 507)! I can work on fixing that, but what about resizing the iPhone camera feed to 416?

img = Image.open('zidane_416.jpg')
out = yolov3_model.predict({'0': img})
out['133'].shape
(1, 1, 1, 506, 79)
out['135'].shape
(1, 1, 1, 506, 4)
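For the record, a minimal PIL letterboxing sketch (filenames and the gray fill color are arbitrary):

from PIL import Image

def letterbox(img, size=416, fill=(128, 128, 128)):
    # Scale the longer side to `size`, keeping the aspect ratio,
    # then paste onto a square canvas so the remainder is padding.
    w, h = img.size
    scale = float(size) / max(w, h)
    resized = img.resize((int(round(w * scale)), int(round(h * scale))), Image.BILINEAR)
    canvas = Image.new("RGB", (size, size), fill)
    canvas.paste(resized, ((size - resized.size[0]) // 2, (size - resized.size[1]) // 2))
    return canvas

letterbox(Image.open("zidane.jpg")).save("zidane_416.jpg")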

hollance commented 5 years ago

Could be. TF models sometimes include the resizing (and normalization) but not always. In the blog post the resizing is automatically handled by the Vision framework (a good reason for using Vision).

glenn-jocher commented 5 years ago

Ah, I understand! I've got everything working now, except that I discovered a major issue: broadcasting is not supported in Core ML (i.e. I can't multiply my [1, 507, 80] prob_cls with my [1, 507, 1] prob_obj like I can in PyTorch).

If I try to export this operation from PyTorch, the export fails. Since there are no for loops in Core ML, and apparently Tile operations are not supported during export either (NotImplementedError: Unsupported ONNX ops of type: Tile), would it be possible to build a simple decoder model that incorporates this multiplication (multiplying all 80 prob_cls values at each anchor by their corresponding prob_obj)?

hollance commented 5 years ago

The way the Turi Create model does this is with a concat layer that repeats prob_obj K times, where K is the number of classes (so in your case 80 times). It then multiplies the output of that concat layer with prob_cls.
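A rough sketch of that workaround as coremltools spec surgery (all layer/blob names and the filename are made up; shapes are in Core ML's (C, H, W) layout, so prob_obj is (1, 507, 1) and prob_cls is (80, 507, 1)):

import coremltools

spec = coremltools.utils.load_spec("yolov3_decoder.mlmodel")
nn = spec.neuralNetwork

# Concatenate 80 copies of prob_obj along the channel axis,
# turning (1, 507, 1) into (80, 507, 1) so it lines up with prob_cls.
concat = nn.layers.add()
concat.name = "repeat_prob_obj"
for _ in range(80):
    concat.input.append("prob_obj")
concat.output.append("prob_obj_repeated")
concat.concat.MergeFromString(b"")     # default params: concat along channels

# Elementwise multiply: raw_confidence = prob_cls * repeated prob_obj.
multiply = nn.layers.add()
multiply.name = "confidence"
multiply.input.append("prob_cls")
multiply.input.append("prob_obj_repeated")
multiply.output.append("raw_confidence")
multiply.multiply.MergeFromString(b"")

coremltools.utils.save_spec(spec, "yolov3_decoder_fixed.mlmodel")

You may also need to update spec.description.output so the new raw_confidence blob is what gets fed into the NMS model.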

glenn-jocher commented 5 years ago

Ah I see. I'll try that. Thanks!!