Export with ONNX Simplifier with --grid error

antlamon commented 3 years ago

🐛 Bug

A model exported to ONNX with the --grid parameter cannot be loaded by onnxruntime or simplified by onnx-simplifier. A Mul node triggers a shape inference error: Incompatible dimensions.

To Reproduce

Replace the ONNX export in export.py with this code and run it with the command python3 models/export.py --grid:

    try:
        import onnx
        from onnxsim import simplify
        print('\nStarting ONNX export with onnx %s...' % onnx.__version__)
        f = opt.weights.replace('.pt', '.onnx')  # filename
        torch.onnx.export(model, img, f, verbose=False, opset_version=12, input_names=['images'],
                          output_names=['classes',
                                        'boxes'] if y is None else ['output'],
                          dynamic_axes={'images': {0: 'batch', 2: 'height', 3: 'width'},  # size(1,3,640,640)
                                        'output': {0: 'batch', 2: 'y', 3: 'x'}} if opt.dynamic else None)

        # Checks
        onnx_model = onnx.load(f)  # load onnx model
        onnx.checker.check_model(onnx_model)  # check onnx model

        # This step triggers the error
        model_simp, check = simplify(onnx_model)
        onnx.save(model_simp, f)

        # print(onnx.helper.printable_graph(onnx_model.graph))  # print a human readable model
        print('ONNX export success, saved as %s' % f)
    except Exception as e:
        print('ONNX export failure: %s' % e)

Output:

Starting ONNX export with onnx 1.8.1...
ONNX export failure: [ONNXRuntimeError] : 1 : FAIL : Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions

Expected behavior

Any yolov5 model exported as ONNX should be valid

github-actions[bot] commented 3 years ago

👋 Hello @antlamon, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Google Colab and Kaggle notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 3 years ago

@antlamon thanks for the bug report. We don't generally provide support for code customizations or for external packages that are not in requirements.txt.

If an external package is causing an error you may also want to raise an issue with the package authors.

tommy2is commented 3 years ago

I would like to add that without any modifications to export.py, the --grid option also results in an unusable .onnx file when run on the yolov5s.pt model. It works fine without the --grid option.

While the script was running, the following warning was produced (by TorchScript, not ONNX, though):

Starting TorchScript export with torch 1.8.0+cu101...
./models/yolo.py:48: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if self.grid[i].shape[2:4] != x[i].shape[2:4]:
/usr/local/lib/python3.7/dist-packages/torch/jit/_trace.py:940: TracerWarning: Encountering a list at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for `list`, use a `tuple` instead. for `dict`, use a `NamedTuple` instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.
  _force_outplace,
TorchScript export success, saved as yolov5s.torchscript.pt

Attempting to run an inference session results in:

Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from yolov5s.onnx failed:Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions

Lucashsmello commented 3 years ago

I hit this error when exporting the ONNX model using torch==1.8.1 with torchvision==0.9.1. When I export using torch==1.7.1, the ONNX model loads fine in both torch==1.7.1 and torch==1.8.1.

tommy2is commented 3 years ago

Thank you for pointing that out. That indeed was the issue

thestonehead commented 3 years ago

Will there be a fix, since torchvision==0.8.2 (required by torch 1.7.1) doesn't exist for Windows?

TheanMS commented 3 years ago

I am also getting the same error. Downgrading the torch and torchvision versions did not fix the issue for me.

timstokman commented 3 years ago

When I downgrade the pytorch version and export with --dynamic --grid, I can load the model, but it fails when doing inference on a (1, 3, 1088, 1920) tensor with this:

2021-04-15 20:26:39.079097920 [E:onnxruntime:, sequential_executor.cc:339 Execute] Non-zero status code returned while running Add node. Name:'Add_945' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:487 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 34 by 60

Traceback (most recent call last):
  File "test_onnx.py", line 42, in <module>
    outputs = sess.run(None, {input_name: image})
  File "venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'Add_945' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:487 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 34 by 60

It does work with just --dynamic. I've seen some approaches where people re-implement the last layer with onnx. I guess that's probably the best approach for now.
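
For reference, the 34 and 60 in the message match the spatial size of the coarsest feature map for a 1088x1920 input. A minimal sketch of that arithmetic, assuming detection strides of 8, 16 and 32 (the strides are not stated in this thread, and `feature_map_sizes` is a hypothetical helper, not part of YOLOv5):

```python
# Hypothetical helper: spatial size of each detection feature map.
# Assumes detection strides of 8, 16 and 32, which this thread does not state.
STRIDES = (8, 16, 32)


def feature_map_sizes(height, width, strides=STRIDES):
    """Return {stride: (ny, nx)} for an input of the given height and width."""
    sizes = {}
    for s in strides:
        if height % s or width % s:
            raise ValueError(f"{height}x{width} is not a multiple of stride {s}")
        sizes[s] = (height // s, width // s)
    return sizes


sizes = feature_map_sizes(1088, 1920)
for stride, (ny, nx) in sizes.items():
    print(f"stride {stride}: {ny} x {nx}")
```

Under that assumption the stride-32 output is 34 x 60, so the failing Add most likely involves a tensor sized for that output. The sketch only shows where the numbers come from; it does not identify the node.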

glenn-jocher commented 3 years ago

@antlamon @Lucashsmello @thestonehead @timstokman we've integrated onnx-simplifier into export.py now in ONNX Simplifier PR #2815 and verified it's passing CI on all operating systems.

I'm not sure if this resolves the original issue, but hopefully it's a step in the right direction.

piotlinski commented 3 years ago

@glenn-jocher Unfortunately, the problem still persists: I am using the docker image (version v5.0) and --grid causes ONNX export to fail on simplifying. The resulting onnx file cannot be used due to Incompatible dimensions error. However, rolling back pytorch to 1.8.0 (i.e. using the docker image v4.0 with latest repository version, which includes ONNX simplifier) works OK.

timstokman commented 3 years ago

@glenn-jocher I'm seeing the same issues: neither --grid nor --dynamic works with the simplifier. --grid export only seems to work in a few cases, even without the simplifier. I made a pull request for the "--dynamic" export issue: https://github.com/ultralytics/yolov5/pull/2856

glenn-jocher commented 3 years ago

@timstokman thanks for the PR, I'll take a look over there!

timstokman commented 3 years ago

To give a reproduction of the grid export issue now that the PR is merged:

python models/export.py --simplify --grid

Namespace(batch_size=1, device='cpu', dynamic=False, grid=True, img_size=[640, 640], simplify=True, weights='./yolov5s.pt')
YOLOv5 🚀 v5.0-15-g1df8c6c torch 1.8.1+cu102 CPU

Fusing layers... 
Model Summary: 224 layers, 7266973 parameters, 0 gradients, 17.0 GFLOPS

TorchScript: starting export with torch 1.8.1+cu102...
./models/yolo.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if self.grid[i].shape[2:4] != x[i].shape[2:4]:
TorchScript: export success, saved as ./yolov5s.torchscript.pt
ONNX: starting export with onnx 1.9.0...
ONNX: simplifying with onnx-simplifier 0.3.5...
ONNX: simplifier failure: [ONNXRuntimeError] : 1 : FAIL : Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions
ONNX: export success, saved as ./yolov5s.onnx
CoreML: export failure: No module named 'coremltools'

Export complete (4.56s). Visualize with https://github.com/lutzroeder/netron.

Without --simplify the model simply can't be loaded by the runtime.

It looks like the last layer has incompatible dimensions when exported.

glenn-jocher commented 3 years ago

@timstokman hmm, so the onnx runtime only succeeds with a --simplify model, but --simplify fails when --grid is also used?

timstokman commented 3 years ago

@glenn-jocher They both fail:

When using --grid without simplify, it generates a model that can't be loaded with onnxruntime. It fails with this error: 1 : FAIL : Load model from yolov5s.onnx failed:Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions.
When using --grid --simplify, the simplifier probably notices that the last layer has issues, and generates the exact same error: simplifier failure: [ONNXRuntimeError] : 1 : FAIL : Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions

The root cause seems to be in how the last layer is exported: some sort of tensor dimension mismatch.

piotlinski commented 3 years ago

@timstokman out of curiosity: what pytorch version are you using? EDIT: I see, 1.8.1, sorry for the question.

@glenn-jocher I managed to make simplify with grid work by rolling back pytorch to 1.8 (1.9 used in the latest docker image did not work, I don't know what happens if installed on host OS, not in docker)

Perhaps it's ONNX version that causes the issue? In the older yolov5 image (v4.0) it is 1.7.0 AFAIR

timstokman commented 3 years ago

@piotlinski I used the latest version, and the one you suggested. With pytorch 1.8 it works with the default options, but as soon as you use --dynamic or --img-size it stops working. With the latest version, it doesn't work at all.

piotlinski commented 3 years ago

@timstokman interesting, I tried pytorch 1.8 and can set img-size (I did not try dynamic, though). I use the older version, where the simplifier is always run. (The log says YOLOv5 v4.0, but I manually checked out a newer commit.)

docker exec -it yolov5 python models/export.py --weights /usr/src/model.pt --img 288 480 --batch-size 1 --grid
Namespace(batch_size=1, device='cpu', dynamic=False, grid=True, img_size=[288, 480], weights='/usr/src/model.pt')
YOLOv5 🚀 v4.0-207-gaff03be torch 1.8.0a0+1606899 CPU

Fusing layers... 
Model Summary: 224 layers, 7053910 parameters, 0 gradients, 16.3 GFLOPS

TorchScript: starting export with torch 1.8.0a0+1606899...
/usr/src/app/models/yolo.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if self.grid[i].shape[2:4] != x[i].shape[2:4]:
TorchScript: export success, saved as /usr/src/model.torchscript.pt
ONNX: starting export with onnx 1.7.0...
ONNX: simplifying with onnx-simplifier 0.3.4...
(op_type:Slice, name:Slice_4): Inferred shape and existing shape differ in dimension 2: (288) vs (144)
(op_type:Slice, name:Slice_4): Inferred shape and existing shape differ in dimension 2: (288) vs (144)
(op_type:Slice, name:Slice_4): Inferred shape and existing shape differ in dimension 2: (288) vs (144)
ONNX: export success, saved as /usr/src/model.onnx
CoreML: export failure: No module named 'coremltools'

Export complete (5.41s). Visualize with https://github.com/lutzroeder/netron.

EDIT: with --dynamic I get:

ONNX: simplifying with onnx-simplifier 0.3.4...
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
ONNX: simplifier failure: The shape of input "images" has dynamic size "[0, 3, 0, 0]", please determine the input size manually by "--dynamic-input-shape --input-shape xxx" or "--input-shape xxx". Run "python3 -m onnxsim -h" for details

timstokman commented 3 years ago

@piotlinski Update to the latest yolo version to fix the error with --dynamic, and get the actual error.

piotlinski commented 3 years ago

@timstokman no error with --dynamic and latest version here, provided the same versions of libraries as above.

python models/export.py --weights /usr/src/model.pt --img 288 480 --grid --dynamic --batch-size 1
Namespace(batch_size=1, device='cpu', dynamic=True, grid=True, img_size=[288, 480], simplify=False, weights='/usr/src/model.pt')
YOLOv5 🚀 v5.0-17-gc949fc8 torch 1.8.0a0+1606899 CPU

Fusing layers...
Model Summary: 224 layers, 7053910 parameters, 0 gradients, 16.3 GFLOPS

TorchScript: starting export with torch 1.8.0a0+1606899...
/usr/src/app/models/yolo.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if self.grid[i].shape[2:4] != x[i].shape[2:4]:
TorchScript: export success, saved as /usr/src/model.torchscript.pt
ONNX: starting export with onnx 1.7.0...
ONNX: export success, saved as /usr/src/model.onnx
CoreML: export failure: No module named 'coremltools'

Export complete (3.73s). Visualize with https://github.com/lutzroeder/netron.

When running with --simplify I get only some info messages:

ONNX: starting export with onnx 1.7.0...
ONNX: simplifying with onnx-simplifier 0.3.4...
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
ONNX: export success, saved as /usr/src/model.onnx

timstokman commented 3 years ago

Looks like the docker image also has different versions of onnx and onnx-simplifier. Maybe the requirements.txt of the yolo project needs to start pinning a few versions for this to work reliably.

@piotlinski Can you actually do inference with the exported model?

piotlinski commented 3 years ago

@timstokman the ones exported earlier (without the --dynamic flag) work OK. I haven't checked the dynamic models

jylink commented 3 years ago

I changed this line in models/yolo.py, and now both --grid and --grid --simplify pass. ONNX Runtime works fine too.

# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * torch.tensor(self.anchor_grid[i].tolist()).float()  # wh

goderent commented 3 years ago

I found that the cause of the bug is the inconsistent behavior of [i] indexing in PyTorch and ONNX. The shape of anchor_grid is (3,1,3,1,1,2). The shape of anchor_grid[i] is (1,3,1,1,2) in PyTorch, but it is (1,1,3,1,1,2) in ONNX. So we must make the shape of anchor_grid[i] explicit. Just modify this line in models/yolo.py:

# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(bs, self.na, 1, 1, 2) # wh
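
A small sketch of why the extra leading dimension matters, using NumPy-style broadcasting on plain shape tuples. The two anchor shapes are the ones reported above; bs=1, na=3 and the 80x80 feature map are assumptions, and `broadcast_shape` and `view` are hypothetical helpers that only mimic the shape rules, not tensor code:

```python
from math import prod


def broadcast_shape(a, b):
    """Broadcast two shapes with NumPy-style rules, or raise ValueError."""
    # Left-pad the shorter shape with 1s so both have the same rank.
    rank = max(len(a), len(b))
    a = (1,) * (rank - len(a)) + tuple(a)
    b = (1,) * (rank - len(b)) + tuple(b)
    out = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            out.append(da)
        elif da == 1:
            out.append(db)
        else:
            raise ValueError(f"incompatible dimensions: {da} vs {db}")
    return tuple(out)


def view(shape, new_shape):
    """Shape after a reshape; legal only if the element count is unchanged."""
    if prod(shape) != prod(new_shape):
        raise ValueError(f"cannot view {shape} as {new_shape}")
    return tuple(new_shape)


bs, na = 1, 3                # assumed batch size and anchors per output
wh = (bs, na, 80, 80, 2)     # assumed shape of y[..., 2:4]
reported = {"pytorch": (1, 3, 1, 1, 2), "onnx": (1, 1, 3, 1, 1, 2)}

for name, anchor in reported.items():
    raw = broadcast_shape(wh, anchor)
    fixed = broadcast_shape(wh, view(anchor, (bs, na, 1, 1, 2)))
    print(name, raw, fixed)
```

With the shape reported for PyTorch the product keeps the rank of y[..., 2:4]. With the shape reported for ONNX the product gains a leading dimension, and the explicit view removes it again. Whether this is exactly what triggers the Mul_925 error is not established in this thread.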

jylink commented 3 years ago

By the way, the exported ONNX model cannot be converted to a TensorRT engine, because subscript assignments generate unsupported ScatterND nodes. I rewrote the code to avoid generating ScatterND:

# y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# z.append(y.view(bs, -1, self.no))
xy = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(bs, self.na, 1, 1, 2)  # wh
rest = y[..., 4:]
yy = torch.cat((xy, wh, rest), -1)
z.append(yy.view(bs, -1, self.no))
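
The rewrite above swaps in-place slice assignment for computing the parts separately and concatenating them. A plain-Python analogy on a single prediction row, to show that the two formulations give the same values; the function names, the `grid_xy` and `anchor_wh` arguments and the sample numbers are illustrative assumptions, not YOLOv5 code:

```python
def decode_inplace(row, grid_xy, anchor_wh, stride):
    """Decode by assigning into slices of a copy (the original formulation)."""
    y = list(row)
    y[0:2] = [(v * 2.0 - 0.5 + g) * stride for v, g in zip(y[0:2], grid_xy)]
    y[2:4] = [(v * 2.0) ** 2 * a for v, a in zip(y[2:4], anchor_wh)]
    return y


def decode_concat(row, grid_xy, anchor_wh, stride):
    """Decode by computing each part separately and concatenating them."""
    xy = [(v * 2.0 - 0.5 + g) * stride for v, g in zip(row[0:2], grid_xy)]
    wh = [(v * 2.0) ** 2 * a for v, a in zip(row[2:4], anchor_wh)]
    rest = list(row[4:])
    return xy + wh + rest


# Made-up sample row: x, y, w, h, objectness, one class score.
row = [0.5, 0.25, 0.5, 0.75, 0.9, 0.1]
args = ((3.0, 4.0), (10.0, 13.0), 8.0)  # grid offset, anchor size, stride

assert decode_inplace(row, *args) == decode_concat(row, *args)
print(decode_concat(row, *args))
```

The results are identical; only the way the output is assembled differs, which is what changes the nodes an exporter has to emit.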

timstokman commented 3 years ago

@jylink Tried your code and exporting works fine now. When I try to use dynamic axes, it still seems to fail when running the model:

E onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'Add_455' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:487 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 34 by 60

Here I tried a tensor of 1088x1920x3 as input (stride 32 padded) for an image that was originally 1080x1920x3.

When using a fully padded tensor, 1920x1920x3, the predict layer does seem to work correctly, so this is a big improvement. I suggest you create a pull request for it.

Personally I still can't use --grid exports, because dynamic axes give me an almost 2x speed improvement and help with CUDA memory usage.
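
The "stride 32 padded" size mentioned above can be computed by rounding each side up to the next multiple of the stride. A minimal sketch; `pad_to_stride` is a hypothetical helper name:

```python
def pad_to_stride(size, stride=32):
    """Round size up to the next multiple of stride (integer arithmetic only)."""
    return -(-size // stride) * stride


height, width = pad_to_stride(1080), pad_to_stride(1920)
print(height, width)
```

For a 1080x1920 image this gives 1088x1920, the input size used in the failing run.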

jylink commented 3 years ago

@timstokman Hi, I found that self.grid[i] does not match the dynamic y[..., 0:2]. I don't know if it is the best way, but I added a variable self.dynamic, and now --dynamic, --grid, --simplify, onnxruntime and the TensorRT engine all pass:

# models/yolo.py
class Detect(nn.Module):
    stride = None  # strides computed during build
    export = False  # onnx export
    dynamic = False  # <--NEW
        ...
            if not self.training:  # inference
                if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:  # <--NEW
                    self.grid[i] = self._make_grid(nx, ny).to(x[i].device)

                y = x[i].sigmoid()
                xy = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
                wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(bs, self.na, 1, 1, 2)  # wh
                rest = y[..., 4:]
                y_ = torch.cat((xy, wh, rest), -1)
                z.append(y_.view(bs, -1, self.no))

# models/export.py
    model.model[-1].export = not opt.grid  # set Detect() layer grid export
    model.model[-1].dynamic = opt.dynamic  # <--NEW
    for _ in range(2):
        y = model(img)  # dry runs

Test:

# gen onnx
!python models/export.py --img 352 608 --batch 1 --dynamic --grid --simplify --weights weights/best.pt

# onnxruntime
import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession('weights/best.onnx')
input_name = sess.get_inputs()[0].name
output_names = [output.name for output in sess.get_outputs()]
# Vary the height, then the width, in steps of 32 around 608
for i in range(-5, 5):
    image = np.random.rand(1, 3, 608 + 32 * i, 608).astype(np.float32)
    pred = sess.run(output_names, {input_name: image})
    image = np.random.rand(1, 3, 608, 608 + 32 * i).astype(np.float32)
    pred = sess.run(output_names, {input_name: image})
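
The change above rebuilds the grid on every forward pass when dynamic is set, instead of relying on a cached grid. A pure-Python sketch of that logic; `make_grid` and `GridCache` are hypothetical stand-ins for `_make_grid` and the `self.grid[i]` handling, and nested lists replace tensors:

```python
def make_grid(nx, ny):
    """Return grid[y][x] == (x, y): the cell offsets of an ny-by-nx feature map."""
    return [[(x, y) for x in range(nx)] for y in range(ny)]


class GridCache:
    """Keep one grid and rebuild it when forced or when the size changes."""

    def __init__(self, dynamic=False):
        self.dynamic = dynamic
        self.grid = []
        self.rebuilds = 0

    def get(self, nx, ny):
        shape = (len(self.grid), len(self.grid[0]) if self.grid else 0)
        # With dynamic=True the grid is always rebuilt, regardless of the cache.
        if self.dynamic or shape != (ny, nx):
            self.grid = make_grid(nx, ny)
            self.rebuilds += 1
        return self.grid


cache = GridCache(dynamic=True)
g = cache.get(nx=60, ny=34)
print(len(g), len(g[0]), g[2][5])
```

In eager Python the size comparison alone would be enough. The TracerWarning quoted earlier in this thread says that such a comparison is treated as a constant during tracing, which is presumably why an explicit flag is needed for exported models.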

timstokman commented 3 years ago

Yes, that fixes all the issues for me. Outputs seem exactly the same, with and without dynamic, and it works for different image sizes. Guess I can throw away my own numpy implementation of the detect layer. It also fixes the framework version compatibility issues. To me, the implementation seems good. Pull request time?

@glenn-jocher Looks like this fixes the remaining options with "--grid".

jylink commented 3 years ago

PR https://github.com/ultralytics/yolov5/pull/2982

glenn-jocher commented 3 years ago

@antlamon @tommy2is @timstokman good news 😃! Your original issue may now have been fixed ✅ in merged PR #2982 by @jylink. To receive this update you can:

git pull from within your yolov5/ directory
git clone https://github.com/ultralytics/yolov5 again
Force-reload PyTorch Hub: model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
View our updated notebooks:

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ganleiboy commented 3 years ago

anchor_grid

So the bug is in self.anchor_grid, not in self.grid, haha. Nice work!
