Closed antlamon closed 3 years ago
👋 Hello @antlamon, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:
$ pip install -r requirements.txt
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.
@antlamon thanks for the bug report. We don't generally provide support for code customizations and external packages not in requirements.txt.
If an external package is causing an error you may also want to raise an issue with the package authors.
I would like to add that without any modifications to export.py, the --grid option also results in an unusable .onnx file when run on the yolov5s.pt model. It works fine without the --grid option.
While the script was running, the following warning was produced (by TorchScript, not ONNX though):
Starting TorchScript export with torch 1.8.0+cu101...
./models/yolo.py:48: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if self.grid[i].shape[2:4] != x[i].shape[2:4]:
/usr/local/lib/python3.7/dist-packages/torch/jit/_trace.py:940: TracerWarning: Encountering a list at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for `list`, use a `tuple` instead. for `dict`, use a `NamedTuple` instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.
_force_outplace,
TorchScript export success, saved as yolov5s.torchscript.pt
Attempting to run an inference session results in:
Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from yolov5s.onnx failed:Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions
I hit this error when exporting the onnx model using torch==1.8.1 with torchvision==0.9.1. When I export using torch==1.7.1, the onnx model loads fine in both torch==1.7.1 and torch==1.8.1.
Thank you for pointing that out. That indeed was the issue
Will there be a fix, since torchvision==0.8.2 (required by torch 1.7.1) doesn't exist for windows?
I am also getting the same error. Downgrading the torch and torchvision versions didn't fix the issue for me.
When I downgrade the pytorch version and export with --dynamic --grid, I can load the model, but it fails when doing inference on a (1, 3, 1088, 1920) tensor with this:
2021-04-15 20:26:39.079097920 [E:onnxruntime:, sequential_executor.cc:339 Execute] Non-zero status code returned while running Add node. Name:'Add_945' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:487 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 34 by 60
Traceback (most recent call last):
File "test_onnx.py", line 42, in
It does work with just --dynamic. I've seen some approaches where people re-implement the last layer with onnx. I guess that's probably the best approach for now.
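The "Attempting to broadcast an axis by a dimension other than 1. 34 by 60" message above is the runtime enforcing the usual broadcasting rule: aligned from the trailing axis, two dimensions must be equal or one of them must be 1. A minimal stdlib sketch of that rule follows. The layout (bs, na, ny, nx, 2) for the xy slice and the second grid shape are assumptions chosen for illustration, not shapes read from the exported graph.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of a and b, or raise ValueError."""
    result = []
    # Align the shapes on the trailing axis; missing leading axes count as 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"cannot broadcast {da} by {db}")
    return tuple(reversed(result))


# Assumed layout of the xy slice for a 1088x1920 input at stride 32.
xy = (1, 3, 34, 60, 2)

# A grid that matches the feature map broadcasts over the anchor axis.
print(broadcast_shape(xy, (1, 1, 34, 60, 2)))

# Hypothetical grid whose height axis is 60 instead of 34: this fails.
try:
    broadcast_shape(xy, (1, 1, 60, 60, 2))
except ValueError as err:
    print(err)
```

Under these assumptions a square input would pass the check while a 34x60 feature map would not, which is at least consistent with the behaviour reported in this thread.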
@antlamon @Lucashsmello @thestonehead @timstokman we've integrated onnx-simplifier into export.py now in ONNX Simplifier PR #2815 and verified it's passing CI on all operating systems.
I'm not sure if this resolves the original issue, but hopefully it's a step in the right direction.
@glenn-jocher Unfortunately, the problem still persists: I am using the docker image (version v5.0) and --grid causes ONNX export to fail on simplifying. The resulting onnx file cannot be used due to the Incompatible dimensions error. However, rolling back pytorch to 1.8.0 (i.e. using the docker image v4.0 with latest repository version, which includes ONNX simplifier) works OK.
@glenn-jocher I'm seeing the same issues, both --grid and --dynamic don't work with the simplifier. --grid export only seems to work in a few cases, even without the simplifier. I made a pull request for the "--dynamic" export issue: https://github.com/ultralytics/yolov5/pull/2856
@timstokman thanks for the PR, I'll take a look over there!
To give a reproduction of the grid export issue now that the PR is merged:
python models/export.py --simplify --grid
Namespace(batch_size=1, device='cpu', dynamic=False, grid=True, img_size=[640, 640], simplify=True, weights='./yolov5s.pt')
YOLOv5 🚀 v5.0-15-g1df8c6c torch 1.8.1+cu102 CPU
Fusing layers...
Model Summary: 224 layers, 7266973 parameters, 0 gradients, 17.0 GFLOPS
TorchScript: starting export with torch 1.8.1+cu102...
./models/yolo.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if self.grid[i].shape[2:4] != x[i].shape[2:4]:
TorchScript: export success, saved as ./yolov5s.torchscript.pt
ONNX: starting export with onnx 1.9.0...
ONNX: simplifying with onnx-simplifier 0.3.5...
ONNX: simplifier failure: [ONNXRuntimeError] : 1 : FAIL : Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions
ONNX: export success, saved as ./yolov5s.onnx
CoreML: export failure: No module named 'coremltools'
Export complete (4.56s). Visualize with https://github.com/lutzroeder/netron.
Without --simplify the model simply can't be loaded by the runtime.
It looks like the last layer has incompatible dimensions when exported.
@timstokman hmm, so the onnx runtime only succeeds with a --simplify model, but --simplify fails when --grid is also used?
@glenn-jocher They both fail:
1 : FAIL : Load model from yolov5s.onnx failed:Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions.
simplifier failure: [ONNXRuntimeError] : 1 : FAIL : Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions
The root cause seems to be in how the last layer is exported: some sort of tensor dimension mismatch.
@timstokman out of curiosity: what pytorch version are you using? EDIT: I see, 1.8.1, sry for the question
@glenn-jocher I managed to make simplify with grid work by rolling back pytorch to 1.8 (1.9 used in the latest docker image did not work, I don't know what happens if installed on host OS, not in docker)
Perhaps it's ONNX version that causes the issue? In the older yolov5 image (v4.0) it is 1.7.0 AFAIR
@piotlinski I used the latest version, and the one you suggested. With pytorch 1.8 it works with the default options, but as soon as you use --dynamic or --img-size it stops working. With the latest version, it doesn't work at all.
@timstokman interesting, I tried pytorch 1.8 and can set img-size, (did not try dynamic though). I use the older version, where simplifier is always run. (the log says YOLOv5 v4.0, but I manually check out a newer commit)
docker exec -it yolov5 python models/export.py --weights /usr/src/model.pt --img 288 480 --batch-size 1 --grid
Namespace(batch_size=1, device='cpu', dynamic=False, grid=True, img_size=[288, 480], weights='/usr/src/model.pt')
YOLOv5 🚀 v4.0-207-gaff03be torch 1.8.0a0+1606899 CPU
Fusing layers...
Model Summary: 224 layers, 7053910 parameters, 0 gradients, 16.3 GFLOPS
TorchScript: starting export with torch 1.8.0a0+1606899...
/usr/src/app/models/yolo.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if self.grid[i].shape[2:4] != x[i].shape[2:4]:
TorchScript: export success, saved as /usr/src/model.torchscript.pt
ONNX: starting export with onnx 1.7.0...
ONNX: simplifying with onnx-simplifier 0.3.4...
(op_type:Slice, name:Slice_4): Inferred shape and existing shape differ in dimension 2: (288) vs (144)
(op_type:Slice, name:Slice_4): Inferred shape and existing shape differ in dimension 2: (288) vs (144)
(op_type:Slice, name:Slice_4): Inferred shape and existing shape differ in dimension 2: (288) vs (144)
ONNX: export success, saved as /usr/src/model.onnx
CoreML: export failure: No module named 'coremltools'
Export complete (5.41s). Visualize with https://github.com/lutzroeder/netron.
EDIT: with --dynamic I get:
ONNX: simplifying with onnx-simplifier 0.3.4...
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
ONNX: simplifier failure: The shape of input "images" has dynamic size "[0, 3, 0, 0]", please determine the input size manually by "--dynamic-input-shape --input-shape xxx" or "--input-shape xxx". Run "python3 -m onnxsim -h" for details
@piotlinski Update to the latest yolo version to fix the error with --dynamic, and get the actual error.
@timstokman no error with --dynamic and the latest version here, with the same library versions as above.
python models/export.py --weights /usr/src/model.pt --img 288 480 --grid --dynamic --batch-size 1
Namespace(batch_size=1, device='cpu', dynamic=True, grid=True, img_size=[288, 480], simplify=False, weights='/usr/src/model.pt')
YOLOv5 🚀 v5.0-17-gc949fc8 torch 1.8.0a0+1606899 CPU
Fusing layers...
Model Summary: 224 layers, 7053910 parameters, 0 gradients, 16.3 GFLOPS
TorchScript: starting export with torch 1.8.0a0+1606899...
/usr/src/app/models/yolo.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if self.grid[i].shape[2:4] != x[i].shape[2:4]:
TorchScript: export success, saved as /usr/src/model.torchscript.pt
ONNX: starting export with onnx 1.7.0...
ONNX: export success, saved as /usr/src/model.onnx
CoreML: export failure: No module named 'coremltools'
Export complete (3.73s). Visualize with https://github.com/lutzroeder/netron.
When running with --simplify I only get some info messages:
ONNX: starting export with onnx 1.7.0...
ONNX: simplifying with onnx-simplifier 0.3.4...
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
ONNX: export success, saved as /usr/src/model.onnx
Looks like the docker image also has different versions of onnx and onnx-simplifier. Maybe the requirements.txt of the yolo project needs to start pinning a few versions for this to work reliably.
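Before pinning anything, it helps to record exactly which versions each environment has. The following is a small stdlib-only sketch (the helper name `installed_versions` is made up for illustration) that prints the packages discussed in this thread in a requirements.txt-like form, so two environments can be diffed.

```python
from importlib import metadata

# PyPI distribution names of the packages discussed in this thread.
PACKAGES = ("torch", "torchvision", "onnx", "onnx-simplifier", "onnxruntime")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        if version:
            print(f"{name}=={version}")
        else:
            print(f"# {name} is not installed")
```

Running it inside the v4.0 and v5.0 docker images and comparing the output would show which of torch, onnx and onnx-simplifier actually differ.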
@piotlinski Can you actually do inference with the exported model?
@timstokman the ones exported earlier (without the --dynamic flag) work OK. I haven't checked the dynamic models.
I changed this line in models/yolo.py, and now both --grid and --grid --simplify pass. ONNX Runtime works fine too:
# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * torch.tensor(self.anchor_grid[i].tolist()).float() # wh
I found that the cause of the bug is the inconsistent behavior of [i] indexing in PyTorch and ONNX. The shape of anchor_grid is (3,1,3,1,1,2); the shape of anchor_grid[i] is (1,3,1,1,2) in PyTorch, but it is (1,1,3,1,1,2) in ONNX. So we must make the shape of anchor_grid[i] explicit. Just modify this line in models/yolo.py:
# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(bs, self.na, 1, 1, 2) # wh
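The shape bookkeeping behind that `.view(bs, self.na, 1, 1, 2)` can be checked without any framework. This is only a sketch on shape tuples, using the anchor_grid shape quoted in the comment; the axis names (nl, 1, na, 1, 1, 2) are an assumption about what each axis means.

```python
from math import prod


def index_first_axis(shape):
    """Shape left after integer-indexing the first axis, as in tensor[i]."""
    return tuple(shape[1:])


def can_view(shape, new_shape):
    """A view/reshape is only possible when the element counts match."""
    return prod(shape) == prod(new_shape)


anchor_grid_shape = (3, 1, 3, 1, 1, 2)  # assumed to mean (nl, 1, na, 1, 1, 2)
na = 3

indexed = index_first_axis(anchor_grid_shape)
print(indexed)  # eager-mode shape of anchor_grid[i]

print(can_view(indexed, (1, na, 1, 1, 2)))  # bs = 1
print(can_view(indexed, (2, na, 1, 1, 2)))  # bs = 2
```

With bs = 1 the view is a no-op in eager mode and simply pins the shape for the exporter. The element counts suggest it would not be valid for bs > 1, since anchor_grid[i] only holds na * 2 values; the exports in this thread all use batch size 1.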
Btw, the exported onnx cannot be converted to a TensorRT engine because subscript assignments generate unsupported ScatterND nodes. I rewrote the code to avoid generating ScatterND:
# y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i] # xy
# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh
# z.append(y.view(bs, -1, self.no))
xy = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i] # xy
wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(bs, self.na, 1, 1, 2) # wh
rest = y[..., 4:]
yy = torch.cat((xy, wh, rest), -1)
z.append(yy.view(bs, -1, self.no))
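The rewrite above swaps slice assignment for concatenation without changing the numbers. A plain-Python sketch of the same idea on a single prediction row is below; the function names and the sample values (grid cell, stride, anchor size) are hypothetical and only serve to show that both forms produce the same result.

```python
def decode_in_place(y, grid_xy, stride, anchor_wh):
    """Decode one prediction row by assigning into slices of a copy."""
    y = list(y)
    y[0:2] = [(v * 2.0 - 0.5 + g) * stride for v, g in zip(y[0:2], grid_xy)]
    y[2:4] = [(v * 2.0) ** 2 * a for v, a in zip(y[2:4], anchor_wh)]
    return y


def decode_by_concat(y, grid_xy, stride, anchor_wh):
    """Decode the same row by building the parts and concatenating them."""
    xy = [(v * 2.0 - 0.5 + g) * stride for v, g in zip(y[0:2], grid_xy)]
    wh = [(v * 2.0) ** 2 * a for v, a in zip(y[2:4], anchor_wh)]
    rest = list(y[4:])
    return xy + wh + rest


# Hypothetical sigmoid outputs: x, y, w, h, objectness, one class score.
row = [0.5, 0.5, 0.5, 0.5, 0.9, 0.1]
args = dict(grid_xy=(3, 4), stride=32, anchor_wh=(116, 90))

print(decode_in_place(row, **args))
print(decode_by_concat(row, **args))
```

The concatenating form never writes into an existing tensor, which is why the exporter has no reason to emit a scatter-style node for it.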
@jylink Tried your code. Exporting works fine now, but when I use dynamic axes it still seems to fail when running the model:
E onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'Add_455' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:487 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 34 by 60
Here I tried a tensor of 1088x1920x3 as input (stride 32 padded) for an image that was originally 1080x1920x3.
When using a fully padded tensor, 1920x1920x3, the predict layer does seem to work correctly, so this is a big improvement. I suggest you create a pull request for it.
Personally I still can't use --grid exports; dynamic axes give me an almost 2x speed improvement and help with CUDA memory usage.
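The numbers in the error line up with the padded input size: 1080 rounded up to a multiple of 32 is 1088, and 1088x1920 at stride 32 is a 34x60 feature map. A short sketch of that arithmetic follows; the helper names are made up, and the stride set (8, 16, 32) is an assumption about the model's detection layers.

```python
import math

STRIDES = (8, 16, 32)  # assumed detection strides


def pad_to_stride(size, stride=32):
    """Round size up to the next multiple of stride."""
    return math.ceil(size / stride) * stride


def grid_sizes(height, width, strides=STRIDES):
    """Feature-map (ny, nx) at each stride for a padded input."""
    return [(height // s, width // s) for s in strides]


h, w = pad_to_stride(1080), pad_to_stride(1920)
print(h, w)              # 1088 1920
print(grid_sizes(h, w))  # [(136, 240), (68, 120), (34, 60)]
```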
@timstokman Hi, I found that self.grid[i] does not match the dynamic y[..., 0:2]. I don't know if it is the best way, but I added a variable self.dynamic, and now --dynamic, --grid, --simplify, onnxruntime and the TensorRT engine all pass:
# models/yolo.py
class Detect(nn.Module):
    stride = None  # strides computed during build
    export = False  # onnx export
    dynamic = False  # <--NEW
    ...
            if not self.training:  # inference
                if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:  # <--NEW
                    self.grid[i] = self._make_grid(nx, ny).to(x[i].device)
                y = x[i].sigmoid()
                xy = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
                wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(bs, self.na, 1, 1, 2)  # wh
                rest = y[..., 4:]
                y_ = torch.cat((xy, wh, rest), -1)
                z.append(y_.view(bs, -1, self.no))
# models/export.py
model.model[-1].export = not opt.grid  # set Detect() layer grid export
model.model[-1].dynamic = opt.dynamic  # <--NEW
for _ in range(2):
    y = model(img)  # dry runs
Test:
# gen onnx
!python models/export.py --img 352 608 --batch 1 --dynamic --grid --simplify --weights weights/best.pt
# onnxruntime
import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession('weights/best.onnx')
input_name = sess.get_inputs()[0].name
output_names = [output.name for output in sess.get_outputs()]
for i in range(-5, 5):
    # vary the height, then the width, in steps of 32
    img = np.random.rand(1, 3, 608 + 32 * i, 608).astype(np.float32)
    pred = sess.run(output_names, {input_name: img})
    img = np.random.rand(1, 3, 608, 608 + 32 * i).astype(np.float32)
    pred = sess.run(output_names, {input_name: img})
Yes, that fixes all the issues for me. Outputs seem exactly the same, with and without dynamic, and it works for different image sizes. Guess I can throw away my own numpy implementation of the detect layer. It also fixes the framework version compatibility issues. To me, the implementation seems good. Pull request time?
@glenn-jocher Looks like this fixes the remaining options with "--grid".
@antlamon @tommy2is @timstokman good news 😃! Your original issue may now be fixed ✅ in merged PR #2982 by @jylink. To receive this update you can:
- git pull from within your yolov5/ directory
- git clone https://github.com/ultralytics/yolov5 again
- run model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True) to force-reload from PyTorch Hub
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
anchor_grid
so the bug is in self.anchor_grid, not in self.grid
hhhha. nice work!
🐛 Bug
A model exported to ONNX with the --grid parameter cannot be used by onnx-runtime or simplified by onnx-simplifier. A Mul node triggers a shape inference error: Incompatible dimensions.
To Reproduce
Replace ONNX export in export.py with this code and run with command
python3 models/export.py --grid
Output:
Expected behavior
Any yolov5 model exported as ONNX should be valid