Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"
I am able to export the MiDaS model with the DPT-Large backbone to ONNX using `torch.onnx.export`, almost without modifying the model at all. The model constructor looks like:

```python
model = DPTDepthModel(path='dpt_large-midas-2f21e586.pt', backbone='vitl16_384', non_negative=True)
```
The one issue I have found comes from this code in `vit.py`:

```python
unflatten = nn.Sequential(
    nn.Unflatten(
        2,
        torch.Size(
            [
                h // pretrained.model.patch_size[1],
                w // pretrained.model.patch_size[0],
            ]
        ),
    )
)
```
This gives me the following error when attempting to export to ONNX, raised from `flatten.py` in PyTorch (I'm running torch 1.11.0):

```
TypeError: unflattened_size must be tuple of ints, but found element of type Tensor at pos 0
```
I don't want to cast `h` and `w` to `int`, because then the dimensionality would be hard-coded into the model, and I would like the ONNX model to support dynamic height and width.
So, why not do something like this instead?

```python
unflatten = lambda layer: layer.view((
    b,
    layer.shape[1],
    h // pretrained.model.patch_size[1],
    w // pretrained.model.patch_size[0]
))
```
This seems functionally equivalent, if not as elegant, since you also have to pass `b` and each layer's channel count into the function.
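The claimed equivalence can be checked in isolation. The sketch below uses made-up sizes (batch 2, embedding dim 1024, 384x384 input, patch size 16, i.e. ViT-L/16-style numbers) and confirms that a `view` with runtime shapes produces the same tensor as the static `nn.Unflatten`:

```python
import torch
import torch.nn as nn

# Assumed sizes standing in for the real model's values:
b, c = 2, 1024            # batch and embedding dim
h, w = 384, 384           # input height/width
patch = 16                # ViT-L/16 patch size

tokens = torch.randn(b, c, (h // patch) * (w // patch))

# Original approach: nn.Unflatten, which requires static ints.
unflatten = nn.Sequential(nn.Unflatten(2, torch.Size([h // patch, w // patch])))
out_static = unflatten(tokens)

# Proposed approach: view with shapes computed at runtime (traceable for ONNX).
out_dynamic = tokens.view(b, tokens.shape[1], h // patch, w // patch)

assert torch.equal(out_static, out_dynamic)
print(out_dynamic.shape)  # torch.Size([2, 1024, 24, 24])
```

Because `view` accepts tensor-valued sizes during tracing, the exported graph keeps `h` and `w` symbolic instead of baking in the values seen at export time.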