pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Export torchvision detection model retinanet_resnet50_fpn #108805

Open ezyang opened 1 year ago

ezyang commented 1 year ago

🐛 Describe the bug

This model export was requested by a user at https://github.com/pytorch/pytorch/issues/108337 . It is fairly similar to maskrcnn, which is one of the priority models for internal export. While investigating the user report I hacked up a bunch of stuff in torchvision that isn't easily landable, so I want to durably record it here.

Repro script:

import torch
import torchvision

model = torchvision.models.detection.retinanet_resnet50_fpn(
        weights=torchvision.models.detection.RetinaNet_ResNet50_FPN_Weights.DEFAULT)
model = torch.compile(model, mode="default")

torch.save(model.state_dict(), "retina.pt")
print("model.state_dict:",model.state_dict().keys())

x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
model.eval()
exported_model = torch._dynamo.export(model, x)
torch.save(exported_model, "retina_retina.pt")

Tested on 621463a3e6b488b2bff04e355a1abd9a4c5bb2cd

Last time I did this for maskrcnn: https://docs.google.com/document/d/159NTQQhz8ovIBxbQvGQ-fZ10pF9e2RPXm1JZYqdEzt4/edit#heading=h.jw6vkqei769s

f-string problem. Same as in maskrcnn. Dynamo still chokes on f-string messages (a known bug for quite a long time, https://github.com/pytorch/pytorch/issues/103602 ).

diff --git a/torchvision/models/detection/retinanet.py b/torchvision/models/detection/retinanet.py
index 3a9cf80d1d..3fe67f6440 100644
--- a/torchvision/models/detection/retinanet.py
+++ b/torchvision/models/detection/retinanet.py
@@ -596,10 +596,7 @@ class RetinaNet(nn.Module):
         original_image_sizes: List[Tuple[int, int]] = []
         for img in images:
             val = img.shape[-2:]
-            torch._assert(
-                len(val) == 2,
-                f"expecting the last two dimensions of the Tensor to be H and W instead got {img.shape[-2:]}",
-            )
+            assert len(val) == 2
             original_image_sizes.append((val[0], val[1]))

         # transform the input

Scale factor tire fire. Same as in maskrcnn. I had actually half fixed this, but there was one last bit I forgot to do last time: https://github.com/pytorch/vision/pull/7942

Data-dependent orig_kval.

diff --git a/torchvision/models/detection/_utils.py b/torchvision/models/detection/_utils.py
index 559db858ac..bb13d8b72c 100644
--- a/torchvision/models/detection/_utils.py
+++ b/torchvision/models/detection/_utils.py
@@ -506,7 +506,7 @@ def _topk_min(input: Tensor, orig_kval: int, axis: int) -> int:
         min_kval (int): Appropriately selected k-value.
     """
     if not torch.jit.is_tracing():
-        return min(orig_kval, input.size(axis))
+        return torch.sym_min(orig_kval, input.size(axis))
     axis_dim_val = torch._shape_as_tensor(input)[axis].unsqueeze(0)
     min_kval = torch.min(torch.cat((torch.tensor([orig_kval], dtype=axis_dim_val.dtype), axis_dim_val), 0))
     return _fake_cast_onnx(min_kval)

orig_kval is unbacked, so we cannot do a stock min on it: min would attempt to test whether orig_kval < input.size(axis), which we cannot know at trace time. Dynamo ought to be able to translate this to sym_min automatically.
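To make the failure mode concrete, here is a toy, pure-Python model of it. This is not PyTorch's SymInt machinery; `ToyUnbacked`, `ToyComparison` and `toy_sym_min` are hypothetical names made up for illustration. The builtin min has to turn a comparison into a bool, which a value with no concrete hint cannot answer, whereas a symbolic min only records the expression.

```python
class ToyComparison:
    """Result of comparing a toy unbacked value; it has no truth value."""

    def __init__(self, expr):
        self.expr = expr

    def __bool__(self):
        raise RuntimeError(f"cannot decide {self.expr} without real data")


class ToyUnbacked:
    """Toy stand-in for an unbacked symbolic int (value unknown while tracing)."""

    def __init__(self, name):
        self.name = name

    def __lt__(self, other):
        return ToyComparison(f"{self.name} < {other}")

    def __gt__(self, other):
        return ToyComparison(f"{self.name} > {other}")


def _label(value):
    # Render either a toy symbol or a plain int for the recorded expression.
    return value.name if isinstance(value, ToyUnbacked) else str(value)


def toy_sym_min(a, b):
    """Record Min(a, b) as a new symbol instead of branching on a comparison."""
    return ToyUnbacked(f"Min({_label(a)}, {_label(b)})")
```

Calling the builtin `min(1000, ToyUnbacked("u0"))` raises inside `__bool__`, while `toy_sym_min(1000, ToyUnbacked("u0"))` returns a new symbol that merely remembers the expression. The real `torch.sym_min` plays the second role for actual symbolic ints.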

NN module setattr. Same as in maskrcnn.

diff --git a/torch/_dynamo/symbolic_convert.py b/torch/_dynamo/symbolic_convert.py
index 9db4fb24bb5..56b5c5e3be0 100644
--- a/torch/_dynamo/symbolic_convert.py
+++ b/torch/_dynamo/symbolic_convert.py
@@ -1201,13 +1201,6 @@ class InstructionTranslatorBase(Checkpointable[InstructionTranslatorGraphState])
         prior = self.copy_graphstate()
         val, obj = self.popn(2)

-        if isinstance(obj, NNModuleVariable):
-            # We don't allow side effects during export
-            # https://github.com/pytorch/torchdynamo/issues/1475
-            assert (
-                not self.export
-            ), f"Mutating module attribute {inst.argval} during export."
-
         try:
             self.output.guards.update(
                 BuiltinVariable(setattr)

Batched nms threshold trick. We can either torch.cond this or, taking a page from the torchvision._is_tracing test here, just always do the coordinate trick.

diff --git a/torchvision/ops/boxes.py b/torchvision/ops/boxes.py
index a541f8d880..b60f6e6254 100644
--- a/torchvision/ops/boxes.py
+++ b/torchvision/ops/boxes.py
@@ -69,7 +69,7 @@ def batched_nms(
         _log_api_usage_once(batched_nms)
     # Benchmarks that drove the following thresholds are at
     # https://github.com/pytorch/vision/issues/1311#issuecomment-781329339
-    if boxes.numel() > (4000 if boxes.device.type == "cpu" else 20000) and not torchvision._is_tracing():
+    if False and boxes.numel() > (4000 if boxes.device.type == "cpu" else 20000) and not torchvision._is_tracing():
         return _batched_nms_vanilla(boxes, scores, idxs, iou_threshold)
     else:
         return _batched_nms_coordinate_trick(boxes, scores, idxs, iou_threshold)
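For readers unfamiliar with the coordinate trick, the following is a minimal pure-Python sketch of the idea, using tuples in place of tensors. It is not torchvision's code: `iou`, `nms` and this `batched_nms_coordinate_trick` are simplified stand-ins written for illustration, and the sketch assumes non-negative box coordinates. Each class is shifted into its own disjoint region, so a single NMS pass can never suppress a box using a box from a different class, and no branch on `boxes.numel()` is needed.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Plain greedy NMS; returns kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def batched_nms_coordinate_trick(boxes, scores, idxs, iou_threshold):
    """Run one NMS over all classes by shifting each class into its own region."""
    if not boxes:
        return []
    max_coordinate = max(max(b) for b in boxes)
    shifted = []
    for box, idx in zip(boxes, idxs):
        # Boxes of class idx land in a region no other class can reach.
        offset = idx * (max_coordinate + 1)
        shifted.append(tuple(c + offset for c in box))
    return nms(shifted, scores, iou_threshold)
```

With two identical boxes that carry different labels, plain `nms` keeps only the higher-scoring one, while the shifted version keeps both, which is the per-class behaviour batched NMS is meant to have.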

nms support. I could have sworn that I meta-fied this, but apparently not. NMS is data dependent, so it needs @zou3519's impl_abstract for out-of-tree registration.

torch._dynamo.exc.Unsupported: unsupported operator: torchvision.nms.default (see https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit#heading=h.64r4npvq0w0 for how to fix)

Versions

main

cc @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519

ezyang commented 1 year ago

cc @nxdong

ezyang commented 1 year ago

After nms https://github.com/pytorch/vision/pull/7944 there is:

Slice with unbacked SymInt. Workaround could be:

diff --git a/torchvision/models/detection/retinanet.py b/torchvision/models/detection/retinanet.py
index 3a9cf80d1d..07aa5179cd 100644
--- a/torchvision/models/detection/retinanet.py
+++ b/torchvision/models/detection/retinanet.py
@@ -554,7 +554,7 @@ class RetinaNet(nn.Module):

             # non-maximum suppression
             keep = box_ops.batched_nms(image_boxes, image_scores, image_labels, self.nms_thresh)
-            keep = keep[: self.detections_per_img]
+            #keep = keep[: self.detections_per_img]

             detections.append(
                 {

The problem is that we do not know statically whether detections_per_img is in bounds or needs to be clamped. I'm guessing the user can probably tell us it's guaranteed to be in bounds.
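The clamping in question can be seen with ordinary Python lists standing in for the `keep` tensor. This is only a sketch of the slicing semantics; `sliced_length` is a hypothetical helper and the value 300 is illustrative rather than taken from this issue. The length of `keep[:k]` is `min(len(keep), k)`, and `len(keep)` is the data-dependent number of boxes that survive NMS, so the output size hides a min over an unknown.

```python
def sliced_length(n, k):
    """Length of seq[:k] for a sequence of length n and a non-negative k."""
    return min(n, k)


detections_per_img = 300  # illustrative value, not taken from the issue

# n plays the role of the unknown number of boxes surviving NMS.
for n in (0, 5, 300, 1000):
    keep = list(range(n))
    assert len(keep[:detections_per_img]) == sliced_length(n, detections_per_img)
```

If one side of that min is promised up front, the result no longer depends on a comparison the tracer cannot evaluate. Dropping the slice, as in the workaround above, is only equivalent when NMS never returns more than detections_per_img boxes.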

Unspec NN module. Looks like:

Traceback (most recent call last):
  File "/data/users/ezyang/b/pytorch/wn.py", line 13, in <module>
    exported_model = torch._dynamo.export(model, x)
  File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 1259, in export
    return inner(*extra_args, **extra_kwargs)
  File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 1233, in inner
    graph = rewrite_signature(
  File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 865, in rewrite_signature
    matched_input_elements_positions = produce_matching(
  File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 858, in produce_matching
    raise AssertionError(
AssertionError: graph-captured input #2 (<class 'torch.Tensor'>) is not among original args (<class 'torch.Tensor'>, <class 'torch.Tensor'>)

This is because of the cell_anchors write:

[2023-09-07 19:02:58,035] [0/0] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_ATTR cell_anchors [ListVariable(), NNModuleVariable()]
[2023-09-07 19:02:58,035] [0/0] torch._dynamo.symbolic_convert: [DEBUG] FAILED INLINING <code object set_cell_anchors at 0x7f3ccf8182f0, file "/data/users/ezyang/b/torchvision/torchvision/models/detection/anchor_utils.py", line 76>
[2023-09-07 19:02:58,037] [0/0] torch._dynamo.output_graph: [DEBUG] restore_graphstate: removed 10 nodes
[2023-09-07 19:02:58,037] [0/0] torch._dynamo.symbolic_convert: [DEBUG] FAILED INLINING <code object forward at 0x7f3ccf818870, file "/data/users/ezyang/b/torchvision/torchvision/models/detection/anchor_utils.py", line 115>
[2023-09-07 19:02:58,039] [0/0] torch._dynamo.output_graph: [DEBUG] restore_graphstate: removed 60 nodes
[2023-09-07 19:02:58,039] [0/0] torch._dynamo.symbolic_convert: [DEBUG] FAILED INLINING <code object _call_impl at 0x7f3d105a5210, file "/data/users/ezyang/b/pytorch/torch/nn/modules/module.py", line 1520>
[2023-09-07 19:02:58,039] [0/0] torch._dynamo.output_graph: [DEBUG] restore_graphstate: removed 0 nodes
[2023-09-07 19:02:58,041] [0/0] torch._dynamo.convert_frame: [INFO] Restarting analysis due to _dynamo/variables/nn_module.py:138 in convert_to_unspecialized

When we restart with the module as unspecialized, this causes the anchor module's parameters to become inputs:

[2023-09-07 19:03:13,095] [0/0] torch._dynamo.output_graph: [DEBUG] create_graph_input L_self_anchor_generator_cell_anchors_0_ L['self'].anchor_generator.cell_anchors[0]
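The shape of the cell_anchors problem can be sketched without PyTorch. The two classes below are hypothetical stand-ins, not torchvision's AnchorGenerator, and the `convert` argument (for example `float`) is an assumed placeholder for the dtype/device conversion. The first rewrites an attribute on `self` during forward, which corresponds to the STORE_ATTR in the log above; the second produces the same values in a local and leaves `self` untouched.

```python
class MutatingAnchors:
    """Hypothetical module-like object that writes to self during forward."""

    def __init__(self, cell_anchors):
        self.cell_anchors = cell_anchors

    def forward(self, convert):
        # Attribute write during forward: the kind of side effect export rejects.
        self.cell_anchors = [convert(a) for a in self.cell_anchors]
        return self.cell_anchors


class PureAnchors:
    """Same computation, but forward never modifies self."""

    def __init__(self, cell_anchors):
        self.cell_anchors = cell_anchors

    def forward(self, convert):
        # Converted values live in a local list; the stored attribute is unchanged.
        return [convert(a) for a in self.cell_anchors]
```

Both return the same converted values; only the first leaves the object in a different state after the call, which is what a tracer has to either model or refuse.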

ezyang commented 8 months ago

This still does not work, still choking on AssertionError: Mutating module attribute cell_anchors during export.

andylee-24 commented 3 months ago

This still does not work, still choking on AssertionError: Mutating module attribute cell_anchors during export.

I got the same error while converting the model using the ai_edge_torch library.