pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Shape Error when training HF deberta-base with Inductor #96456

Closed: Lokiiiiii closed this issue 9 months ago

Lokiiiiii commented 1 year ago

๐Ÿ› Describe the bug

When using HuggingFace's Trainer API, I noticed that PyTorch eager mode succeeds as expected, but Inductor fails with a shape-mismatch error:

ValueError: Cannot view a tensor with shape torch.Size([1, 256, 12, 64]) and strides (196608, 64, 16384, 1) as a tensor with shape (1, 256, 768)!

This only happens with the deberta-base model.
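
For context, a minimal sketch of the failure class (plain view/stride behavior, not the deberta-specific root cause): view() can only reinterpret a tensor whose strides allow the requested shape, so the same data succeeds or fails depending on its layout.

import torch

# Contiguous at (1, 256, 12, 64): merging the last two dims into 768 works.
a = torch.rand(1, 256, 12, 64)
a.view(1, 256, 768)

# Same shape with transposed strides (196608, 64, 16384, 1), as in the error:
# the last two dims are no longer adjacent in memory, so view() refuses.
b = torch.rand(1, 12, 256, 64).transpose(1, 2)
try:
    b.view(1, 256, 768)
except RuntimeError:
    b.reshape(1, 256, 768)  # reshape() copies when necessary and succeeds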

Error logs

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/_dynamo/output_graph.py", line 670, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.fake_example_inputs())
  File "/opt/conda/lib/python3.8/site-packages/torch/_dynamo/debug_utils.py", line 1055, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/__init__.py", line 1390, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/opt/conda/lib/python3.8/site-packages/torch/_inductor/compile_fx.py", line 455, in compile_fx
    return aot_autograd(
  File "/opt/conda/lib/python3.8/site-packages/torch/_dynamo/backends/common.py", line 48, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 2805, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/opt/conda/lib/python3.8/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
    r = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 2498, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config)
  File "/opt/conda/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 1713, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config)
  File "/opt/conda/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 2087, in aot_dispatch_autograd
    fx_g = make_fx(joint_forward_backward, aot_config.decompositions)(
  File "/opt/conda/lib/python3.8/site-packages/torch/fx/experimental/proxy_tensor.py", line 714, in wrapped
    t = dispatch_trace(wrap_key(func, args, fx_tracer), tracer=fx_tracer, concre...
  File "/opt/conda/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/fx/experimental/proxy_tensor.py", line 443, in dispatch_trace
    graph = tracer.trace(root, concrete_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 778, in trace
    (self.create_arg(fn(*args)),),
  File "/opt/conda/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 652, in flatten_fn
    tree_out = root_fn(*tree_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/fx/experimental/proxy_tensor.py", line 459, in wrapped
    out = f(*tensors)
  File "/opt/conda/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 1156, in traced_joint
    return functionalized_f_helper(primals, tangents)
  File "/opt/conda/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 1108, in functionalized_f_helper
    f_outs = flat_fn_no_input_mutations(fn, f_primals, f_tangents, meta, keep_in...
  File "/opt/conda/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 1076, in flat_fn_no_input_mutations
    outs = flat_fn_with_synthetic_bases_expanded(fn, primals, primals_after_cloning, may...
  File "/opt/conda/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 1048, in flat_fn_with_synthetic_bases_expanded
    outs = forward_or_joint(fn, primals_before_cloning, primals, maybe_tangents, meta, k...
  File "/opt/conda/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 1017, in forward_or_joint
    backward_out = torch.autograd.grad(
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 269, in grad
    return handle_torch_function(
  File "/opt/conda/lib/python3.8/site-packages/torch/overrides.py", line 1534, in handle_torch_function
    result = mode.__torch_function__(public_api, types, args, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_inductor/overrides.py", line 38, in __torch_function__
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 303, in grad
    return Variable._execution_engine.run_backward(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/fx/experimental/proxy_tensor.py", line 487, in __torch_dispatch__
    return self.inner_torch_dispatch(func, types, args, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/fx/experimental/proxy_tensor.py", line 512, in inner_torch_dispatch
    out = proxy_call(self, func, args, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/fx/experimental/proxy_tensor.py", line 345, in proxy_call
    out = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_ops.py", line 284, in __call__
    return self._op(*args, **kwargs or {})
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_subclasses/fake_tensor.py", line 987, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_subclasses/fake_tensor.py", line 1170, in dispatch
    r = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_ops.py", line 284, in __call__
    return self._op(*args, **kwargs or {})
  File "/opt/conda/lib/python3.8/site-packages/torch/_refs/__init__.py", line 3988, in view
    return _reshape_view_helper(a, *shape, allow_copy=False)
  File "/opt/conda/lib/python3.8/site-packages/torch/_refs/__init__.py", line 3237, in _reshape_view_helper
    raise ValueError(msg)
ValueError: Cannot view a tensor with shape torch.Size([1, 256, 12, 64]) and strides (196608, 64, 16384, 1) as a tensor with shape (1, 256, 768)!

Minified repro

The minifier was unable to repro the error.

pip3 install numpy --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu117
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .

cd examples/pytorch/language-modeling
pip install -r requirements.txt
WANDB_DISABLED=true python run_mlm.py --model_name_or_path microsoft/deberta-base --output_dir . --fp16 --dataloader_drop_last --dataset_config_name wikitext-2-raw-v1 --dataset_name wikitext --do_train --evaluation_strategy no --logging_strategy epoch --max_seq_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 128 --save_strategy no --torch_compile_backend inductor

Versions

Collecting environment information...
PyTorch version: 2.0.0a0+git9cfa076
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.25.2
Libc version: glibc-2.31

Python version: 3.8.16 | packaged by conda-forge | (default, Feb  1 2023, 16:01:55)  [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1028-aws-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G
GPU 4: NVIDIA A10G
GPU 5: NVIDIA A10G
GPU 6: NVIDIA A10G

Nvidia driver version: 515.65.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          192
On-line CPU(s) list:             0-191
Thread(s) per core:              2
Core(s) per socket:              48
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7R32
Stepping:                        0
CPU MHz:                         2799.534
BogoMIPS:                        5599.06
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       3 MiB
L1i cache:                       3 MiB
L2 cache:                        48 MiB
L3 cache:                        384 MiB
NUMA node0 CPU(s):               0-47,96-143
NUMA node1 CPU(s):               48-95,144-191
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] clip-anytorch==2.5.2
[pip3] CoCa-pytorch==0.0.7
[pip3] dalle2-pytorch==1.10.5
[pip3] ema-pytorch==0.2.1
[pip3] functorch==1.14.0a0+408bcf1
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.5
[pip3] pytorch-transformers==1.2.0
[pip3] pytorch-warmup==0.1.1
[pip3] rotary-embedding-torch==0.2.1
[pip3] sagemaker-pytorch-training==2.7.0
[pip3] torch==2.0.0a0+git9cfa076
[pip3] torch-fidelity==0.3.0
[pip3] torch-struct==0.5
[pip3] torchaudio==2.0.0a0+b96a7eb
[pip3] torchdata==0.5.1+a246b31
[pip3] torchmetrics==0.11.3
[pip3] torchrec-nightly==2023.3.6
[pip3] torchtext==0.14.0a0+5b78d07
[pip3] torchvision==0.14.1a0+b69fce3
[pip3] vector-quantize-pytorch==1.1.1
[conda] clip-anytorch             2.5.2                    pypi_0    pypi
[conda] coca-pytorch              0.0.7                    pypi_0    pypi
[conda] dalle2-pytorch            1.10.5                   pypi_0    pypi
[conda] ema-pytorch               0.2.1                    pypi_0    pypi
[conda] functorch                 1.14.0a0+408bcf1          pypi_0    pypi
[conda] magma-cuda117             2.6.1                         1    pytorch
[conda] mkl                       2022.2.1         h84fe81f_16997    conda-forge
[conda] mkl-include               2023.0.0         h84fe81f_26648    conda-forge
[conda] numpy                     1.21.2                   pypi_0    pypi
[conda] pytorch                   1.13.1          cpu_py38hbac4b8a_1    conda-forge
[conda] pytorch-transformers      1.2.0                    pypi_0    pypi
[conda] pytorch-warmup            0.1.1                    pypi_0    pypi
[conda] rotary-embedding-torch    0.2.1                    pypi_0    pypi
[conda] sagemaker-pytorch-training 2.7.0                    pypi_0    pypi
[conda] torch                     2.0.0a0+git9cfa076          pypi_0    pypi
[conda] torch-fidelity            0.3.0                    pypi_0    pypi
[conda] torch-struct              0.5                      pypi_0    pypi
[conda] torchaudio                2.0.0a0+b96a7eb          pypi_0    pypi
[conda] torchdata                 0.5.1            py38h60d003c_1    conda-forge
[conda] torchmetrics              0.11.3                   pypi_0    pypi
[conda] torchrec-nightly          2023.3.6                 pypi_0    pypi
[conda] torchtext                 0.14.0a0+5b78d07          pypi_0    pypi
[conda] torchvision               0.15.0a0+0bdd01a          pypi_0    pypi
[conda] vector-quantize-pytorch   1.1.1                    pypi_0    pypi

cc @ezyang @eellison @bdhirsh @msaroufim @wconstab @anijain2305 @zou3519 @ngimel @soumith

ezyang commented 1 year ago

Looks like a stride propagation error.

cc @dagitses for stride agnostic pytorch
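
As a rough way to see what "stride propagation" means here, one can compare the strides an op produces in eager with the ones its fake-tensor/meta counterpart propagates (the helper below is hypothetical, purely for illustration); aten.view is stride-sensitive, so any divergence between the two shows up as errors like the one above.

import torch
from torch._subclasses import FakeTensorMode

def eager_vs_fake_strides(op, *args):
    # Run `op` on real tensors and on FakeTensor copies of the same inputs,
    # then report the output strides each path computed.
    real_out = op(*args)
    with FakeTensorMode() as mode:
        fake_args = [mode.from_tensor(a) for a in args]
        fake_out = op(*fake_args)
    print("eager strides:", real_out.stride())
    print("fake strides :", fake_out.stride())

x = torch.rand(1, 12, 256, 64)
eager_vs_fake_strides(lambda t: t.transpose(1, 2).contiguous(), x)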

davidberard98 commented 1 year ago

Managed to get this more minimal repro; I haven't looked at it much yet. (Note: if you're trying to repro the original transformers issue, you need to run with a single GPU, or else you'll run into some other FakeTensor issue.)

import torch

x = torch.rand((1, 12, 256*64), requires_grad=True)

def transpose_for_scores(x):
    new_x_shape = x.size()[:-1] + (256, -1)
    x = x.view(new_x_shape)
    return x.permute(0, 2, 1, 3)

def fn(x):
    scale_factor = 0.5
    x = x.relu()
    x = transpose_for_scores(x)
    x /= torch.sqrt(torch.tensor(x.size(-1), dtype=torch.float) * scale_factor)
    return x.transpose(-1, -2)

fn(x)
torch.compile(fn)(x)

eellison commented 1 year ago

Hmm, neither CrossRefFakeMode nor DebugInterpreter catches this.
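
For reference, a sketch of what running a snippet under CrossRefFakeMode could look like (assuming it is importable from torch._subclasses and usable directly as a context manager); it cross-checks each eager op's output metadata, including strides, against the fake-tensor implementation as the ops execute.

import torch
from torch._subclasses import CrossRefFakeMode

x = torch.rand(1, 12, 256 * 64, requires_grad=True)

# Mirror the repro's shape manipulations (without the in-place division) and
# let CrossRefFakeMode compare eager vs. fake metadata op by op.
with CrossRefFakeMode():
    y = x.relu().view(1, 12, 256, 64).permute(0, 2, 1, 3)
    y.transpose(-1, -2).sum()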

anijain2305 commented 1 year ago

Even aot_eager fails here.

import torch

x = torch.rand((1, 12, 256*64), requires_grad=True)

def transpose_for_scores(x):
    new_x_shape = x.size()[:-1] + (256, -1)
    x = x.view(new_x_shape)
    return x.permute(0, 2, 1, 3)

def fn(x):
    scale_factor = 0.5
    x = x.relu()
    x = transpose_for_scores(x)
    x /= torch.sqrt(torch.tensor(x.size(-1), dtype=torch.float) * scale_factor)
    return x.transpose(-1, -2)

fn(x)
torch.compile(fn, backend="aot_eager")(x)

anijain2305 commented 1 year ago

cc @ezyang @bdhirsh to advise.

ngimel commented 1 year ago

The minimal repros throw a different error ("one of the variables needed for gradient computation has been modified by an inplace operation"). The original view error is probably due to the copy_ decomposition producing wrong strides; @bdhirsh has a fix for this that is blocked by cpp codegen in fbcode.
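
As a rough illustration of how a decomposition can change strides (hypothetical, not Inductor's actual copy_ lowering): if a result is re-materialized contiguously, a downstream view() sees a different layout than it would have in eager.

import torch

# Shape (1, 256, 12, 64) with transposed strides (196608, 64, 16384, 1),
# the layout from the error message.
src = torch.rand(1, 12, 256, 64).transpose(1, 2)

# An eager copy_ into a destination with matching strides keeps that layout...
kept = torch.empty_strided(src.shape, src.stride()).copy_(src)
# ...while re-materializing the data contiguously does not.
rematerialized = src.contiguous()

print(kept.stride())            # (196608, 64, 16384, 1)
print(rematerialized.stride())  # (196608, 768, 64, 1)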

bdhirsh commented 1 year ago

That looks like something that should be fixed by this PR: https://github.com/pytorch/pytorch/issues/96456#issuecomment-1562284376. I can't test it at the moment (my allocation was nuked), but I can try to confirm later.

bdhirsh commented 1 year ago

Unfortunately, even with the copy() decomp fix in Inductor, the repro now gives this error for me:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 12, 16384]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead.

bdhirsh commented 1 year ago

Actually, I realized that the small repro above is broken (that error also shows up if you run in eager, and actually call .backward()).

bdhirsh commented 1 year ago

I tried running the HuggingFace repro. On my 40GB machine I get an OOM; it would be great if someone could patch this PR locally and try to repro! https://github.com/pytorch/pytorch/issues/96456.

ezyang commented 1 year ago

@davidberard98's repro still fails for me in AOTAutograd https://github.com/pytorch/pytorch/issues/96456#issuecomment-1467355129

  File "/data/users/ezyang/b/pytorch/torch/fx/experimental/proxy_tensor.py", line 532, in __torch_dispatch__
    return self.inner_torch_dispatch(func, types, args, kwargs)
  File "/data/users/ezyang/b/pytorch/torch/fx/experimental/proxy_tensor.py", line 557, in inner_torch_dispatch
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
  File "/data/users/ezyang/b/pytorch/torch/fx/experimental/proxy_tensor.py", line 367, in proxy_call
    out = func(*args, **kwargs)
  File "/data/users/ezyang/b/pytorch/torch/_ops.py", line 429, in __call__
    return self._op(*args, **kwargs or {})
  File "/data/users/ezyang/b/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/ezyang/b/pytorch/torch/_subclasses/fake_tensor.py", line 1160, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/ezyang/b/pytorch/torch/_subclasses/fake_tensor.py", line 1404, in dispatch
    r = func(*args, **kwargs)
  File "/data/users/ezyang/b/pytorch/torch/_ops.py", line 429, in __call__
    return self._op(*args, **kwargs or {})
  File "/data/users/ezyang/b/pytorch/torch/_refs/__init__.py", line 4138, in view
    return _reshape_view_helper(a, *shape, allow_copy=False)
  File "/data/users/ezyang/b/pytorch/torch/_refs/__init__.py", line 3352, in _reshape_view_helper
    raise ValueError(msg)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
ValueError: Cannot view a tensor with shape torch.Size([1, 12, 256, 64]) and strides (196608, 64, 768, 1) as a tensor with shape (1, 12, 16384)!

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

williamwen42 commented 9 months ago

I get a different error when I try to run @davidberard98's repro today:

/data/users/williamwen/pytorch/torch/autograd/__init__.py:411: UserWarning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
  File "/data/users/williamwen/pytorch/playground5.py", line 12, in fn
    x = x.relu()
 (Triggered internally at /data/users/williamwen/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:113.)
  result = Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/data/users/williamwen/pytorch/playground5.py", line 18, in <module>
    torch.compile(fn)(x)
  File "/data/users/williamwen/pytorch/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/data/users/williamwen/pytorch/torch/_dynamo/eval_frame.py", line 655, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/data/users/williamwen/pytorch/torch/_dynamo/convert_frame.py", line 721, in _convert_frame
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/data/users/williamwen/pytorch/torch/_dynamo/convert_frame.py", line 383, in _convert_frame_assert
    compiled_product = _compile(
  File "/data/users/williamwen/pytorch/torch/_dynamo/convert_frame.py", line 645, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/data/users/williamwen/pytorch/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/williamwen/pytorch/torch/_dynamo/convert_frame.py", line 562, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/data/users/williamwen/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "/data/users/williamwen/pytorch/torch/_dynamo/convert_frame.py", line 151, in _fn
    return fn(*args, **kwargs)
  File "/data/users/williamwen/pytorch/torch/_dynamo/convert_frame.py", line 527, in transform
    tracer.run()
  File "/data/users/williamwen/pytorch/torch/_dynamo/symbolic_convert.py", line 2123, in run
    super().run()
  File "/data/users/williamwen/pytorch/torch/_dynamo/symbolic_convert.py", line 818, in run
    and self.step()
  File "/data/users/williamwen/pytorch/torch/_dynamo/symbolic_convert.py", line 781, in step
    getattr(self, inst.opname)(inst)
  File "/data/users/williamwen/pytorch/torch/_dynamo/symbolic_convert.py", line 2238, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/data/users/williamwen/pytorch/torch/_dynamo/output_graph.py", line 912, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "/data/users/williamwen/py310-env/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/williamwen/pytorch/torch/_dynamo/output_graph.py", line 1080, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/data/users/williamwen/pytorch/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/williamwen/pytorch/torch/_dynamo/output_graph.py", line 1152, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/data/users/williamwen/pytorch/torch/_dynamo/output_graph.py", line 1133, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/data/users/williamwen/pytorch/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/data/users/williamwen/pytorch/torch/__init__.py", line 1657, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/data/users/williamwen/pytorch/torch/_inductor/compile_fx.py", line 1168, in compile_fx
    return aot_autograd(
  File "/data/users/williamwen/pytorch/torch/_dynamo/backends/common.py", line 55, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/data/users/williamwen/pytorch/torch/_functorch/aot_autograd.py", line 4938, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/data/users/williamwen/pytorch/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/williamwen/pytorch/torch/_functorch/aot_autograd.py", line 4478, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/data/users/williamwen/pytorch/torch/_functorch/aot_autograd.py", line 2813, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/data/users/williamwen/pytorch/torch/_functorch/aot_autograd.py", line 2999, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/data/users/williamwen/pytorch/torch/_functorch/aot_autograd.py", line 3700, in aot_dispatch_autograd
    fx_g, joint_inputs, maybe_subclass_meta = aot_dispatch_autograd_graph(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/data/users/williamwen/pytorch/torch/_functorch/aot_autograd.py", line 3680, in aot_dispatch_autograd_graph
    fx_g = create_graph(joint_fn_to_trace, updated_joint_inputs, aot_config=aot_config)
  File "/data/users/williamwen/pytorch/torch/_functorch/aot_autograd.py", line 1943, in create_graph
    fx_g = make_fx(f, decomposition_table=aot_config.decompositions)(*args)
  File "/data/users/williamwen/pytorch/torch/fx/experimental/proxy_tensor.py", line 869, in wrapped
    t = dispatch_trace(wrap_key(func, args, fx_tracer, pre_dispatch), tracer=fx_tracer, concrete_args=tuple(phs))
  File "/data/users/williamwen/pytorch/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/data/users/williamwen/pytorch/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/data/users/williamwen/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/users/williamwen/pytorch/torch/fx/experimental/proxy_tensor.py", line 481, in dispatch_trace
    graph = tracer.trace(root, concrete_args)
  File "/data/users/williamwen/pytorch/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/data/users/williamwen/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/users/williamwen/pytorch/torch/fx/_symbolic_trace.py", line 821, in trace
    (self.create_arg(fn(*args)),),
  File "/data/users/williamwen/pytorch/torch/fx/_symbolic_trace.py", line 688, in flatten_fn
    tree_out = root_fn(*tree_args)
  File "/data/users/williamwen/pytorch/torch/fx/experimental/proxy_tensor.py", line 517, in wrapped
    out = f(*tensors)
  File "/data/users/williamwen/pytorch/torch/_functorch/aot_autograd.py", line 1929, in joint_helper
    return functionalized_f_helper(primals, tangents)
  File "/data/users/williamwen/pytorch/torch/_functorch/aot_autograd.py", line 1882, in functionalized_f_helper
    f_outs = fn(*f_args)
  File "/data/users/williamwen/pytorch/torch/_functorch/aot_autograd.py", line 1850, in inner_fn_with_anomaly
    return inner_fn(*args)
  File "/data/users/williamwen/pytorch/torch/_functorch/aot_autograd.py", line 1833, in inner_fn
    backward_out = torch.autograd.grad(
  File "/data/users/williamwen/pytorch/torch/autograd/__init__.py", line 411, in grad
    result = Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 12, 16384]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

bdhirsh commented 9 months ago

@Lokiiiiii can you re-open this if you're still seeing an issue? David's smaller repro above no longer fails with the original error, as Yanbo pointed out. The new error is actually because the minimized repro isn't quite valid: even in eager mode, that code will fail if you call out.sum().backward(), because the repro code mutates the output of relu(), which was saved for backward.
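
For completeness, a variant of the minimal repro with the in-place division made out-of-place (a sketch: it makes the eager backward valid, though whether it still reproduces the original stride issue is untested):

import torch

x = torch.rand((1, 12, 256 * 64), requires_grad=True)

def transpose_for_scores(x):
    new_x_shape = x.size()[:-1] + (256, -1)
    x = x.view(new_x_shape)
    return x.permute(0, 2, 1, 3)

def fn(x):
    scale_factor = 0.5
    x = x.relu()
    x = transpose_for_scores(x)
    # Out-of-place divide, so the saved output of relu() is not mutated.
    x = x / torch.sqrt(torch.tensor(x.size(-1), dtype=torch.float) * scale_factor)
    return x.transpose(-1, -2)

out = fn(x)
out.sum().backward()   # now valid in eager
torch.compile(fn)(x)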