pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

TIMM inference accuracy error: cait_m36_384 #93644

Closed by desertfire 1 year ago

desertfire commented 2 years ago

Repro:

benchmarks/timm_models.py -d cuda --inductor --float32  -k cait_m36_384
benchmarks/timm_models.py -d cuda --inductor --float32  -k ghostnet_100

cc @ezyang @soumith @msaroufim @wconstab @ngimel @bdhirsh

ezyang commented 1 year ago

This patch turns it into a runtime error:

diff --git a/aten/src/ATen/native/TensorShape.cpp b/aten/src/ATen/native/TensorShape.cpp
index 74b2cb8dcc2..5ca94b16978 100644
--- a/aten/src/ATen/native/TensorShape.cpp
+++ b/aten/src/ATen/native/TensorShape.cpp
@@ -1,4 +1,5 @@
 #include <ATen/ATen.h>
+#include <ATen/TensorSubclassLikeUtils.h>
 #include <ATen/AccumulateType.h>
 #include <ATen/ExpandUtils.h>
 #include <ATen/InferSize.h>
@@ -1387,7 +1388,7 @@ Tensor reshape_symint(const Tensor& self, c10::SymIntArrayRef proposed_shape) {
     //
     // We need to do the checks here instead of in `native_functions.yaml`
     // to preserve backwards compatibility.
-    if (!self.is_xla() && !self.is_lazy() && !self.is_ipu()) {
+    if (!self.is_xla() && !self.is_lazy() && !self.is_ipu() && !at::isTensorSubclassLike(self)) {
       return self._reshape_alias_symint(shape, stride.value());
     } else {
       return self.view_symint(shape);

  File "<eval_with_key>.11", line 415, in forward
    view_793 = torch.ops.aten.view.default(permute_194, [sym_size_247, 16]);  permute_194 = sym_size_247 = None
  File "/data/users/ezyang/pytorch-tmp/torch/_ops.py", line 257, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

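which means this is a stride mismatch: the view is only valid for the strides seen at trace time, not for the strides the tensor actually has at runtime.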
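
To illustrate why a permuted tensor can make `view` fail, here is a minimal pure-Python sketch of the stride condition. It is a simplified model, not PyTorch's actual implementation: it only covers flattening every dimension into one, it assumes every size is at least 1, and the helper names (`contiguous_strides`, `permute`, `can_flatten_without_copy`) are hypothetical.

```python
def contiguous_strides(sizes):
    """Row-major strides (in elements) for a densely packed tensor."""
    strides = []
    step = 1
    for size in reversed(sizes):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def permute(sizes, strides, order):
    """Reorder dimensions without moving data: sizes and strides move together."""
    return tuple(sizes[i] for i in order), tuple(strides[i] for i in order)


def can_flatten_without_copy(sizes, strides):
    """True if all dimensions can be merged into one without copying.

    Two adjacent dimensions can be merged only when stepping once along the
    outer one equals walking the whole inner one:
        outer_stride == inner_size * inner_stride
    Dimensions of size 1 are skipped because their stride is never used.
    """
    dims = [(n, s) for n, s in zip(sizes, strides) if n != 1]
    return all(
        outer_stride == inner_size * inner_stride
        for (_, outer_stride), (inner_size, inner_stride) in zip(dims, dims[1:])
    )


sizes = (2, 3)
strides = contiguous_strides(sizes)  # (3, 1)
print(can_flatten_without_copy(sizes, strides))  # True

# Same memory, dimensions swapped: sizes (3, 2), strides (1, 3).
p_sizes, p_strides = permute(sizes, strides, (1, 0))
print(can_flatten_without_copy(p_sizes, p_strides))  # False
```

In PyTorch terms, the second case corresponds to calling `.view(6)` on `torch.arange(6).reshape(2, 3).permute(1, 0)`, which raises the same `RuntimeError` as in the traceback, whereas `.reshape(6)` succeeds by copying.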

ezyang commented 1 year ago

cait_m36_384 also fails on aot_eager.

ezyang commented 1 year ago

cait is a stride mismatch problem; see https://github.com/pytorch/pytorch/pull/84246/commits/e8c2424eacbdb2f7598b256fd95b1f570eb06979

ezyang commented 1 year ago

cait still fails accuracy with inductor inference

(/home/ezyang/local/a/pytorch-env) [ezyang@devgpu020.ftw1 ~/local/a/pytorch (ab0e3db0)]$ CUDA_VISIBLE_DEVICES=1 python benchmarks/dynamo/timm_models.py --ci  --accuracy  --device cuda --inductor --only cait_m36_384
cuda eval  cait_m36_384                        [2023-02-02 07:16:01,597] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 1.16266, (ref-fp64): 0.00032 and shape=torch.Size([4, 1000])
FAIL
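
The log line above reports two RMSE values, each measured against an fp64 reference: one for the compiled result (`res-fp64`) and one for the eager result (`ref-fp64`). The sketch below shows one way such a comparison could be made in plain Python. The function names and the `multiplier` and `tol` values are assumptions for illustration, not the benchmark harness's confirmed thresholds.

```python
import math


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def passes_accuracy(res_error, ref_error, multiplier=2.0, tol=1e-3):
    """Accept the compiled result if its error against fp64 is not much
    larger than the eager result's own error against fp64.

    `multiplier` and `tol` are hypothetical values chosen for this sketch.
    """
    return res_error <= multiplier * ref_error + tol / 10.0


# The two RMSE values reported in the log for cait_m36_384.
print(passes_accuracy(res_error=1.16266, ref_error=0.00032))  # False
```

Under these assumed thresholds, a compiled-result error of 1.16266 is far above the allowance derived from the eager error of 0.00032, which is consistent with the `FAIL` reported in the log.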

ezyang commented 1 year ago

ghostnet is fine