mariecwhite opened this issue 1 year ago
FYI @monorimet
I believe we have seen this issue before. Somehow a tensor.empty op is used as an operand of an elementwise op:
%89 = tensor.empty() : tensor<1x4096x4096xf32>
%90 = tensor.empty() : tensor<1x512x4096xf32>
%91 = linalg.generic {indexing_maps = [#map5, #map6], iterator_types = ["parallel", "parallel", "parallel"]} ins(%collapsed_175 : tensor<1x4096x512xf32>) outs(%90 : tensor<1x512x4096xf32>) {
^bb0(%in: f32, %out: f32):
linalg.yield %in : f32
} -> tensor<1x512x4096xf32>
%92 = linalg.fill ins(%cst_14 : f32) outs(%89 : tensor<1x4096x4096xf32>) -> tensor<1x4096x4096xf32>
%93 = linalg.batch_matmul ins(%collapsed_173, %91 : tensor<1x4096x512xf32>, tensor<1x512x4096xf32>) outs(%92 : tensor<1x4096x4096xf32>) -> tensor<1x4096x4096xf32>
%94 = linalg.generic {indexing_maps = [#map9, #map5], iterator_types = ["parallel", "parallel", "parallel"]} ins(%93 : tensor<1x4096x4096xf32>) outs(%89 : tensor<1x4096x4096xf32>) {
^bb0(%in: f32, %out: f32):
%716 = arith.mulf %in, %cst_9 : f32
linalg.yield %716 : f32
} -> tensor<1x4096x4096xf32>
%95 = linalg.generic {indexing_maps = [#map9, #map9, #map5], iterator_types = ["parallel", "parallel", "parallel"]} ins(%94, %89 : tensor<1x4096x4096xf32>, tensor<1x4096x4096xf32>) outs(%89 : tensor<1x4096x4096xf32>) {
^bb0(%in: f32, %in_320: f32, %out: f32):
%716 = arith.mulf %in_320, %cst_14 : f32
%717 = arith.addf %in, %716 : f32
linalg.yield %717 : f32
} -> tensor<1x4096x4096xf32>
Is this a front end problem? @ramiro050, would you know?
Might be. Normally we do zero out tensors before passing them to linalg.generic, but this might be a case that got missed. @mariecwhite, do you have the torch-dialect MLIR for this model?
Is that the mlir from calling torch-mlir compile? It's here: https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/linalg.mlir
torch_mlir.compile, but with output_type="torch".
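For reference, a minimal sketch of producing that torch-dialect IR, assuming the torch_mlir.compile API as of early 2023; the tiny module and input below are hypothetical stand-ins for the SD VAE model:

import torch
import torch_mlir

class Tiny(torch.nn.Module):
    def forward(self, x):
        return x * 2.0

example_input = torch.randn(1, 4)
# output_type="torch" emits torch-dialect MLIR; "linalg-on-tensors" produces IR like
# the linked linalg.mlir, and "raw" keeps the imported IR before the lowering passes.
torch_module = torch_mlir.compile(Tiny(), example_input, output_type="torch")
print(torch_module)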
There is an empty tensor being fed to the torch.aten.add.Tensor op:
%258 = torch.prim.ListConstruct %int1, %int4096, %int4096 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
%259 = torch.aten.empty.memory_format %258, %int6, %none, %cpu, %false, %none : !torch.list<int>, !torch.int, !torch.none, !torch.Device, !torch.bool, !torch.none -> !torch.vtensor<[1,4096,4096],f32>
%260 = torch.aten.transpose.int %254, %int-1, %int-2 : !torch.vtensor<[1,4096,512],f32>, !torch.int, !torch.int -> !torch.vtensor<[1,512,4096],f32>
%261 = torch.aten.bmm %251, %260 : !torch.vtensor<[1,4096,512],f32>, !torch.vtensor<[1,512,4096],f32> -> !torch.vtensor<[1,4096,4096],f32>
%262 = torch.aten.mul.Scalar %261, %float4.419420e-02 : !torch.vtensor<[1,4096,4096],f32>, !torch.float -> !torch.vtensor<[1,4096,4096],f32>
%263 = torch.aten.add.Tensor %262, %259, %int0 : !torch.vtensor<[1,4096,4096],f32>, !torch.vtensor<[1,4096,4096],f32>, !torch.int -> !torch.vtensor<[1,4096,4096],f32>
While it could still be a bug in torch-mlir that results in this (the chances are small), this could also be a bug in the model itself. @mariecwhite, can you link one last IR, with output_type="raw"? This will allow me to say for sure whether torch-mlir is causing this or not.
This seems to be a bug in the model, not in torch-mlir:
%278 = torch.prim.ListConstruct %int1, %int4096, %int4096 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
%279 = torch.aten.empty.memory_format %278, %int6, %none_0, %cpu, %false, %none_0 : !torch.list<int>, !torch.int, !torch.none, !torch.Device, !torch.bool, !torch.none -> !torch.tensor
%280 = torch.aten.transpose.int %271, %int-1, %int-2 : !torch.tensor, !torch.int, !torch.int -> !torch.tensor
%281 = torch.aten.baddbmm %279, %265, %280, %int0, %float4.419420e-02 : !torch.tensor, !torch.tensor, !torch.tensor, !torch.int, !torch.float -> !torch.tensor
An empty tensor is being used as the first argument to baddbmm. This also appears in the Python code string that FX graphs have attached to them:
empty = torch.ops.aten.empty([1, 4096, 4096], dtype = torch.float32, device = device(type='cpu'), pin_memory = False)
transpose_1 = torch.ops.aten.transpose(view_13, -1, -2); view_13 = None
baddbmm = torch.ops.aten.baddbmm(empty, view_11, transpose_1, beta = 0, alpha = 0.044194173824159216); empty = view_11 = transpose_1 = None
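For reference, a minimal eager-mode sketch of that pattern (shapes shrunk from 1x4096x512 for readability; the variable names are hypothetical). Per the torch.baddbmm documentation, beta=0 means the input tensor is ignored and NaN/Inf in it are not propagated, so in eager mode this reduces to alpha * bmm(a, transpose(b)):

import torch

a = torch.randn(1, 8, 4)                 # stands in for view_11
b = torch.randn(1, 8, 4)                 # stands in for view_13
empty = torch.empty(1, 8, 8)             # uninitialized; contents are arbitrary
out = torch.baddbmm(empty, a, b.transpose(-1, -2),
                    beta=0, alpha=0.044194173824159216)
ref = 0.044194173824159216 * torch.bmm(a, b.transpose(-1, -2))
print(torch.allclose(out, ref))          # expected: True in eager mode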
@ramiro050 Can this be closed? Or are there additional questions here?
If the CUDA error is caused by the zero tensor being used as an argument, then this seems to be an issue with the model and not with IREE. @mariecwhite, can you confirm?
My implementation provides non-zero tensors as input. There is something else going on. @ramiro050 is there a way to visualize the FX graph?
Sorry, I meant empty tensors being used as arguments in the baddbmm op. If you have a torch.fx graph module, you can print the graph by doing print(my_module.graph). You can also see the Python code representation by doing print(my_module.code).
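For example, a minimal sketch of inspecting an FX graph module; the toy module here is a hypothetical stand-in for the traced model:

import torch
import torch.fx

class MyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1.0

gm = torch.fx.symbolic_trace(MyModel())
print(gm.graph)            # textual listing of the graph nodes and their arguments
print(gm.code)             # generated Python source for forward()
gm.graph.print_tabular()   # one row per node (requires the tabulate package)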
@ramiro050 @mariecwhite Any update on this one?
I haven't had cycles to look into this. I'll try and look into it next week.
What happened?
Getting this error for many models recently:
Steps to reproduce your issue
1. Download https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/linalg.mlir
2. Compile:
What component(s) does this issue relate to?
Runtime
Version information
Based on IREE SHA c6092c4
Additional context
Also seeing this in SHARK: https://github.com/nod-ai/SHARK/issues/1243