iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

CUDA requesting shared memory size larger than allowed size #12771

Open mariecwhite opened 1 year ago

mariecwhite commented 1 year ago

What happened?

Getting this error for many models recently:

/work/runtime/src/iree/hal/drivers/cuda/native_executable.c:136: INTERNAL; CUDA driver error: Requested shared memory size of 421376 larger than allowed size of 166912; while invoking native function hal.executable.create; while calling import; 
[ 1]   native hal.executable.create:0 -
[ 0] bytecode module.__init:2050 <eval_with_key>.65:118:14

Steps to reproduce your issue

  1. Download https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/linalg.mlir

  2. Compile:

iree-compile --iree-hal-target-backends=cuda \
    --iree-input-type=none \
    --iree-hal-cuda-llvm-target-arch=sm_80 \
    linalg.mlir -o linalg.vmfb
  3. Run:
iree-benchmark-module --module=linalg.vmfb \
    --function=forward \
    --input=1x4x64x64xf32=0 \
    --device_allocator=caching \
    --device=cuda://0

What component(s) does this issue relate to?

Runtime

Version information

Based on IREE SHA c6092c4

Additional context

Also seeing this in SHARK: https://github.com/nod-ai/SHARK/issues/1243

mariecwhite commented 1 year ago

FYI @monorimet

ThomasRaoux commented 1 year ago

I believe we had seen this issue before. Somehow a tensor.empty op is used as an operand of an elementwise op:

    %89 = tensor.empty() : tensor<1x4096x4096xf32>
    %90 = tensor.empty() : tensor<1x512x4096xf32>
    %91 = linalg.generic {indexing_maps = [#map5, #map6], iterator_types = ["parallel", "parallel", "parallel"]} ins(%collapsed_175 : tensor<1x4096x512xf32>) outs(%90 : tensor<1x512x4096xf32>) {
    ^bb0(%in: f32, %out: f32):
      linalg.yield %in : f32
    } -> tensor<1x512x4096xf32>
    %92 = linalg.fill ins(%cst_14 : f32) outs(%89 : tensor<1x4096x4096xf32>) -> tensor<1x4096x4096xf32>
    %93 = linalg.batch_matmul ins(%collapsed_173, %91 : tensor<1x4096x512xf32>, tensor<1x512x4096xf32>) outs(%92 : tensor<1x4096x4096xf32>) -> tensor<1x4096x4096xf32>
    %94 = linalg.generic {indexing_maps = [#map9, #map5], iterator_types = ["parallel", "parallel", "parallel"]} ins(%93 : tensor<1x4096x4096xf32>) outs(%89 : tensor<1x4096x4096xf32>) {
    ^bb0(%in: f32, %out: f32):
      %716 = arith.mulf %in, %cst_9 : f32
      linalg.yield %716 : f32
    } -> tensor<1x4096x4096xf32>
    %95 = linalg.generic {indexing_maps = [#map9, #map9, #map5], iterator_types = ["parallel", "parallel", "parallel"]} ins(%94, %89 : tensor<1x4096x4096xf32>, tensor<1x4096x4096xf32>) outs(%89 : tensor<1x4096x4096xf32>) {
    ^bb0(%in: f32, %in_320: f32, %out: f32):
      %716 = arith.mulf %in_320, %cst_14 : f32
      %717 = arith.addf %in, %716 : f32
      linalg.yield %717 : f32
    } -> tensor<1x4096x4096xf32>

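A rough PyTorch-level sketch of the kind of code such IR could come from (shapes are taken from the snippet above; the constant values and variable names are placeholders, not the model's actual source):

    import torch

    # Illustrative only: shapes match the IR above, the constants stand in for
    # %cst_9 / %cst_14 (their values are not shown in the snippet).
    a = torch.randn(1, 4096, 512)
    b = torch.randn(1, 4096, 512)
    cst_9 = 0.05   # placeholder scale
    cst_14 = 0.0   # matches its use as the linalg.fill value for the matmul accumulator

    empty = torch.empty(1, 4096, 4096)         # becomes %89 = tensor.empty()
    prod = torch.bmm(a, b.transpose(-1, -2))   # %93 = linalg.batch_matmul
    scaled = prod * cst_9                      # %94: first linalg.generic (mulf)
    result = scaled + cst_14 * empty           # %95: elementwise op reading the empty tensor
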
Is this a front end problem? @ramiro050, would you know?

ramiro050 commented 1 year ago

> Is this a front end problem? @ramiro050, would you know?

Might be. Normally we do zero out tensors before passing them to linalg.generic, but this might be a case that got missed. @mariecwhite, do you have the torch-dialect MLIR for this model?

mariecwhite commented 1 year ago

Is that the mlir from calling torch-mlir compile? It's here: https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/linalg.mlir

ramiro050 commented 1 year ago

> Is that the mlir from calling torch-mlir compile? It's here: https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/linalg.mlir

torch_mlir.compile but with output_type="torch"
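
For reference, a minimal sketch of that invocation (the module below is a toy stand-in for the actual SD VAE nn.Module; the input shape matches the --input flag in the repro steps):

    import torch
    import torch_mlir

    class Toy(torch.nn.Module):  # toy stand-in for the actual SD VAE module
        def forward(self, x):
            return x * 2.0

    example_input = torch.zeros(1, 4, 64, 64, dtype=torch.float32)  # matches --input=1x4x64x64xf32
    torch_dialect = torch_mlir.compile(Toy(), example_input, output_type="torch")
    print(torch_dialect)  # output_type="linalg-on-tensors" / "raw" give the other artifacts in this thread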

mariecwhite commented 1 year ago

Here it is: https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/vae_torch.mlir

ramiro050 commented 1 year ago

There is an empty tensor being fed to the torch.aten.add.Tensor op:

    %258 = torch.prim.ListConstruct %int1, %int4096, %int4096 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
    %259 = torch.aten.empty.memory_format %258, %int6, %none, %cpu, %false, %none : !torch.list<int>, !torch.int, !torch.none, !torch.Device, !torch.bool, !torch.none -> !torch.vtensor<[1,4096,4096],f32>
    %260 = torch.aten.transpose.int %254, %int-1, %int-2 : !torch.vtensor<[1,4096,512],f32>, !torch.int, !torch.int -> !torch.vtensor<[1,512,4096],f32>
    %261 = torch.aten.bmm %251, %260 : !torch.vtensor<[1,4096,512],f32>, !torch.vtensor<[1,512,4096],f32> -> !torch.vtensor<[1,4096,4096],f32>
    %262 = torch.aten.mul.Scalar %261, %float4.419420e-02 : !torch.vtensor<[1,4096,4096],f32>, !torch.float -> !torch.vtensor<[1,4096,4096],f32>
    %263 = torch.aten.add.Tensor %262, %259, %int0 : !torch.vtensor<[1,4096,4096],f32>, !torch.vtensor<[1,4096,4096],f32>, !torch.int -> !torch.vtensor<[1,4096,4096],f32>

While there is a small chance that a bug in torch-mlir is responsible for this, it could also be a bug in the model itself.

@mariecwhite, can you link one last IR: output_type="raw"? This will allow me to say for sure if torch-mlir is causing this or not.

mariecwhite commented 1 year ago

Raw uploaded here: https://storage.googleapis.com/iree-model-artifacts/pytorch/torch_models_20230321.784_1679461251/SD_VAE_MODEL/batch_1/vae_raw_after_fx.mlir

ramiro050 commented 1 year ago

This seems to be a bug in the model, not in torch-mlir:

    %278 = torch.prim.ListConstruct %int1, %int4096, %int4096 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
    %279 = torch.aten.empty.memory_format %278, %int6, %none_0, %cpu, %false, %none_0 : !torch.list<int>, !torch.int, !torch.none, !torch.Device, !torch.bool, !torch.none -> !torch.tensor
    %280 = torch.aten.transpose.int %271, %int-1, %int-2 : !torch.tensor, !torch.int, !torch.int -> !torch.tensor
    %281 = torch.aten.baddbmm %279, %265, %280, %int0, %float4.419420e-02 : !torch.tensor, !torch.tensor, !torch.tensor, !torch.int, !torch.float -> !torch.tensor

An empty tensor is being used as the first argument for baddbmm. This also appears in the Python code string that FX graphs have attached to them:

    empty = torch.ops.aten.empty([1, 4096, 4096], dtype = torch.float32, device = device(type='cpu'), pin_memory = False)
    transpose_1 = torch.ops.aten.transpose(view_13, -1, -2);  view_13 = None
    baddbmm = torch.ops.aten.baddbmm(empty, view_11, transpose_1, beta = 0, alpha = 0.044194173824159216);  empty = view_11 = transpose_1 = None
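
For reference on the semantics: torch.baddbmm(input, batch1, batch2, beta=b, alpha=a) computes b * input + a * (batch1 @ batch2), and PyTorch documents that when beta is 0 the input tensor is ignored (NaN/Inf in it are not propagated). A quick sketch checking that documented behaviour, with a NaN-filled tensor standing in for whatever torch.empty happens to contain:

    import torch

    a = torch.randn(1, 4096, 512)
    b = torch.randn(1, 512, 4096)
    alpha = 0.044194173824159216

    # NaN-filled tensor stands in for the uninitialized contents of torch.empty().
    garbage = torch.full((1, 4096, 4096), float("nan"))
    out = torch.baddbmm(garbage, a, b, beta=0, alpha=alpha)
    ref = alpha * torch.bmm(a, b)
    print(torch.allclose(out, ref))  # True: with beta=0 the first argument does not affect the result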

allieculp commented 1 year ago

@ramiro050 Can this be closed? Or are there additional questions here?

ramiro050 commented 1 year ago

If the CUDA error is caused by the zero tensor being used as an argument, then this seems to be an issue with the model and not with IREE. @mariecwhite, can you confirm?

mariecwhite commented 1 year ago

My implementation provides non-zero tensors as input. There is something else going on. @ramiro050, is there a way to visualize the FX graph?

ramiro050 commented 1 year ago

Sorry, I meant empty tensors being used as arguments in the baddbmm op. If you have a torch.fx graph module, you can print the graph by doing print(my_module.graph). You can also see the Python code representation by doing print(my_module.code)
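
For example, with any traced torch.fx GraphModule (toy module here, not the actual model):

    import torch
    import torch.fx

    class Toy(torch.nn.Module):  # toy stand-in, not the VAE
        def forward(self, x):
            return torch.relu(x) + 1.0

    gm = torch.fx.symbolic_trace(Toy())
    print(gm.graph)  # node-by-node view of the traced graph
    print(gm.code)   # generated Python for forward(), like the code string above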

allieculp commented 1 year ago

@ramiro050 @mariecwhite Any update on this one?

mariecwhite commented 1 year ago

I haven't had cycles to look into this. I'll try and look into it next week.