iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[compilation][cpu]: failed to legalize operation onnx.Multinomial #18268

Closed: pdhirajkumarprasad closed this issue 1 month ago

pdhirajkumarprasad commented 2 months ago

What happened?

For the given IR:

module {
  func.func @"torch-jit-export"( %arg6: !torch.vtensor<[?,4],f32>) -> (!torch.vtensor<[?,1],si64>) attributes {torch.onnx_meta.ir_version = 6 : si64, torch.onnx_meta.opset_version = 21 : si64, torch.onnx_meta.producer_name = "pytorch", torch.onnx_meta.producer_version = "1.7"} {
    %82 = torch.operator "onnx.Multinomial"(%arg6) {torch.onnx.dtype = 7 : si64, torch.onnx.sample_size = 1 : si64} : (!torch.vtensor<[?,4],f32>) -> !torch.vtensor<[?,1],si64> 
    return %82: !torch.vtensor<[?,1],si64>
  }
}

Getting the following error:

t1.mlir:3:11: error: 'arith.addf' op requires the same type for all operands and results
    %82 = torch.operator "onnx.Multinomial"(%arg6) {torch.onnx.dtype = 7 : si64, torch.onnx.sample_size = 1 : si64} : (!torch.vtensor<[?,4],f32>) -> !torch.vtensor<[?,1],si64> 
          ^
// -----// IR Dump After ConvertTorchToLinalg Failed (convert-torch-to-linalg) //----- //
"func.func"() <{function_type = (!torch.vtensor<[?,4],f32>) -> !torch.vtensor<[?,1],si64>, sym_name = "torch-jit-export"}> ({
^bb0(%arg0: !torch.vtensor<[?,4],f32> loc("t1.mlir":2:34)):
  %0 = "torch_c.to_builtin_tensor"(%arg0) : (!torch.vtensor<[?,4],f32>) -> tensor<?x4xf32> loc("t1.mlir":3:11)
  %1 = "torch.constant.int"() <{value = 1 : i64}> : () -> !torch.int loc("t1.mlir":3:11)
  %2 = "arith.constant"() <{value = 1 : i64}> : () -> i64 loc("t1.mlir":3:11)
  %3 = "torch.constant.none"() : () -> !torch.none loc("t1.mlir":3:11)
  %4 = "torch.constant.bool"() <{value = true}> : () -> !torch.bool loc("t1.mlir":3:11)
  %5 = "arith.constant"() <{value = 0 : i64}> : () -> i64 loc("t1.mlir":3:11)
  %6 = "arith.constant"() <{value = 1 : i64}> : () -> i64 loc("t1.mlir":3:11)
  %7 = "arith.constant"() <{value = 0 : index}> : () -> index loc("t1.mlir":3:11)
  %8 = "arith.constant"() <{value = 1 : index}> : () -> index loc("t1.mlir":3:11)
  %9 = "arith.index_cast"(%2) : (i64) -> index loc("t1.mlir":3:11)
  %10 = "tensor.dim"(%0, %7) : (tensor<?x4xf32>, index) -> index loc("t1.mlir":3:11)
  %11 = "tensor.dim"(%0, %8) : (tensor<?x4xf32>, index) -> index loc("t1.mlir":3:11)

IREE Version: IREE compiler version 20240819.990 @ aeda14995f16ed1302db616adf0c03acf80f27ee LLVM version 20.0.0git

Steps to reproduce your issue

Command to reproduce the issue:

iree-compile --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-hal-target-backends=llvm-cpu model.torch_onnx.mlir

What component(s) does this issue relate to?

Compiler

Version information

No response

Additional context

No response

nirvedhmeshram commented 2 months ago

@pdhirajkumarprasad For this and all the other bugs you filed recently, could you please add the flag --mlir-print-ir-after-all to the iree-compile command, copy the dump to a file, and share a link on the issue? It will help us understand which layer the issue is in and whom to assign it to. cc @MaheshRavishankar
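For example, combined with the reproduce command from the issue description (capturing stderr in a file is just one way to save the dump; the file name here is arbitrary):

iree-compile --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-hal-target-backends=llvm-cpu --mlir-print-ir-after-all model.torch_onnx.mlir 2> ir_dump.txt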

pdhirajkumarprasad commented 2 months ago

> @pdhirajkumarprasad For this and all the other bugs you filed recently, could you please add the flag --mlir-print-ir-after-all to the iree-compile command, copy the dump to a file, and share a link on the issue? It will help us understand which layer the issue is in and whom to assign it to. cc @MaheshRavishankar

@nirvedhmeshram I have updated all my issues with the stage where it's failing, along with the version used.

pdhirajkumarprasad commented 2 months ago

Also, this may be related to https://github.com/llvm/torch-mlir/pull/3630.

lialan commented 2 months ago

@zjgarvey @rsuderman Can you provide some context on this issue? As @pdhirajkumarprasad mentioned, it's possible that it's related to that PR.

zjgarvey commented 2 months ago

Yeah, even with the PR linked above, this op will still fail to compile. I'll post a linalg reproducer for the next compilation issue here today. This is another op that compiles successfully through our ref-backend pipeline in torch-mlir e2e testing, but not through iree-compile.

pashu123 commented 2 months ago

@zjgarvey The error showing here is orthogonal to the IREE issue; the op isn't converting from torch to linalg. Maybe the lowering was recently added and the torch-mlir submodule hasn't been bumped in IREE.

zjgarvey commented 2 months ago

@pashu123 The PR here https://github.com/llvm/torch-mlir/pull/3630 resolves the torch-mlir error.

zjgarvey commented 2 months ago

If someone wants to review that PR, we can get it merged and then work on the next issue for this op not compiling.

lialan commented 2 months ago

@zjgarvey So @pashu123 has approved the PR, and it looks clear. Please proceed.

zjgarvey commented 2 months ago

Ah, right. Here is the IR generated from an e2e test in torch-mlir: "MultinomialModule2D_basic".

This compiles successfully through the ref-backend pipeline in torch-mlir, but fails to compile with IREE for llvm-cpu.

#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0)>
#map2 = affine_map<(d0, d1) -> ()>
#map3 = affine_map<() -> ()>
module attributes {torch.debug_module_name = "MultinomialModule2D"} {
  ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>
  func.func @forward(%arg0: tensor<?x?xf64>) -> tensor<f64> {
    %c1048576_i64 = arith.constant 1048576 : i64
    %c0_i64 = arith.constant 0 : i64
    %c1_i64 = arith.constant 1 : i64
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %cst = arith.constant 0.000000e+00 : f64
    %c32_i64 = arith.constant 32 : i64
    %c2_i64 = arith.constant 2 : i64
    %c1048576 = arith.constant 1048576 : index
    %cst_0 = arith.constant 5.4210107999999998E-20 : f64
    %c6364136223846793005_i64 = arith.constant 6364136223846793005 : i64
    %c1442695040888963407_i64 = arith.constant 1442695040888963407 : i64
    %dim = tensor.dim %arg0, %c0 : tensor<?x?xf64>
    %dim_1 = tensor.dim %arg0, %c1 : tensor<?x?xf64>
    %0 = arith.index_cast %dim : index to i64
    %1 = arith.index_cast %dim_1 : index to i64
    %2 = tensor.empty(%dim) : tensor<?x1048576xi64>
    %3 = tensor.empty(%dim) : tensor<?xf64>
    %4 = linalg.fill ins(%cst : f64) outs(%3 : tensor<?xf64>) -> tensor<?xf64>
    %5 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%arg0 : tensor<?x?xf64>) outs(%4 : tensor<?xf64>) {
    ^bb0(%in: f64, %out: f64):
      %13 = arith.addf %in, %out : f64
      linalg.yield %13 : f64
    } -> tensor<?xf64>
    %6 = scf.for %arg1 = %c0_i64 to %0 step %c1_i64 iter_args(%arg2 = %2) -> (tensor<?x1048576xi64>)  : i64 {
      %13 = arith.index_cast %arg1 : i64 to index
      %extracted = tensor.extract %5[%13] : tensor<?xf64>
      %14 = tensor.empty(%dim_1) : tensor<?xf64>
      %15 = scf.for %arg3 = %c0_i64 to %1 step %c1_i64 iter_args(%arg4 = %14) -> (tensor<?xf64>)  : i64 {
        %19 = arith.cmpi sgt, %arg3, %c0_i64 : i64
        %20 = arith.index_cast %arg3 : i64 to index
        %extracted_4 = tensor.extract %arg0[%13, %20] : tensor<?x?xf64>
        %21 = arith.divf %extracted_4, %extracted : f64
        %22 = scf.if %19 -> (f64) {
          %23 = arith.subi %arg3, %c1_i64 : i64
          %24 = arith.index_cast %23 : i64 to index
          %extracted_6 = tensor.extract %arg4[%24] : tensor<?xf64>
          %25 = arith.addf %21, %extracted_6 : f64
          scf.yield %25 : f64
        } else {
          scf.yield %21 : f64
        }
        %inserted_5 = tensor.insert %22 into %arg4[%20] : tensor<?xf64>
        scf.yield %inserted_5 : tensor<?xf64>
      }
      %global_seed = ml_program.global_load @global_seed : tensor<i64>
      %extracted_3 = tensor.extract %global_seed[] : tensor<i64>
      %16 = arith.muli %extracted_3, %c6364136223846793005_i64 : i64
      %17 = arith.addi %16, %c1442695040888963407_i64 : i64
      %inserted = tensor.insert %17 into %global_seed[] : tensor<i64>
      ml_program.global_store @global_seed = %inserted : tensor<i64>
      %18 = scf.for %arg3 = %c0_i64 to %c1048576_i64 step %c1_i64 iter_args(%arg4 = %arg2) -> (tensor<?x1048576xi64>)  : i64 {
        %19 = arith.muli %arg3, %17 : i64
        %20 = arith.addi %19, %17 : i64
        %21 = arith.muli %19, %19 : i64
        %22 = arith.addi %21, %19 : i64
        %23 = arith.shli %22, %c32_i64 : i64
        %24 = arith.shrui %22, %c32_i64 : i64
        %25 = arith.ori %23, %24 : i64
        %26 = arith.muli %25, %25 : i64
        %27 = arith.addi %26, %20 : i64
        %28 = arith.shli %27, %c32_i64 : i64
        %29 = arith.shrui %27, %c32_i64 : i64
        %30 = arith.ori %28, %29 : i64
        %31 = arith.muli %30, %30 : i64
        %32 = arith.addi %31, %19 : i64
        %33 = arith.shli %32, %c32_i64 : i64
        %34 = arith.shrui %32, %c32_i64 : i64
        %35 = arith.ori %33, %34 : i64
        %36 = arith.muli %35, %35 : i64
        %37 = arith.addi %36, %20 : i64
        %38 = arith.shli %37, %c32_i64 : i64
        %39 = arith.shrui %37, %c32_i64 : i64
        %40 = arith.ori %38, %39 : i64
        %41 = arith.muli %40, %40 : i64
        %42 = arith.addi %41, %19 : i64
        %43 = arith.shrui %42, %c32_i64 : i64
        %44 = arith.xori %37, %43 : i64
        %45 = arith.uitofp %44 : i64 to f64
        %46 = arith.mulf %45, %cst_0 : f64
        %47 = arith.addf %46, %cst : f64
        %48:2 = scf.while (%arg5 = %c0_i64, %arg6 = %1) : (i64, i64) -> (i64, i64) {
          %50 = arith.cmpi sgt, %arg6, %arg5 : i64
          scf.condition(%50) %arg5, %arg6 : i64, i64
        } do {
        ^bb0(%arg5: i64, %arg6: i64):
          %50 = arith.subi %arg6, %arg5 : i64
          %51 = arith.divsi %50, %c2_i64 : i64
          %52 = arith.addi %arg5, %51 : i64
          %53 = arith.index_cast %52 : i64 to index
          %extracted_5 = tensor.extract %15[%53] : tensor<?xf64>
          %54 = arith.cmpf olt, %extracted_5, %47 : f64
          %55 = arith.select %54, %arg6, %52 : i64
          %56 = scf.if %54 -> (i64) {
            %57 = arith.addi %52, %c1_i64 : i64
            scf.yield %57 : i64
          } else {
            scf.yield %arg5 : i64
          }
          scf.yield %56, %55 : i64, i64
        }
        %49 = arith.index_cast %arg3 : i64 to index
        %inserted_4 = tensor.insert %48#0 into %arg4[%13, %49] : tensor<?x1048576xi64>
        scf.yield %inserted_4 : tensor<?x1048576xi64>
      }
      scf.yield %18 : tensor<?x1048576xi64>
    }
    %7 = tensor.empty() : tensor<f64>
    %8 = linalg.fill ins(%cst : f64) outs(%7 : tensor<f64>) -> tensor<f64>
    %9 = linalg.generic {indexing_maps = [#map, #map2], iterator_types = ["reduction", "reduction"]} ins(%6 : tensor<?x1048576xi64>) outs(%8 : tensor<f64>) {
    ^bb0(%in: i64, %out: f64):
      %13 = arith.sitofp %in : i64 to f64
      %14 = arith.addf %13, %out : f64
      linalg.yield %14 : f64
    } -> tensor<f64>
    %dim_2 = tensor.dim %6, %c0 : tensor<?x1048576xi64>
    %10 = arith.muli %dim_2, %c1048576 : index
    %11 = arith.index_cast %10 : index to i64
    %12 = linalg.generic {indexing_maps = [#map3, #map3], iterator_types = []} ins(%9 : tensor<f64>) outs(%7 : tensor<f64>) {
    ^bb0(%in: f64, %out: f64):
      %13 = arith.sitofp %11 : i64 to f64
      %14 = arith.divf %in, %13 : f64
      linalg.yield %14 : f64
    } -> tensor<f64>
    return %12 : tensor<f64>
  }
}

repro:

iree-compile --iree-hal-target-backends=llvm-cpu repro.mlir -o repro.vmfb

hits an assertion:

iree-compile: iree/compiler/src/iree/compiler/Dialect/Util/Transforms/HoistIntoGlobals.cpp:105: auto mlir::iree_compiler::IREE::Util::(anonymous namespace)::HoistIntoGlobalsPass::runOnOperation()::(anonymous class)::operator()(mlir::Operation *) const: Assertion `resultInfo && "must have const-expr info"' failed.

MaheshRavishankar commented 2 months ago

Ok... we haven't looked at programs like this. I have no idea what happens when we compile this. It might take some effort. It would be good to get a signal on the priority of this.

MaheshRavishankar commented 2 months ago

Changed the assignee to me for now. It will take me a bit to look at this, but I can reprioritize if needed.

zjgarvey commented 2 months ago

> Ok... we haven't looked at programs like this. I have no idea what happens when we compile this. It might take some effort. It would be good to get a signal on the priority of this.

It's a failure in one of the migraphx models, so I imagine it's moderate to high priority.

MaheshRavishankar commented 2 months ago

Ok, but this seems like it is pretty much lowered to scalar code from the outset... I am not sure what the computation is, or whether we can lower it to something like linalg. All the scf.if and scf.for at this level is not what you expect coming into IREE. There isn't anything really good we can do with such code (and yes, we should be able to compile it and get poor performance, but that will take cycles).

benvanik commented 2 months ago

Yeah, that loop is going to be a problem (it's going to run entirely on the host in the VM interpreter). Tensorizing may help a bit, but it really just shouldn't be scalarized like that.

zjgarvey commented 2 months ago

For context:

The outermost scf loop is essentially over batches. Inside that, the first scf loop computes a cumulative sum of the input distribution for the given batch; this step generates a cumulative distribution function (CDF) from the input distribution. Maybe it could be converted to a TM tensor scan op instead.

Afterwards, for each desired sample, it computes a "random" probability value in [0,1], then performs a binary search through the CDF outputs to find the last event whose cumulative probability is less than or equal to the randomly selected probability value.
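
For context, a minimal Python/NumPy sketch of that scheme (cumulative sum to form a CDF, one random draw per sample, then a binary search into the CDF) might look like the following; the function name and structure here are illustrative only and are not taken from the actual torch-mlir lowering:

# Illustrative sketch only (not the torch-mlir lowering): sample from a batch of
# unnormalized distributions by building a CDF and binary-searching into it.
import numpy as np

def multinomial_sample(weights: np.ndarray, num_samples: int, rng=None) -> np.ndarray:
    """weights: (batch, num_events) unnormalized; returns (batch, num_samples) event indices."""
    rng = np.random.default_rng() if rng is None else rng
    batch, num_events = weights.shape
    out = np.empty((batch, num_samples), dtype=np.int64)
    for b in range(batch):                              # outer loop over batches
        cdf = np.cumsum(weights[b] / weights[b].sum())  # cumulative distribution function
        for s in range(num_samples):
            u = rng.random()                            # "random" value in [0, 1)
            # First index whose cumulative probability exceeds u, i.e. one past the
            # last event whose cumulative probability is <= u (clamped for float edge cases).
            out[b, s] = min(np.searchsorted(cdf, u, side="right"), num_events - 1)
    return out

# Example: two rows of unnormalized weights, one sample per row.
print(multinomial_sample(np.array([[1.0, 3.0], [2.0, 2.0]]), num_samples=1))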

benvanik commented 2 months ago

I'd start by looking at converting the ops if there are equivalents - it's absolutely required that those loops end up inside a dispatch region, and doing that with linalg ops (or things that lower to them, or something adjacent) is pretty much the only way. Basic loops around tensor operations can be ok (though we have some things to do to make those better), but doing global stores inside the loop, along with any tensor.extract/tensor.insert_slice, is in "it'd be faster to do this work on pen and paper" territory and a big red flag when looking at models. You can get in the habit of grepping for tensor.extract; if you find any of them, know that your performance is going to be somewhere between bad and indistinguishable-from-hangs :P
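
For instance, a quick check on the linalg reproducer above could be as simple as the following (repro.mlir being the file name from the repro command):

grep -n "tensor.extract" repro.mlir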

pdhirajkumarprasad commented 1 month ago

Closing this one, as https://github.com/nod-ai/SHARK-ModelDev/issues/854 has been opened to track the front-end issue.