@pdhirajkumarprasad For this and all the other bugs you filed recently, could you please add the flag --mlir-print-ir-after-all to the iree-compile command, copy the dump to a file, and share a link on the issue? It will help us understand which layer the issue is in and who to assign it to. cc @MaheshRavishankar
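For example, something like the following (file names here are placeholders; the per-pass dump is printed to stderr, so redirect it to a file):

iree-compile --iree-hal-target-backends=llvm-cpu repro.mlir -o repro.vmfb \
  --mlir-print-ir-after-all 2> ir_dump.txt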
@nirvedhmeshram I have updated all my issues with the stage where it's failing along with the version used.
Also, this may be related to https://github.com/llvm/torch-mlir/pull/3630
@zjgarvey @rsuderman Can you provide some context on this issue? As @pdhirajkumarprasad mentioned, it's possible that it's related to that PR.
Yeah, with the PR linked above, this op will still fail to compile. I'll post a linalg reproducer for the next compilation issue here today. This is another op that successfully compiles through our ref-backend pipeline in torch-mlir e2e testing, but not through iree-compile.
@zjgarvey The error showing here is orthogonal to the IREE issue: it's not converting from torch to linalg. Maybe the op was recently added and the torch-mlir submodule hasn't been bumped in IREE.
@pashu123 The PR here https://github.com/llvm/torch-mlir/pull/3630 resolves the torch-mlir error.
If someone wants to review that PR we can get it merged and work on the next issue for this op not compiling.
@zjgarvey So @pashu123 has approved the PR, and it looks clear. Please proceed.
Ah, right. Here is the IR generated from an e2e test in torch-mlir: "MultinomialModule2D_basic".
This successfully compiles through the ref backend pipeline in torch-mlir, but fails to compile with iree for llvm-cpu.
#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0)>
#map2 = affine_map<(d0, d1) -> ()>
#map3 = affine_map<() -> ()>
module attributes {torch.debug_module_name = "MultinomialModule2D"} {
ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>
func.func @forward(%arg0: tensor<?x?xf64>) -> tensor<f64> {
%c1048576_i64 = arith.constant 1048576 : i64
%c0_i64 = arith.constant 0 : i64
%c1_i64 = arith.constant 1 : i64
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%cst = arith.constant 0.000000e+00 : f64
%c32_i64 = arith.constant 32 : i64
%c2_i64 = arith.constant 2 : i64
%c1048576 = arith.constant 1048576 : index
%cst_0 = arith.constant 5.4210107999999998E-20 : f64
%c6364136223846793005_i64 = arith.constant 6364136223846793005 : i64
%c1442695040888963407_i64 = arith.constant 1442695040888963407 : i64
%dim = tensor.dim %arg0, %c0 : tensor<?x?xf64>
%dim_1 = tensor.dim %arg0, %c1 : tensor<?x?xf64>
%0 = arith.index_cast %dim : index to i64
%1 = arith.index_cast %dim_1 : index to i64
%2 = tensor.empty(%dim) : tensor<?x1048576xi64>
%3 = tensor.empty(%dim) : tensor<?xf64>
%4 = linalg.fill ins(%cst : f64) outs(%3 : tensor<?xf64>) -> tensor<?xf64>
%5 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%arg0 : tensor<?x?xf64>) outs(%4 : tensor<?xf64>) {
^bb0(%in: f64, %out: f64):
%13 = arith.addf %in, %out : f64
linalg.yield %13 : f64
} -> tensor<?xf64>
%6 = scf.for %arg1 = %c0_i64 to %0 step %c1_i64 iter_args(%arg2 = %2) -> (tensor<?x1048576xi64>) : i64 {
%13 = arith.index_cast %arg1 : i64 to index
%extracted = tensor.extract %5[%13] : tensor<?xf64>
%14 = tensor.empty(%dim_1) : tensor<?xf64>
%15 = scf.for %arg3 = %c0_i64 to %1 step %c1_i64 iter_args(%arg4 = %14) -> (tensor<?xf64>) : i64 {
%19 = arith.cmpi sgt, %arg3, %c0_i64 : i64
%20 = arith.index_cast %arg3 : i64 to index
%extracted_4 = tensor.extract %arg0[%13, %20] : tensor<?x?xf64>
%21 = arith.divf %extracted_4, %extracted : f64
%22 = scf.if %19 -> (f64) {
%23 = arith.subi %arg3, %c1_i64 : i64
%24 = arith.index_cast %23 : i64 to index
%extracted_6 = tensor.extract %arg4[%24] : tensor<?xf64>
%25 = arith.addf %21, %extracted_6 : f64
scf.yield %25 : f64
} else {
scf.yield %21 : f64
}
%inserted_5 = tensor.insert %22 into %arg4[%20] : tensor<?xf64>
scf.yield %inserted_5 : tensor<?xf64>
}
%global_seed = ml_program.global_load @global_seed : tensor<i64>
%extracted_3 = tensor.extract %global_seed[] : tensor<i64>
%16 = arith.muli %extracted_3, %c6364136223846793005_i64 : i64
%17 = arith.addi %16, %c1442695040888963407_i64 : i64
%inserted = tensor.insert %17 into %global_seed[] : tensor<i64>
ml_program.global_store @global_seed = %inserted : tensor<i64>
%18 = scf.for %arg3 = %c0_i64 to %c1048576_i64 step %c1_i64 iter_args(%arg4 = %arg2) -> (tensor<?x1048576xi64>) : i64 {
%19 = arith.muli %arg3, %17 : i64
%20 = arith.addi %19, %17 : i64
%21 = arith.muli %19, %19 : i64
%22 = arith.addi %21, %19 : i64
%23 = arith.shli %22, %c32_i64 : i64
%24 = arith.shrui %22, %c32_i64 : i64
%25 = arith.ori %23, %24 : i64
%26 = arith.muli %25, %25 : i64
%27 = arith.addi %26, %20 : i64
%28 = arith.shli %27, %c32_i64 : i64
%29 = arith.shrui %27, %c32_i64 : i64
%30 = arith.ori %28, %29 : i64
%31 = arith.muli %30, %30 : i64
%32 = arith.addi %31, %19 : i64
%33 = arith.shli %32, %c32_i64 : i64
%34 = arith.shrui %32, %c32_i64 : i64
%35 = arith.ori %33, %34 : i64
%36 = arith.muli %35, %35 : i64
%37 = arith.addi %36, %20 : i64
%38 = arith.shli %37, %c32_i64 : i64
%39 = arith.shrui %37, %c32_i64 : i64
%40 = arith.ori %38, %39 : i64
%41 = arith.muli %40, %40 : i64
%42 = arith.addi %41, %19 : i64
%43 = arith.shrui %42, %c32_i64 : i64
%44 = arith.xori %37, %43 : i64
%45 = arith.uitofp %44 : i64 to f64
%46 = arith.mulf %45, %cst_0 : f64
%47 = arith.addf %46, %cst : f64
%48:2 = scf.while (%arg5 = %c0_i64, %arg6 = %1) : (i64, i64) -> (i64, i64) {
%50 = arith.cmpi sgt, %arg6, %arg5 : i64
scf.condition(%50) %arg5, %arg6 : i64, i64
} do {
^bb0(%arg5: i64, %arg6: i64):
%50 = arith.subi %arg6, %arg5 : i64
%51 = arith.divsi %50, %c2_i64 : i64
%52 = arith.addi %arg5, %51 : i64
%53 = arith.index_cast %52 : i64 to index
%extracted_5 = tensor.extract %15[%53] : tensor<?xf64>
%54 = arith.cmpf olt, %extracted_5, %47 : f64
%55 = arith.select %54, %arg6, %52 : i64
%56 = scf.if %54 -> (i64) {
%57 = arith.addi %52, %c1_i64 : i64
scf.yield %57 : i64
} else {
scf.yield %arg5 : i64
}
scf.yield %56, %55 : i64, i64
}
%49 = arith.index_cast %arg3 : i64 to index
%inserted_4 = tensor.insert %48#0 into %arg4[%13, %49] : tensor<?x1048576xi64>
scf.yield %inserted_4 : tensor<?x1048576xi64>
}
scf.yield %18 : tensor<?x1048576xi64>
}
%7 = tensor.empty() : tensor<f64>
%8 = linalg.fill ins(%cst : f64) outs(%7 : tensor<f64>) -> tensor<f64>
%9 = linalg.generic {indexing_maps = [#map, #map2], iterator_types = ["reduction", "reduction"]} ins(%6 : tensor<?x1048576xi64>) outs(%8 : tensor<f64>) {
^bb0(%in: i64, %out: f64):
%13 = arith.sitofp %in : i64 to f64
%14 = arith.addf %13, %out : f64
linalg.yield %14 : f64
} -> tensor<f64>
%dim_2 = tensor.dim %6, %c0 : tensor<?x1048576xi64>
%10 = arith.muli %dim_2, %c1048576 : index
%11 = arith.index_cast %10 : index to i64
%12 = linalg.generic {indexing_maps = [#map3, #map3], iterator_types = []} ins(%9 : tensor<f64>) outs(%7 : tensor<f64>) {
^bb0(%in: f64, %out: f64):
%13 = arith.sitofp %11 : i64 to f64
%14 = arith.divf %in, %13 : f64
linalg.yield %14 : f64
} -> tensor<f64>
return %12 : tensor<f64>
}
}
repro:
iree-compile --iree-hal-target-backends=llvm-cpu repro.mlir -o repro.vmfb
hits an assertion:
iree-compile: iree/compiler/src/iree/compiler/Dialect/Util/Transforms/HoistIntoGlobals.cpp:105: auto mlir::iree_compiler::IREE::Util::(anonymous namespace)::HoistIntoGlobalsPass::runOnOperation()::(anonymous class)::operator()(mlir::Operation *) const: Assertion `resultInfo && "must have const-expr info"' failed.
Ok... we haven't looked at programs like this. I have no idea what happens when we compile this. It might take some effort. It would be good to get a signal on the priority of this.
Assigned it to me for now. It will take me a bit to look at this, but I can reprioritize if needed.
It's a failure in one of the migraphx models, so I imagine it's moderate to high priority.
Ok, but this seems like it is pretty much lowered to scalar code from the outset... I am not sure what the computation is and whether we can lower it to something like linalg. All the scf.if and scf.for at this level is not what you want to feed into IREE. There isn't anything really good we can do with such code (and yes, we should be able to compile it and get shitty performance, but that will take cycles).
Yeah, that loop is going to be a problem (it's going to run entirely on the host in the VM interpreter). Tensorizing may help a bit, but it really just shouldn't be scalarized like that.
For context:
The outermost scf loop is essentially over batches. Inside that, the first scf loop is doing a cumulative sum of the input distribution for the given batch. Maybe this could be converted to a TM tensor scan op instead. This step is done to generate a cumulative distribution function from the input distribution.
Afterwards, for each desired sample, it computes a "random" probability value from [0,1], then performs a binary search through the CDF outputs to find the last event whose cumulative probability is less than or equal to the randomly selected probability value.
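Roughly, as a Python sketch (the names here are just illustrative, and the real IR uses its own seeded integer hashing rather than random.random()):

import bisect
import random

def multinomial_one_row(weights, num_samples):
    # Build the CDF: running sum of the normalized input distribution.
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    # For each sample, draw u in [0, 1) and binary-search the CDF for
    # the first bucket whose cumulative probability reaches u.
    return [bisect.bisect_left(cdf, random.random()) for _ in range(num_samples)]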
I'd start by looking at converting the ops if there are equivalents - it's absolutely required that those loops end up inside a dispatch region, and doing that with linalg ops (or things that lower to them or something adjacent) is pretty much the only way. Basic loops around tensor operations can be ok (though we have some things to do to make those better), but doing global stores inside the loop and any tensor.extract/tensor.insert_slice is in "it'd be faster to do this work on pen and paper" territory and a big red flag when looking at models. You can get in the habit of grepping for tensor.extract, and if you find any of them, know that your performance is going to be somewhere between bad and indistinguishable-from-hangs :P
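For instance (the dump file name is just a placeholder):

grep -n 'tensor\.extract' dump.mlir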
Closing this one; opened https://github.com/nod-ai/SHARK-ModelDev/issues/854 to track the front-end issue.
What happened?
For the given IR, I get the following error.
IREE Version: IREE compiler version 20240819.990 @ aeda14995f16ed1302db616adf0c03acf80f27ee LLVM version 20.0.0git
Steps to reproduce your issue
Command to reproduce the issue:
What component(s) does this issue relate to?
Compiler