Closed dcaballe closed 1 month ago
Looks like it also regresses MobileBert on x86 in single-threaded mode
@hanhanW has anybody started root-causing this? I just marked it P0
People are busy on many things recently. No one started on this, AFAIK.
@mattwalsh for visibility as a P0 issue
From triage meeting: Assigning to Diego to review when he returns this Thursday.
@julianwa and @dcaballe I saw a draft of a more complete regression analysis (with a task list of all of the issues filed). Is that ready to promote here?
It looks like the regression is due to differences in the imported MLIRs, so the root cause might not be the LLVM bump but the tensorflow bump.
Here are the imported MLIRs of TFLite MobileBertSquad_fp32 before and after the integration:
Before the integration (51dbeb852a837e53e5724a1ed31ef1336ec3ccb4): https://storage.googleapis.com/iree-github-actions-postsubmit-artifacts/4599127855/1/e2e-test-artifacts/iree_MobileBertSquad_fp32_af95be67ed750381753ca4fe4341853e7e2884d4c4bd1dd9a17d8ceab3ee8882.mlir
After the integration (31bbc9597713acd856a806fae56545e3e148f8cc): https://storage.googleapis.com/iree-github-actions-postsubmit-artifacts/4600147158/1/e2e-test-artifacts/iree_MobileBertSquad_fp32_af95be67ed750381753ca4fe4341853e7e2884d4c4bd1dd9a17d8ceab3ee8882.mlir
I used the iree-compile built at the integration commit to compile both MLIRs and got a 38% difference in total dispatch size.
Build command:
iree-compile input.mlir -o output.vmfb --iree-hal-target-backends=llvm-cpu --iree-input-type=tosa --iree-llvmcpu-target-triple=riscv64-pc-linux-gnu --iree-llvmcpu-target-cpu=generic-rv64 --iree-llvmcpu-target-abi=lp64d --iree-llvmcpu-target-cpu-features=+m,+a,+f,+d,+zvl512b,+v --riscv-v-fixed-length-vector-lmul-max=8 --iree-vm-emit-polyglot-zip=true --iree-llvmcpu-debug-symbols=false -mlir-disable-threading
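Since the build above passes --iree-vm-emit-polyglot-zip=true, the resulting .vmfb should be readable as a zip archive, which gives one quick (hedged) way to compare the sizes of the embedded artifacts between the two builds. A minimal Python sketch; total_member_size is a hypothetical helper name, not an IREE API:

```python
import zipfile

def total_member_size(vmfb_path):
    """Sum the uncompressed sizes of all members embedded in a
    polyglot-zip .vmfb (emitted with --iree-vm-emit-polyglot-zip=true)."""
    with zipfile.ZipFile(vmfb_path) as z:
        return sum(info.file_size for info in z.infolist())

# Hypothetical usage comparing the two builds:
# before = total_member_size("before.vmfb")
# after = total_member_size("after.vmfb")
# print(f"size delta: {100 * (after - before) / before:.1f}%")
```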
Diffing the imported MLIRs, https://github.com/tensorflow/tensorflow/commit/e0e966b6873b6ac8d09e50038fa6688d4161e220 looks suspicious: it changes the way the softmax op is converted into tosa.
I built iree-import-tflite with that commit reverted in tensorflow, imported the model into MLIR, and compiled it. The total dispatch size returns to 26056 bytes, which supports that guess.
Before that commit, the importer converts the softmax op into:
op1 = tosa.exp(logits)
op2 = tosa.reduce_sum(op1)
op3 = tosa.reciprocal(op2)
op4 = tosa.mul(op1, op3)
After that commit, it generates an extra tosa.reduce_max and tosa.sub:
op1 = tosa.reduce_max(logits)
op2 = tosa.sub(logits, op1)
op3 = tosa.exp(op2)
op4 = tosa.reduce_sum(op3)
op5 = tosa.reciprocal(op4)
op6 = tosa.mul(op3, op5)
I haven't fully understood why that change was made on the tensorflow side; the reason appears to be correctness.
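For context (my own illustration, not from this thread): subtracting the per-row max is the standard numerically stable softmax formulation. The two forms are mathematically identical, but without the subtraction, exp overflows for large logits. A minimal Python sketch mirroring the two tosa lowerings above:

```python
import math

def softmax_naive(logits):
    # Old lowering: exp -> reduce_sum -> reciprocal -> mul.
    # math.exp raises OverflowError for large inputs (inf/nan in f32 hardware).
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e * (1.0 / total) for e in exps]

def softmax_stable(logits):
    # New lowering: reduce_max -> sub -> exp -> reduce_sum -> reciprocal -> mul.
    # Every exp argument is <= 0, so exp never overflows.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e * (1.0 / total) for e in exps]
```

Both give the same answer on moderate inputs, but only the stable form survives logits like [1000.0, 1001.0].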
I think that explains the size increase and the performance regression. It seems the problem is not in the integration itself; instead we should look at generating better code for that pattern, or generating better tosa for the softmax op.
I think we can file a separate issue to improve the performance of the tensorflow softmax op (and re-prioritize it), then close this issue as intended behavior. @dcaballe would you mind suggesting the next step?
Thanks for looking into this! Is there an easy way we can measure the performance impact of reverting that commit? (I'm not sure if it's too late to just revert that commit at ToT and create a draft PR to run the CI benchmarks...) That would help us be 100% sure that that commit is solely responsible for the regression. If it's too complicated, forget about it.
The analysis makes sense. If we add extra computation (reduction + sub ops), the execution time will increase. It's not clear to me whether we are doing anything wrong or it's just the extra computation. Could you do a -mlir-print-ir-after-all and post the IR after vectorization? Perhaps we distribute these ops across multiple dispatches. That should help with next steps.
For the rationale behind the tosa change, maybe @rsuderman or @jpienaar can help?
The change was a correctness fix addressing numerical stability, so I doubt we can revert it.
After the integration, the pass iree-flow-raise-special-ops actually recognizes the new, correct pattern of tosa ops and transforms it into iree_linalg_ext.softmax. Since all the iree_linalg_ext.softmax ops have the same shape, they turn into a single dispatch:
func.func @main_dispatch_18() {
%c5308416 = arith.constant 5308416 : index
%c1769472 = arith.constant 1769472 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
%2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x384x384xf32>
%3 = tensor.empty() : tensor<4x384x384xf32>
%4 = iree_linalg_ext.softmax dimension(2) ins(%2 : tensor<4x384x384xf32>) outs(%3 : tensor<4x384x384xf32>) -> tensor<4x384x384xf32>
flow.dispatch.tensor.store %4, %1, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : tensor<4x384x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
return
}
It is then decomposed into:
// -----// IR Dump After DecomposeSoftmax (iree-linalg-ext-decompose-softmax) //----- //
func.func @main_dispatch_18() {
%c5308416 = arith.constant 5308416 : index
%c1769472 = arith.constant 1769472 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
%2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x384x384xf32>
%3 = tensor.empty() : tensor<4x384x384xf32>
%4 = tensor.empty() : tensor<4x384x384xf32>
%5 = tensor.empty() : tensor<4x384xf32>
%cst = arith.constant -1.000000e+30 : f32
%6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<4x384xf32>) -> tensor<4x384xf32>
%7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%2 : tensor<4x384x384xf32>) outs(%6 : tensor<4x384xf32>) {
^bb0(%in: f32, %out: f32):
%12 = arith.maxf %in, %out : f32
linalg.yield %12 : f32
} -> tensor<4x384xf32>
%8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%2, %7 : tensor<4x384x384xf32>, tensor<4x384xf32>) outs(%4 : tensor<4x384x384xf32>) {
^bb0(%in: f32, %in_1: f32, %out: f32):
%12 = arith.subf %in, %in_1 : f32
%13 = math.exp %12 : f32
linalg.yield %13 : f32
} -> tensor<4x384x384xf32>
%cst_0 = arith.constant 0.000000e+00 : f32
%9 = linalg.fill ins(%cst_0 : f32) outs(%5 : tensor<4x384xf32>) -> tensor<4x384xf32>
%10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%8 : tensor<4x384x384xf32>) outs(%9 : tensor<4x384xf32>) {
^bb0(%in: f32, %out: f32):
%12 = arith.addf %in, %out : f32
linalg.yield %12 : f32
} -> tensor<4x384xf32>
%11 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%8, %10 : tensor<4x384x384xf32>, tensor<4x384xf32>) outs(%4 : tensor<4x384x384xf32>) {
^bb0(%in: f32, %in_1: f32, %out: f32):
%12 = arith.divf %in, %in_1 : f32
linalg.yield %12 : f32
} -> tensor<4x384x384xf32>
flow.dispatch.tensor.store %11, %1, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : tensor<4x384x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
return
}
Then it is vectorized into:
----- LinalgStrategyVectorizePass 3437914 -----
func.func @main_dispatch_18() {
%cst = arith.constant dense<-1.000000e+30> : vector<1x8xf32>
%cst_0 = arith.constant dense<0.000000e+00> : vector<1x8xf32>
%c32 = arith.constant 32 : index
%c1 = arith.constant 1 : index
%c8 = arith.constant 8 : index
%c0 = arith.constant 0 : index
%c384 = arith.constant 384 : index
%c4 = arith.constant 4 : index
%cst_1 = arith.constant 0.000000e+00 : f32
%c5308416 = arith.constant 5308416 : index
%c1769472 = arith.constant 1769472 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%2 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
%3 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
scf.for %arg0 = %2 to %c4 step %3 {
%4 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_id_x]
%5 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_count_x]
scf.for %arg1 = %4 to %c384 step %5 {
%6 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>> -> tensor<4x32x384xf32>
%7 = flow.dispatch.tensor.load %0, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x32x384xf32>
%8 = scf.for %arg2 = %c0 to %c4 step %c1 iter_args(%arg3 = %6) -> (tensor<4x32x384xf32>) {
%9 = scf.for %arg4 = %c0 to %c32 step %c8 iter_args(%arg5 = %arg3) -> (tensor<4x32x384xf32>) {
%10 = tensor.empty() : tensor<1x8xf32>
%11 = vector.transfer_write %cst, %10[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
%12 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %11) -> (tensor<1x8xf32>) {
%16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
%17 = vector.transfer_read %arg7[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
%18 = vector.multi_reduction <maxf>, %16, %17 [2] : vector<1x8x32xf32> to vector<1x8xf32>
%19 = vector.transfer_write %18, %arg7[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
scf.yield %19 : tensor<1x8xf32>
}
%13 = vector.transfer_write %cst_0, %10[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
%14 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %13) -> (tensor<1x8xf32>) {
%16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
%17 = vector.transfer_read %12[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
%18 = vector.broadcast %17 : vector<1x8xf32> to vector<32x1x8xf32>
%19 = vector.transpose %18, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
%20 = vector.transfer_read %arg7[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
%21 = arith.subf %16, %19 : vector<1x8x32xf32>
%22 = math.exp %21 : vector<1x8x32xf32>
%23 = vector.multi_reduction <add>, %22, %20 [2] : vector<1x8x32xf32> to vector<1x8xf32>
%24 = vector.transfer_write %23, %arg7[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
scf.yield %24 : tensor<1x8xf32>
}
%extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4, 0] [1, 8, 384] [1, 1, 1] : tensor<4x32x384xf32> to tensor<1x8x384xf32>
%15 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %extracted_slice) -> (tensor<1x8x384xf32>) {
%16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
%17 = vector.transfer_read %12[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
%18 = vector.broadcast %17 : vector<1x8xf32> to vector<32x1x8xf32>
%19 = vector.transpose %18, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
%20 = vector.transfer_read %14[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
%21 = vector.broadcast %20 : vector<1x8xf32> to vector<32x1x8xf32>
%22 = vector.transpose %21, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
%23 = arith.subf %16, %19 : vector<1x8x32xf32>
%24 = math.exp %23 : vector<1x8x32xf32>
%25 = arith.divf %24, %22 : vector<1x8x32xf32>
%26 = vector.transfer_write %25, %arg7[%c0, %c0, %arg6] {in_bounds = [true, true, true]} : vector<1x8x32xf32>, tensor<1x8x384xf32>
scf.yield %26 : tensor<1x8x384xf32>
}
%inserted_slice = tensor.insert_slice %15 into %arg5[%arg2, %arg4, 0] [1, 8, 384] [1, 1, 1] : tensor<1x8x384xf32> into tensor<4x32x384xf32>
scf.yield %inserted_slice : tensor<4x32x384xf32>
}
scf.yield %9 : tensor<4x32x384xf32>
}
flow.dispatch.tensor.store %8, %1, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : tensor<4x32x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
}
}
return
}
I also noticed that IREE didn't recognize the previous, incorrect softmax pattern from tensorflow, so the compiler didn't generate iree_linalg_ext.softmax and just did normal dispatch formation to fuse those ops (instead of creating one big dispatch for the softmax, it previously generated 3 dispatches, but fused them with their producers). Here are the related dispatches generated from the previous softmax pattern (the first arith.addf is the producer of the softmax op):
%69 = flow.dispatch.region[%65, %66, %67, %68] -> (tensor<4x384x1x384xf32>) {
%2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3, d1)>, affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%expanded_1075, %9 : tensor<4x1x384x384xf32>, tensor<384x1x384xf32>) outs(%64 : tensor<4x384x1x384xf32>) {
^bb0(%in: f32, %in_7714: f32, %out: f32):
%2567 = arith.addf %in, %in_7714 : f32
%2568 = math.exp %2567 : f32
linalg.yield %2568 : f32
} -> tensor<4x384x1x384xf32>
flow.return %2566 : tensor<4x384x1x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6
flow.return %x, %y, %z : index, index, index
}
%collapsed_1089 = tensor.collapse_shape %69 [[0], [1], [2, 3]] : tensor<4x384x1x384xf32> into tensor<4x384x384xf32>
%70 = tensor.empty() : tensor<4x384xf32>
%71 = linalg.fill ins(%cst_27 : f32) outs(%70 : tensor<4x384xf32>) -> tensor<4x384xf32>
%c1_1090 = arith.constant 1 : index
%c0_1091 = arith.constant 0 : index
%c4_1092 = arith.constant 4 : index
%c1_1093 = arith.constant 1 : index
%72 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1091, %c4_1092, %c1_1093]
%c0_1094 = arith.constant 0 : index
%c384_1095 = arith.constant 384 : index
%c1_1096 = arith.constant 1 : index
%73 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1094, %c384_1095, %c1_1096]
%c0_1097 = arith.constant 0 : index
%c1_1098 = arith.constant 1 : index
%74 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1097, %c1_1090, %c1_1098]
%75 = flow.dispatch.region[%72, %73, %74] -> (tensor<4x384xf32>) {
%2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%collapsed_1089 : tensor<4x384x384xf32>) outs(%71 : tensor<4x384xf32>) {
^bb0(%in: f32, %out: f32):
%2567 = arith.addf %in, %out : f32
linalg.yield %2567 : f32
} -> tensor<4x384xf32>
flow.return %2566 : tensor<4x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5
flow.return %x, %y, %z : index, index, index
}
%c1_1099 = arith.constant 1 : index
%c0_1100 = arith.constant 0 : index
%c4_1101 = arith.constant 4 : index
%c1_1102 = arith.constant 1 : index
%76 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1100, %c4_1101, %c1_1102]
%c0_1103 = arith.constant 0 : index
%c384_1104 = arith.constant 384 : index
%c1_1105 = arith.constant 1 : index
%77 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1103, %c384_1104, %c1_1105]
%c0_1106 = arith.constant 0 : index
%c1_1107 = arith.constant 1 : index
%c1_1108 = arith.constant 1 : index
%78 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1106, %c1_1107, %c1_1108]
%c0_1109 = arith.constant 0 : index
%c384_1110 = arith.constant 384 : index
%c1_1111 = arith.constant 1 : index
%79 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1109, %c384_1110, %c1_1111]
%80 = flow.dispatch.region[%76, %77, %78, %79] -> (tensor<4x384x1x384xf32>) {
%2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%69, %75 : tensor<4x384x1x384xf32>, tensor<4x384xf32>) outs(%64 : tensor<4x384x1x384xf32>) {
^bb0(%in: f32, %in_7714: f32, %out: f32):
%2567 = arith.divf %cst_25, %in_7714 : f32
%2568 = arith.mulf %in, %2567 : f32
linalg.yield %2568 : f32
} -> tensor<4x384x1x384xf32>
flow.return %2566 : tensor<4x384x1x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6
flow.return %x, %y, %z : index, index, index
}
> Is there an easy way we can measure the performance impact of reverting that commit?
Since the x86_64 regression isn't significant, I'll test with the RISC-V benchmarks and post the results here.
Interesting, so you are saying it is due to the special case being recognized now that there is a regression?
> Is there an easy way we can measure the performance impact of reverting that commit?
Tested with RISC-V. With that tensorflow commit reverted, the latency is the same as the run before the integration, which confirms that the tensorflow commit causes the regression.
@dcaballe or @hcindyl for input. Cindy, I am not sure if this is something you have seen/tracked/are concerned about. The regression on RISC-V seems to be because softmax is now fused into a single dispatch, which is actually the expected state. Some input on how to handle RISC-V would be useful. If such special-case matching is not recommended on RISC-V, we can consider disabling it for RISC-V. Ideally, though, I would put this into the bucket of RISC-V issues.
> Interesting, so you are saying it is due to the special case being recognized now that there is a regression?
Not totally. I now tested with iree-flow-raise-special-ops disabled (so no iree_linalg_ext.softmax). It only improves the binary size a little (36216 -> 31816 bytes, ~13%), because we still need to generate code for the two extra tosa.reduce_max and tosa.sub ops. The performance also only improves a little (~1.1x faster on RISC-V).
I'm not sure how much room is left to improve; pushing this further will need more investigation.
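As a rough back-of-envelope (my own hypothetical sketch, not IREE's cost model): counting passes over the large logits tensor, and assuming a memory-bound kernel where adjacent element-wise ops fuse, the stable lowering adds one extra read pass for the max reduction:

```python
# Hypothetical pass counts over the large NxK tensor (my own sketch, not
# IREE's cost model); traffic on the small reduced tensors is ignored.

def naive_passes():
    # exp (read + write) -> reduce_sum (read) -> reciprocal*mul (read + write)
    return 2 + 1 + 2  # 5 passes

def stable_passes():
    # reduce_max (read) -> sub+exp (read + write)
    # -> reduce_sum (read) -> reciprocal*mul (read + write)
    return 1 + 2 + 1 + 2  # 6 passes

extra = (stable_passes() - naive_passes()) / naive_passes()
print(f"~{extra:.0%} extra memory traffic under these assumptions")
```

That alone falls well short of the observed regression; one plausible reason the gap is larger is that reductions typically vectorize less efficiently than pure element-wise work.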
Great findings! Since the x86 regression seems to be justified by the extra computation, I would suggest that we stop here for now and move this to P2. If you are already running on the FPGA, perhaps it's worth quickly running perf and sharing the archived profiles internally. You can also post the reference commit you are using and we can follow up later. Thanks a lot for the analysis!
I would defer this to @dcaballe; however, I am curious to see the perf of other models that contain softmax, such as https://github.com/openxla/iree/blob/ea0b28fe850a06123383cb1cf7671995362c06b1/build_tools/python/e2e_test_framework/models/tflite_models.py#L111-142. Is the hit only in the transformer encoder arch + RISC-V (from https://github.com/openxla/iree/issues/12916#issuecomment-1518242653 it seems x86 also takes a hit)?
@pzread @dcaballe Any update on this P0?
Moving this to P2 as the x86 side is resolved. Only a better understanding of the RISC-V side is pending.
Nothing actionable here, so closing.
Around 38% slowdown and 38% code size increase on MobileBert after https://github.com/openxla/iree/pull/12894.