iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

[RISC-V] ~38% slowdown on MobileBert after #12894 integrate #12916

Closed dcaballe closed 1 month ago

dcaballe commented 1 year ago

Around 38% slowdown and 38% code size increase on MobileBert after https://github.com/openxla/iree/pull/12894.

hanhanW commented 1 year ago

Looks like it also regresses MobileBert on x86 for single-threaded runs.

julianwa commented 1 year ago

@hanhanW has anybody started root-causing this? I just marked it P0

hanhanW commented 1 year ago

People have been busy with many things recently. No one has started on this, AFAIK.

allieculp commented 1 year ago

@mattwalsh for visibility as a P0 issue

allieculp commented 1 year ago

From triage meeting: Assigning to Diego to review when he returns this Thursday.

aaron-schneider commented 1 year ago

@julianwa and @dcaballe I saw a draft of a more complete regression analysis (with a task list of all of the issues filed). Is that ready to promote here?

dcaballe commented 1 year ago

https://github.com/orgs/openxla/projects/13/views/30?query=is%3Aopen+sort%3Aupdated-desc&pane=issue&itemId=25883714

pzread commented 1 year ago

It looks like the regression is due to differences in the imported MLIRs, so the root cause might not be the LLVM bump but the TensorFlow bump.

Here are the imported MLIRs of TFLite MobileBertSquad_fp32 before and after the integration:

Before the integration (51dbeb852a837e53e5724a1ed31ef1336ec3ccb4): https://storage.googleapis.com/iree-github-actions-postsubmit-artifacts/4599127855/1/e2e-test-artifacts/iree_MobileBertSquad_fp32_af95be67ed750381753ca4fe4341853e7e2884d4c4bd1dd9a17d8ceab3ee8882.mlir

After the integration (31bbc9597713acd856a806fae56545e3e148f8cc): https://storage.googleapis.com/iree-github-actions-postsubmit-artifacts/4600147158/1/e2e-test-artifacts/iree_MobileBertSquad_fp32_af95be67ed750381753ca4fe4341853e7e2884d4c4bd1dd9a17d8ceab3ee8882.mlir

I used the iree-compile built at the integration commit to compile both MLIRs and got a 38% difference in total dispatch size:

Build command:

iree-compile input.mlir -o output.vmfb --iree-hal-target-backends=llvm-cpu --iree-input-type=tosa --iree-llvmcpu-target-triple=riscv64-pc-linux-gnu --iree-llvmcpu-target-cpu=generic-rv64 --iree-llvmcpu-target-abi=lp64d --iree-llvmcpu-target-cpu-features=+m,+a,+f,+d,+zvl512b,+v --riscv-v-fixed-length-vector-lmul-max=8 --iree-vm-emit-polyglot-zip=true --iree-llvmcpu-debug-symbols=false -mlir-disable-threading
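
As a rough illustration of how the size comparison can be reproduced (not necessarily how the benchmark tooling computes it): since --iree-vm-emit-polyglot-zip=true makes the .vmfb readable as a zip archive, the embedded artifact sizes can be listed with a few lines of Python:

import sys
import zipfile

# List the entries embedded in a polyglot-zip .vmfb and sum their sizes.
# Usage: python vmfb_sizes.py output.vmfb
with zipfile.ZipFile(sys.argv[1]) as vmfb:
    total = 0
    for info in vmfb.infolist():
        print(f"{info.file_size:>10}  {info.filename}")
        total += info.file_size
    print(f"{total:>10}  total")
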
pzread commented 1 year ago

Diffing the imported MLIRs, https://github.com/tensorflow/tensorflow/commit/e0e966b6873b6ac8d09e50038fa6688d4161e220 looks suspicious. It changes the way the softmax op is converted into TOSA.

I built iree-import-tflite with that commit reverted in TensorFlow, imported the model into MLIR, and compiled it. The total dispatch size returned to 26056 bytes, which supports this guess.

Before that commit, the importer converts the softmax op into:

op1 = tosa.exp(logits)
op2 = tosa.reduce_sum(op1)
op3 = tosa.reciprocal(op2)
op4 = tosa.mul(op1, op3)

After that commit, it generates an extra tosa.reduce_max and tosa.sub:

op1 = tosa.reduce_max(logits)
op2 = tosa.sub(logits, op1)
op3 = tosa.exp(op2)
op4 = tosa.reduce_sum(op3)
op5 = tosa.reciprocal(op4)
op6 = tosa.mul(op3, op5)

I haven't fully understood why they made that change on the TensorFlow side. The reason seems to be correctness.

I think that explains the size increase and the performance regression. The problem does not seem to be in the integration itself. Instead, we should look at how to generate better code for that pattern, or how to generate better TOSA for the softmax op.
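
For illustration, here is a minimal NumPy sketch of the two lowerings; the extra reduce_max/sub pair is the standard max-subtraction trick that keeps exp from overflowing for large logits:

import numpy as np

logits = np.array([10.0, 1000.0, 2.0], dtype=np.float32)

# Old pattern: exp -> reduce_sum -> reciprocal -> mul.
# exp(1000) overflows float32 to inf, and inf * (1/inf) produces nan.
naive = np.exp(logits) * (1.0 / np.sum(np.exp(logits)))

# New pattern: reduce_max -> sub -> exp -> reduce_sum -> reciprocal -> mul.
# Subtracting the max keeps every exponent argument <= 0, so exp stays finite.
shifted = logits - np.max(logits)
stable = np.exp(shifted) * (1.0 / np.sum(np.exp(shifted)))

print(naive)   # [ 0. nan  0.]
print(stable)  # [0. 1. 0.]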

pzread commented 1 year ago

I think we can create another issue to improve performance for the TensorFlow softmax op (and re-prioritize), then close this issue as intended behavior. @dcaballe, would you mind suggesting the next steps?

dcaballe commented 1 year ago

Thanks for looking into this! Is there an easy way we can measure the performance impact of reverting that commit? (I'm not sure if it's too late to just revert that commit at ToT and create a draft PR to run the CI benchmarks...) That would help us be 100% sure that that commit is the only one responsible for the regression. If it's too complicated, forget about it.

The analysis makes sense. If we add extra computation (reduction + sub ops), the execution time will increase. It's not clear to me whether we are doing anything wrong or it's just the extra computation. Could you run with -mlir-print-ir-after-all and post the IR after vectorization? Perhaps we distribute these ops across multiple dispatches. That should help with next steps.

For the rationale behind the TOSA change, maybe @rsuderman or @jpienaar can help?

jpienaar commented 1 year ago

The change was a correctness one addressing numerical stability, so I doubt we can revert it.

pzread commented 1 year ago

After the integration, the iree-flow-raise-special-ops pass actually recognizes the new, correct pattern of TOSA ops and transforms it into iree_linalg_ext.softmax. Since all the iree_linalg_ext.softmax ops have the same shape, they turn into a single dispatch:

func.func @main_dispatch_18() {
  %c5308416 = arith.constant 5308416 : index
  %c1769472 = arith.constant 1769472 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  %2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x384x384xf32>
  %3 = tensor.empty() : tensor<4x384x384xf32>
  %4 = iree_linalg_ext.softmax dimension(2) ins(%2 : tensor<4x384x384xf32>) outs(%3 : tensor<4x384x384xf32>) -> tensor<4x384x384xf32>
  flow.dispatch.tensor.store %4, %1, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : tensor<4x384x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  return
}

It is then decomposed into:

// -----// IR Dump After DecomposeSoftmax (iree-linalg-ext-decompose-softmax) //----- //
func.func @main_dispatch_18() {
  %c5308416 = arith.constant 5308416 : index
  %c1769472 = arith.constant 1769472 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  %2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x384x384xf32>
  %3 = tensor.empty() : tensor<4x384x384xf32>
  %4 = tensor.empty() : tensor<4x384x384xf32>
  %5 = tensor.empty() : tensor<4x384xf32>
  %cst = arith.constant -1.000000e+30 : f32
  %6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<4x384xf32>) -> tensor<4x384xf32>
  %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%2 : tensor<4x384x384xf32>) outs(%6 : tensor<4x384xf32>) {
  ^bb0(%in: f32, %out: f32):
    %12 = arith.maxf %in, %out : f32
    linalg.yield %12 : f32
  } -> tensor<4x384xf32>
  %8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%2, %7 : tensor<4x384x384xf32>, tensor<4x384xf32>) outs(%4 : tensor<4x384x384xf32>) {
  ^bb0(%in: f32, %in_1: f32, %out: f32):
    %12 = arith.subf %in, %in_1 : f32
    %13 = math.exp %12 : f32
    linalg.yield %13 : f32
  } -> tensor<4x384x384xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  %9 = linalg.fill ins(%cst_0 : f32) outs(%5 : tensor<4x384xf32>) -> tensor<4x384xf32>
  %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%8 : tensor<4x384x384xf32>) outs(%9 : tensor<4x384xf32>) {
  ^bb0(%in: f32, %out: f32):
    %12 = arith.addf %in, %out : f32
    linalg.yield %12 : f32
  } -> tensor<4x384xf32>
  %11 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%8, %10 : tensor<4x384x384xf32>, tensor<4x384xf32>) outs(%4 : tensor<4x384x384xf32>) {
  ^bb0(%in: f32, %in_1: f32, %out: f32):
    %12 = arith.divf %in, %in_1 : f32
    linalg.yield %12 : f32
  } -> tensor<4x384x384xf32>
  flow.dispatch.tensor.store %11, %1, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : tensor<4x384x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  return
}

It is then vectorized into the IR below; a simplified scalar sketch of the resulting loop structure follows the dump:

----- LinalgStrategyVectorizePass 3437914 -----
func.func @main_dispatch_18() {
  %cst = arith.constant dense<-1.000000e+30> : vector<1x8xf32>
  %cst_0 = arith.constant dense<0.000000e+00> : vector<1x8xf32>
  %c32 = arith.constant 32 : index
  %c1 = arith.constant 1 : index
  %c8 = arith.constant 8 : index
  %c0 = arith.constant 0 : index
  %c384 = arith.constant 384 : index
  %c4 = arith.constant 4 : index
  %cst_1 = arith.constant 0.000000e+00 : f32
  %c5308416 = arith.constant 5308416 : index
  %c1769472 = arith.constant 1769472 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %2 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
  %3 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
  scf.for %arg0 = %2 to %c4 step %3 {
    %4 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_id_x]
    %5 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_count_x]
    scf.for %arg1 = %4 to %c384 step %5 {
      %6 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>> -> tensor<4x32x384xf32>
      %7 = flow.dispatch.tensor.load %0, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x32x384xf32>
      %8 = scf.for %arg2 = %c0 to %c4 step %c1 iter_args(%arg3 = %6) -> (tensor<4x32x384xf32>) {
        %9 = scf.for %arg4 = %c0 to %c32 step %c8 iter_args(%arg5 = %arg3) -> (tensor<4x32x384xf32>) {
          %10 = tensor.empty() : tensor<1x8xf32>
          %11 = vector.transfer_write %cst, %10[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
          %12 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %11) -> (tensor<1x8xf32>) {
            %16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
            %17 = vector.transfer_read %arg7[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %18 = vector.multi_reduction <maxf>, %16, %17 [2] : vector<1x8x32xf32> to vector<1x8xf32>
            %19 = vector.transfer_write %18, %arg7[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
            scf.yield %19 : tensor<1x8xf32>
          }
          %13 = vector.transfer_write %cst_0, %10[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
          %14 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %13) -> (tensor<1x8xf32>) {
            %16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
            %17 = vector.transfer_read %12[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %18 = vector.broadcast %17 : vector<1x8xf32> to vector<32x1x8xf32>
            %19 = vector.transpose %18, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
            %20 = vector.transfer_read %arg7[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %21 = arith.subf %16, %19 : vector<1x8x32xf32>
            %22 = math.exp %21 : vector<1x8x32xf32>
            %23 = vector.multi_reduction <add>, %22, %20 [2] : vector<1x8x32xf32> to vector<1x8xf32>
            %24 = vector.transfer_write %23, %arg7[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
            scf.yield %24 : tensor<1x8xf32>
          }
          %extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4, 0] [1, 8, 384] [1, 1, 1] : tensor<4x32x384xf32> to tensor<1x8x384xf32>
          %15 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %extracted_slice) -> (tensor<1x8x384xf32>) {
            %16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
            %17 = vector.transfer_read %12[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %18 = vector.broadcast %17 : vector<1x8xf32> to vector<32x1x8xf32>
            %19 = vector.transpose %18, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
            %20 = vector.transfer_read %14[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %21 = vector.broadcast %20 : vector<1x8xf32> to vector<32x1x8xf32>
            %22 = vector.transpose %21, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
            %23 = arith.subf %16, %19 : vector<1x8x32xf32>
            %24 = math.exp %23 : vector<1x8x32xf32>
            %25 = arith.divf %24, %22 : vector<1x8x32xf32>
            %26 = vector.transfer_write %25, %arg7[%c0, %c0, %arg6] {in_bounds = [true, true, true]} : vector<1x8x32xf32>, tensor<1x8x384xf32>
            scf.yield %26 : tensor<1x8x384xf32>
          }
          %inserted_slice = tensor.insert_slice %15 into %arg5[%arg2, %arg4, 0] [1, 8, 384] [1, 1, 1] : tensor<1x8x384xf32> into tensor<4x32x384xf32>
          scf.yield %inserted_slice : tensor<4x32x384xf32>
        }
        scf.yield %9 : tensor<4x32x384xf32>
      }
      flow.dispatch.tensor.store %8, %1, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : tensor<4x32x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
    }
  }
  return
}
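
To make the loop structure above easier to read, here is a simplified scalar sketch of what each row now does (plain Python, ignoring tiling and vectorization); note that the input row is traversed three times and exp is evaluated twice, once for the sum and once more for the final divide:

import math

def softmax_row_three_pass(row):
    # Pass 1: reduction to find the row maximum (the maxf loop).
    row_max = -1.0e30
    for x in row:
        row_max = max(row_max, x)
    # Pass 2: re-read the row, compute exp(x - max), accumulate the sum (the addf loop).
    exp_sum = 0.0
    for x in row:
        exp_sum += math.exp(x - row_max)
    # Pass 3: re-read the row, recompute exp(x - max), and divide by the sum.
    return [math.exp(x - row_max) / exp_sum for x in row]

print(softmax_row_three_pass([1.0, 2.0, 3.0]))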

I also noticed that IREE didn't recognize the previous (incorrect) softmax pattern from TensorFlow, so the compiler didn't generate iree_linalg_ext.softmax and just did normal dispatch formation to fuse those ops: instead of creating one big dispatch for the softmax, it previously generated three dispatches, with the first one also fused with the softmax's producer.

Here are the related dispatches generated from the previous softmax pattern (the first arith.addf is the producer of the softmax op); a scalar sketch of this structure follows the IR.

%69 = flow.dispatch.region[%65, %66, %67, %68] -> (tensor<4x384x1x384xf32>) {
  %2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3, d1)>, affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%expanded_1075, %9 : tensor<4x1x384x384xf32>, tensor<384x1x384xf32>) outs(%64 : tensor<4x384x1x384xf32>) {
  ^bb0(%in: f32, %in_7714: f32, %out: f32):
    %2567 = arith.addf %in, %in_7714 : f32
    %2568 = math.exp %2567 : f32
    linalg.yield %2568 : f32
  } -> tensor<4x384x1x384xf32>
  flow.return %2566 : tensor<4x384x1x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index) -> (index, index, index) {
  %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6
  flow.return %x, %y, %z : index, index, index
}
%collapsed_1089 = tensor.collapse_shape %69 [[0], [1], [2, 3]] : tensor<4x384x1x384xf32> into tensor<4x384x384xf32>
%70 = tensor.empty() : tensor<4x384xf32>
%71 = linalg.fill ins(%cst_27 : f32) outs(%70 : tensor<4x384xf32>) -> tensor<4x384xf32>
%c1_1090 = arith.constant 1 : index
%c0_1091 = arith.constant 0 : index
%c4_1092 = arith.constant 4 : index
%c1_1093 = arith.constant 1 : index
%72 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1091, %c4_1092, %c1_1093]
%c0_1094 = arith.constant 0 : index
%c384_1095 = arith.constant 384 : index
%c1_1096 = arith.constant 1 : index
%73 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1094, %c384_1095, %c1_1096]
%c0_1097 = arith.constant 0 : index
%c1_1098 = arith.constant 1 : index
%74 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1097, %c1_1090, %c1_1098]
%75 = flow.dispatch.region[%72, %73, %74] -> (tensor<4x384xf32>) {
  %2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%collapsed_1089 : tensor<4x384x384xf32>) outs(%71 : tensor<4x384xf32>) {
  ^bb0(%in: f32, %out: f32):
    %2567 = arith.addf %in, %out : f32
    linalg.yield %2567 : f32
  } -> tensor<4x384xf32>
  flow.return %2566 : tensor<4x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index) -> (index, index, index) {
  %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5
  flow.return %x, %y, %z : index, index, index
}
%c1_1099 = arith.constant 1 : index
%c0_1100 = arith.constant 0 : index
%c4_1101 = arith.constant 4 : index
%c1_1102 = arith.constant 1 : index
%76 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1100, %c4_1101, %c1_1102]
%c0_1103 = arith.constant 0 : index
%c384_1104 = arith.constant 384 : index
%c1_1105 = arith.constant 1 : index
%77 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1103, %c384_1104, %c1_1105]
%c0_1106 = arith.constant 0 : index
%c1_1107 = arith.constant 1 : index
%c1_1108 = arith.constant 1 : index
%78 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1106, %c1_1107, %c1_1108]
%c0_1109 = arith.constant 0 : index
%c384_1110 = arith.constant 384 : index
%c1_1111 = arith.constant 1 : index
%79 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1109, %c384_1110, %c1_1111]
%80 = flow.dispatch.region[%76, %77, %78, %79] -> (tensor<4x384x1x384xf32>) {
  %2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%69, %75 : tensor<4x384x1x384xf32>, tensor<4x384xf32>) outs(%64 : tensor<4x384x1x384xf32>) {
  ^bb0(%in: f32, %in_7714: f32, %out: f32):
    %2567 = arith.divf %cst_25, %in_7714 : f32
    %2568 = arith.mulf %in, %2567 : f32
    linalg.yield %2568 : f32
  } -> tensor<4x384x1x384xf32>
  flow.return %2566 : tensor<4x384x1x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index) -> (index, index, index) {
  %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6
  flow.return %x, %y, %z : index, index, index
}
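
For contrast, a simplified scalar sketch of the work in the old three-dispatch formation above (again plain Python, ignoring tiling): the producer addf and the exp are fused into the first dispatch, there is no max pass, and the exponentials are materialized between dispatches rather than recomputed:

import math

def old_softmax_three_dispatches(row, bias):
    # Dispatch 1: fused producer add + exp, written out to memory.
    exps = [math.exp(x + b) for x, b in zip(row, bias)]
    # Dispatch 2: sum reduction over the stored exponentials.
    exp_sum = sum(exps)
    # Dispatch 3: fused reciprocal (divf of a constant) + multiply.
    inv = 1.0 / exp_sum
    return [e * inv for e in exps]

print(old_softmax_three_dispatches([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))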

> Is there an easy way we can measure the performance impact of reverting that commit?

Since the x86_64 regression isn't significant, I'll test with the RISC-V benchmarks and post the results here.

jpienaar commented 1 year ago

Interesting, so you are saying that the regression is due to the special case now being recognized?

pzread commented 1 year ago

> Is there an easy way we can measure the performance impact of reverting that commit?

Tested on RISC-V. With that TensorFlow commit reverted, the latency is the same as the run before the integration, which confirms that the TensorFlow commit causes the regression.

MaheshRavishankar commented 1 year ago

@dcaballe or @hcindyl for input. Cindy, I am not sure if this is something you have seen/tracked/are concerned about. The regression on RISC-V seems to be because softmax is now fused into a single dispatch. This is actually the currently expected state. Some input on how to handle RISC-V would be useful. If such special-case matching is not recommended on RISC-V, we can think about disabling it for RISC-V. Ideally, though, I would put this into the bucket of RISC-V issues.

pzread commented 1 year ago

> Interesting, so you are saying that the regression is due to the special case now being recognized?

Not entirely. I now tested with iree-flow-raise-special-ops disabled (so no iree_linalg_ext.softmax). It only improves the binary size a little (36216 -> 31816 bytes, ~12%), because we still need to generate code for the two extra ops, tosa.reduce_max and tosa.sub. The performance also only improves a little (~1.1x faster on RISC-V).

I'm not sure how much room is left for improvement, but it will need more investigation if we want to push this further.

dcaballe commented 1 year ago

Great findings! Since the x86 regression seems to be justified by the extra computation, I would suggest that we stop here for now and move this to P2. If you are already running on the FPGA, perhaps it's worth quickly running perf and sharing the archived profiles internally. You can also post the reference commit you are using and we can follow up later. Thanks a lot for the analysis!

hcindyl commented 1 year ago

I would defer this to @dcaballe; however, I am curious to see the perf of other models that contain softmax, such as https://github.com/openxla/iree/blob/ea0b28fe850a06123383cb1cf7671995362c06b1/build_tools/python/e2e_test_framework/models/tflite_models.py#L111-142. Is the hit only in the transformer encoder arch + RISC-V? (From https://github.com/openxla/iree/issues/12916#issuecomment-1518242653 it seems x86 also takes a hit.)

allieculp commented 1 year ago

@pzread @dcaballe Any update on this P0?

dcaballe commented 1 year ago

Moving this to P2 as the x86 side is resolved. Only a better understanding of the RISC-V side is pending.

ScottTodd commented 1 month ago

Nothing actionable here this much later; closing.