iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Need to reduce the device code size for RISC-V quantized vectorized workload #8849

Closed hcindyl closed 2 years ago

hcindyl commented 2 years ago

Moving the discussion from another issue here so we can keep it on topic.

After the vectorization path for the quantized conv1x1 op was enabled, the library size for the workload (.so or .o) became much larger than before. For example, for the person detection model built in static library mode, the resulting .o size grew from 215 KB (with candidate-20220225.59) to >1 MB.

Currently, RISC-V codegen shares the same matmul configuration as x86, so there is likely room for improvement here.

To repro:

  1. Download tflite file: https://github.com/google/iree/blob/main/integrations/tensorflow/test/python/iree_tfl_tests/person_detect_test.py#L14

  2. Import it into MLIR with iree-import-tosa:

    iree-import-tosa -o /tmp/tosa.mlir person_detect.tflite
  3. Run the translation targeting RISC-V:

    iree-compile --iree-mlir-to-vm-bytecode-module \
    --iree-hal-target-backends=dylib-llvm-aot \
    --iree-llvm-target-triple=riscv32-pc-linux-elf -iree-llvm-target-cpu=generic-rv32 -iree-llvm-target-abi=ilp32 \
    --iree-llvm-target-cpu-features=+m,+a,+f,+zvl512b,+zve32x \
    --riscv-v-vector-bits-min=512 --riscv-v-fixed-length-vector-lmul-max=8 \
    --iree-llvm-debug-symbols=false \
    --iree-llvm-link-embedded=false --iree-llvm-link-static --iree-llvm-system-linker-path=lld \
    --iree-llvm-static-library-output-path=/tmp/a.o \
    --iree-input-type=tosa /tmp/tosa.mlir -o /tmp/a.vmfb
  4. Inspect the size and content of the static library output /tmp/a.o (e.g., with the commands sketched below).
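One possible way to do step 4, assuming the LLVM binutils (or an equivalent RISC-V GNU toolchain) are on the PATH:

    llvm-size /tmp/a.o                                  # per-section sizes
    llvm-nm --print-size --size-sort /tmp/a.o | tail    # largest symbols in the object
    llvm-objdump -d /tmp/a.o | less                     # disassemble the generated code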

dcaballe commented 2 years ago

Thanks for creating the issue, Cindy!

I discussed this with the team and did some exploration today. Our main suspicion was that the code size issue is due to some kind of unrolling. LLVM's loop unrolling was recently enabled by default in IREE (#8408; thanks for the pointer, @MaheshRavishankar!), but I didn't see a major difference when turning it off.

When looking at the IR, I found some vector shapes that caught my attention: vector<64xi32>, vector<8x32xi32>, vector<16x32xi32>, etc. From an x86 perspective (assuming 512-bit registers), a single vector<16x32xi32> operation would lower to 16x2=32 512-bit vector instructions, a vector<8x32xi32> operation to 8x2=16 instructions, and a vector<64xi32> operation to 4 instructions. That's effectively a lot of "unrolling" that is increasing the code size and maybe even leading to register spilling. I guess the problem is similar for RISC-V, but I'm not sure whether the vector length multiplier makes this situation any better (I'm still catching up on the hardware details). Maybe you can clarify.
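To make that arithmetic concrete, here is a quick back-of-the-envelope check (assuming 512-bit registers, i.e., 16 i32 lanes per register):

    echo $(( 16 * 32 / 16 ))   # vector<16x32xi32> -> 32 register-sized pieces
    echo $((  8 * 32 / 16 ))   # vector<8x32xi32>  -> 16
    echo $((      64 / 16 ))   # vector<64xi32>    -> 4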

These large vector shapes, whose fastest-varying dimensions span multiple physical vector registers, are the result of vectorizing operations with mixed data widths, like i8 and i32. We recently introduced some great initial logic (#8552) to compute the vector size for these cases. This logic currently chooses a vector size based on the smallest data type found, which leads to multi-register vector sizes for the larger data types (e.g., vector size = 64 for i8/i32 ops, vector size = 32 for i16/i32 ops). I've seen this be problematic in the past because it may inadvertently increase register pressure too much and lead to register spilling. I would suggest that we make the default logic a bit more conservative by computing the vector size based on the largest data type instead of the smallest one. We may leave some performance on the table initially, but we can refine this logic incrementally.

I ran some experiments along these lines:

| Version | Code size |
| --- | --- |
| Current baseline | 2.0 MB |
| Disabling LLVM's unrolling | 1.9 MB |
| Computing the vector size using the largest data type | 580 KB |
| + disabling LLVM's unrolling | 540 KB |

Please let me know what you think and I'll send an initial PR if this makes sense to you. Thanks!

hcindyl commented 2 years ago

Sounds good, we can examine the generated workload and discuss the next steps.

For the RISC-V vector extension (RVV), the configuration above also defines the vector registers as 512 bits:

--iree-llvm-target-cpu-features=+m,+a,+f,+zvl512b,+zve32x \
--riscv-v-vector-bits-min=512 --riscv-v-fixed-length-vector-lmul-max=8 \

This translates to the system having 32 vector registers (as mandated by the RVV spec), each 512 bits long, with a length multiplier (LMUL) of up to 8, which groups them into 4 groups of vector registers of 4096 bits each.
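A quick sanity check of that geometry (just restating the numbers above):

    echo $(( 512 * 8 ))   # bits per register group at LMUL=8 -> 4096
    echo $((  32 / 8 ))   # number of such groups out of 32 registers -> 4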

dcaballe commented 2 years ago

> Sounds good, we can examine the generated workload and discuss the next steps.

I created #8883 for you to give it a try.

> 4 groups of vector registers of 4096 bits

Do registers need to be consecutive within a group? That would make the problem even worse than for x86 register-wise, but at least you can handle it with just one instruction, I guess.

hcindyl commented 2 years ago

With LMUL>1, the registers within a group are consecutive. For example, with LMUL=8 you have 4 groups that are only addressable as v0, v8, v16, and v24. A simple register initialization sequence looks like this:

vsetvli t0, zero, e8, m8, tu, mu   # 8-bit elements, LMUL=8 (tail/mask undisturbed)
vmv.v.i v0, 0                      # zero each of the four register groups
vmv.v.i v8, 0
vmv.v.i v16, 0
vmv.v.i v24, 0

However, IIUC there are some additional LLVM options you need to set to utilize LMUL>1 properly; otherwise, in most cases it is just m1 even though the flag --riscv-v-fixed-length-vector-lmul-max=8 is set.

hcindyl commented 2 years ago

We have another code size regression since https://github.com/google/iree/releases/tag/candidate-20220418.111, possibly due to https://github.com/google/iree/pull/8723. The code size for person_detect grew to 2.5 MB. :(

MaheshRavishankar commented 2 years ago

Oh shucks... well, I don't know what the root cause is. It might be that "more fusion" at that level now leads to more vector instructions being generated per dispatch...

dcaballe commented 2 years ago

Yep. I've been working on a heuristic that reduced the code size from 2 MB to 500 KB. Unfortunately, I see that with the latest version of IREE the baseline is 3.9 MB and it goes down to only 3.2 MB with the heuristic. Looking into it.

What are you measuring exactly, @hcindyl? I'm just looking at the size of the whole .o file.

benvanik commented 2 years ago

You'll want to look at the .so - the .o's we emit aren't stripped (IIRC).

hcindyl commented 2 years ago

We measure the size at link time, when we link the static library with the runtime (similar to iree/samples/static_library), and set a size alarm inside the linker script. If you want to double-check the embedded .so size, you can update the CLI to:

iree-compile --iree-mlir-to-vm-bytecode-module \
--iree-hal-target-backends=dylib-llvm-aot \
--iree-llvm-target-triple=riscv32-pc-linux-elf -iree-llvm-target-cpu=generic-rv32 -iree-llvm-target-abi=ilp32 \
--iree-llvm-target-cpu-features=+m,+a,+f,+zvl512b,+zve32x \
--riscv-v-vector-bits-min=512 --riscv-v-fixed-length-vector-lmul-max=8 \
--iree-llvm-debug-symbols=false \
--iree-llvm-link-embedded=true --iree-llvm-embedded-linker-path=lld  \
--iree-llvm-keep-linker-artifacts \
--iree-input-type=tosa /tmp/tosa.mlir -o /tmp/a.vmfb
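With --iree-llvm-keep-linker-artifacts set, the compiler should report where it kept the linked .so, and you can measure that file directly. A rough sketch (the path below is a placeholder; use the one the compiler prints):

    LINKED_SO=/tmp/path/printed/by/iree-compile.so   # placeholder path
    ls -lh "$LINKED_SO"
    llvm-size "$LINKED_SO"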

MaheshRavishankar commented 2 years ago

> Yep. I've been working on a heuristic that reduced the code size from 2 MB to 500 KB. Unfortunately, I see that with the latest version of IREE the baseline is 3.9 MB and it goes down to only 3.2 MB with the heuristic. Looking into it.
>
> What are you measuring exactly, @hcindyl? I'm just looking at the size of the whole .o file.

Could you post some info about the dispatches that are creating the issue here?

dcaballe commented 2 years ago

After looking at the .so file, the numbers make more sense. Thanks! Before the latest fusion changes, the baseline code size was 1.3MB and it went down to 250 KB with the initial version of the heuristic I've been working on. After the fusion changes, the baseline code size is 2.2MB and it goes down to only 1.8MB with the heuristic :(.

The latest code size increase happened because recent changes in the fusion algorithm let us fuse matmul ops with subsequent element-wise ops more aggressively. For example, in the snippet below, you can see how ops %28-%34 are fused with the producer matmul and vectorized using the matmul vector size vector<8x16xi32>. Without the recent fusion changes, they were not fused and were vectorized using vector<1x16xi32>, resulting in 8x less unrolling.

  scf.for %arg0 = %c0 to %c64 step %c8 {
    %23 = vector.transfer_read %9[%arg0, %c0], %c0_i8 {in_bounds = [true, true]} : memref<64x8xi8, affine_map<(d0, d1)[s0] -> (d0 * 8 + s0 + d1)>>, vector<8x8xi8>
    %24 = arith.extsi %23 : vector<8x8xi8> to vector<8x8xi32>
    %25 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %24, %13, %cst : vector<8x8xi32>, vector<8x16xi32> into vector<8x16xi32>
    %26 = arith.subi %25, %22 : vector<8x16xi32>
    %27 = arith.addi %15, %26 : vector<8x16xi32>
    %28 = "tosa.apply_scale"(%27, %19, %21) {double_round = true} : (vector<8x16xi32>, vector<8x16xi32>, vector<8x16xi8>) -> vector<8x16xi32>
    %29 = arith.addi %28, %cst_0 : vector<8x16xi32>
    %30 = arith.cmpi slt, %29, %cst_0 : vector<8x16xi32>
    %31 = arith.select %30, %cst_0, %29 : vector<8x16xi1>, vector<8x16xi32>
    %32 = arith.cmpi slt, %cst_1, %29 : vector<8x16xi32>
    %33 = arith.select %32, %cst_1, %31 : vector<8x16xi1>, vector<8x16xi32>
    %34 = arith.trunci %33 : vector<8x16xi32> to vector<8x16xi8>
    vector.transfer_write %34, %10[%arg0, %c0] {in_bounds = [true, true]} : vector<8x16xi8>, memref<64x16xi8, affine_map<(d0, d1)[s0] -> (d0 * 16 + s0 + d1)>>
  }

We can probably hack something together for this specific case, but that doesn't scale to future issues. I see this becoming a recurrent source of problems and there is no simple solution: any small change or new optimization in the pipeline will likely have an impact on code size that might be pretty significant, and not necessarily a problem for all targets. To make an informed decision on how to proceed, we should know:

* what the specific code size requirements are (i.e., what the acceptable code size upper bound is),
* how important code size is vs. performance,
* how critical the problem is for the project (e.g., could we modify the simulator parameters to accommodate a larger binary?), and
* whether these requirements are temporary (i.e., code size will be less of a problem after a certain development stage) or permanent.

We talked about introducing optimization levels in the compiler, where one of them would optimize for code size. I think in the short term we could only provide an optimization level that strictly optimizes for code size (equivalent to -Oz, disabling all kinds of unrolling and other optimizations that could increase code size), but that would have a severe impact on performance.

What are your thoughts?

benvanik commented 2 years ago

One thing that would be good to analyze is why there is so much code: this seems to indicate there are confounding variables. If things were operating correctly, we should see smaller binaries with more fusion, as we are removing all of the boilerplate that would have surrounded each dispatch. If authoring this by hand, we should be seeing a few dozen extra instructions where things were unrolled, totaling only a few KB of increase.

(I don't think this is a new issue - we've always generated way too much code - but most people have strictly cared about execution performance at the cost of code size, and I believe there are some larger issues here than tuning parameters alone can solve.)

benvanik commented 2 years ago

(e.g., in the snippet above, each of those trailing select/cmp/etc ops should be a single SIMD instruction when lowered, but I suspect they turn into many orders of magnitude more than that :)

MaheshRavishankar commented 2 years ago

Agreed with Ben on this; we are probably hitting the limits of what is currently implemented in MLIR w.r.t. scalar folding. Also, the lowering we use for CPU codegen has been developed primarily with x86 performance in mind, where code size is not an issue. Optimization flags can only mask the deficiencies; we need to make sure the vector dialect does a better lowering to LLVM (or figure out whether the LLVM IR produced by MLIR isn't getting peephole-optimized enough within LLVM). We are fusing more for sure, but we are not fusing so much that it should lead to several MB of code. The main pain point I see here is the lowering of vector operations to LLVM IR; it has pretty bad code size and compilation time implications.

dcaballe commented 2 years ago

I can definitely take a deeper look! Note that there are no other differences at the MLIR level other than the fusion of contract + trailing ops. There are quite a few instances of them, so they are probably dominating the code size. Some instances look even worse than the one above (e.g., trailing ops go from vector<1x16xi32> to vector<12x32xi32>) and I do see register spilling, which is another multiplier (note that some contract ops effectively unroll the matmul by a factor of 64 already!). Let's see what else I find :)

Quick and unrelated question: Why is tosa.apply_scale not lowered until we lower to the LLVM dialect? We are missing a few optimizations at MLIR level because of that.

benvanik commented 2 years ago

Very excited you're looking into this! The amount of potential performance we're sitting on is quite large :)

@rsuderman has the history; I believe it's because the operation it's performing is not easily expanded with the currently available ops - it makes things a bajillion times worse :)

Here's a related issue: https://github.com/google/iree/issues/7781

You can see what it expands to in the current world here: https://github.com/llvm/llvm-project/blob/main/mlir/test/Conversion/TosaToArith/tosa-to-arith.mlir#L15-L49

The reference for it is in the TOSA spec section 1.5.2: https://developer.mlplatform.org/file/download/tywkdgcadici74tnd67t/PHID-FILE-ygtmfyvgclfmd5vetehh/tosa_spec.pdf (apply_scale_32 / apply_scale_16).

It really calls for a small number of ops in arith/math/whatever that model the behavior without needing to expand it to that form (and without requiring the 64-bit math to be explicitly modeled). It's kind of like why we have an FMA op instead of just mul+add. The examples above with the cmp+selects are similar expansions with no ops to contract into (clamp/saturate/etc).

benvanik commented 2 years ago

There's some orthogonal design work, specifically for code size, that is also critical to offsetting bloat: executables in IREE can have an arbitrary number of functions, and we link them together and deduplicate them, but today we effectively generate each dispatch function independently in its own hal.executable and fully inline everything. Ideally we'd be able to extract common code segments out into functions and call them inside our inner loops to operate on registers (we're not limited by any standard calling conventions here, as we're only calling amongst our own code).

As an example say there's an 8x or 16x or ... unrolled sequence of instructions for doing something like the apply_scale: we could generate an internal function that does that, emit a call, and then share that across all dispatches that perform the apply_scale. That's effectively what most libraries do: https://github.com/ARM-software/ComputeLibrary/blob/7dcb9fadb98cad05fca72de3273311d570d98b4e/src/core/NEON/kernels/arm_gemm/quantized.cpp#L59

This would let us take any innermost loop operating on fixed sizes and share it among all dispatches that use it - there's a much smaller number of those vs the total number of unique dispatches post-fusion (matmul + add and matmul + mul could share the same matmul section). The goal of fusing dispatches is to keep things in registers and avoid round-tripping into memory between operations and there are lots of ways to do that besides fully inlining everything.

Both are really exciting and meaty areas with lots of potential :)

MaheshRavishankar commented 2 years ago

> I can definitely take a deeper look! Note that there are no other differences at the MLIR level other than the fusion of concat + trailing ops. There are quite a few instances of them, so they are probably dominating the code size. Some instances look even worse than the one above (e.g., trailing ops go from vector<1x16xi32> to vector<12x32xi32>) and I do see register spilling, which is another multiplier (note that some concat ops are unrolled by a factor of 64 already!). Let's see what else I find :)
>
> Quick and unrelated question: Why is tosa.apply_scale not lowered until we lower to the LLVM dialect? We are missing a few optimizations at MLIR level because of that.

I am missing the context on the fusion of concat + trailing ops. Do you have a snippet of the IR that shows a little bit more context on what's happening here?

dcaballe commented 2 years ago

> I am missing the context on the fusion of concat + trailing ops. Do you have a snippet of the IR that shows a little bit more context on what's happening here?

My bad... I meant "contract" not "concat" :). Contract + trailing ops like the snippet I posted above. Let me know if it's still not clear and I'll clarify further :)

dcaballe commented 2 years ago

I spent some time looking at the LLVM IR right before it goes into the RISC-V backend, and at the generated assembly. Paradoxically, the major code size problem is related to issue #7781 that @benvanik mentioned above: we generate i64 instructions for some tosa ops like tosa.apply_scale. For example:

    %79 = sext <16 x i8> %78 to <16 x i64>, !dbg !90
    %80 = shl <16 x i64> <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>, %79, !dbg !90
    %81 = sext <16 x i8> %77 to <16 x i32>, !dbg !90
    %82 = add <16 x i64> %80, <i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824>, !dbg !90
    %83 = sub <16 x i64> %80, <i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824, i64 1073741824>, !dbg !90
    %84 = icmp sge <16 x i32> %76, zeroinitializer, !dbg !90
    %85 = select <16 x i1> %84, <16 x i64> %82, <16 x i64> %83, !dbg !90
    %86 = icmp sge <16 x i32> %81, <i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32, i32 32>, !dbg !90
    %87 = select <16 x i1> %86, <16 x i64> %85, <16 x i64> %80, !dbg !90
    %88 = sext <16 x i32> %76 to <16 x i64>, !dbg !90
    %89 = extractvalue [1 x [1 x [1 x <16 x i32>]]] %67, 0, 0, 0, !dbg !90
    %90 = sext <16 x i32> %89 to <16 x i64>, !dbg !90
    %91 = sext <16 x i8> %77 to <16 x i64>, !dbg !90
    %92 = mul <16 x i64> %88, %90, !dbg !90
    %93 = add <16 x i64> %92, %87, !dbg !90
    %94 = ashr <16 x i64> %93, %91, !dbg !90
    %95 = trunc <16 x i64> %94 to <16 x i32>, !dbg !90

i64 vectors are not supported by the RISC-V configuration we use, so every single i64 element in a vector is scalarized and goes through memory between operations. Taking into account that we sometimes unroll this code by a factor of 64… you can imagine the mess :). This is certainly also penalizing compile time.

If we add the ‘+zve64x’ feature to support i64 vectors, the code size goes from 2.2MB to 186KB :). We have to address #7781 before we can explore anything else. @benvanik, @MaheshRavishankar could we prioritize that work? Do we have an estimate?
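For reference, this corresponds to extending the target feature list from the repro command above along these lines (a sketch; whether +zve64x is acceptable depends on the target hardware):

    --iree-llvm-target-cpu-features=+m,+a,+f,+zvl512b,+zve32x,+zve64x \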

MaheshRavishankar commented 2 years ago

Wow, that's a great find! @rsuderman will have more context on that. Let's sync up when we are all back to see what the way forward here is.

benvanik commented 2 years ago

mmmm an actual order of magnitude - great investigation :)

dcaballe commented 2 years ago

@rsuderman has been working on a fix to get rid of the i64 instructions related to tosa.apply_scale: https://reviews.llvm.org/D125583. The patch is still WIP but in a preliminary evaluation the code size goes down to 388 KB!

rsuderman commented 2 years ago

@hcindyl @dcaballe I believe we can mark this as fixed, correct? It was submitted in

https://reviews.llvm.org/D125583

dcaballe commented 2 years ago

I still have to enable it for RISC-V based on the specific hardware features. It's on my TODO list but I haven't been able to get to it yet.

dcaballe commented 2 years ago

https://github.com/google/iree/pull/9407 added the final bits to fix this issue.

Thank you all!