weird IR generation on RISCV

guoyuqi020 commented 2 months ago

I'm a researcher studying the differences in clang's IR generation between x86-64 and RISCV64. Recently I've been experimenting with the brotli-1.0.9 package, and I found that on RISCV64 clang can generate oddly complex IR for simple load-and-store logic.

For the same basic block, the IR generated on x86-64 is like this:

if.else853:                                       ; preds = %while.body
  tail call void @llvm.dbg.declare(metadata ptr %is_all_caps, metadata !2178, metadata !DIExpression()), !dbg !2180
  %transform854 = getelementptr inbounds %struct.DictWord, ptr %w, i32 0, i32 1, !dbg !2181
  %1133 = load i8, ptr %transform854, align 1, !dbg !2181
  %conv855 = zext i8 %1133 to i32, !dbg !2181
  %cmp856 = icmp ne i32 %conv855, 10, !dbg !2181
  %lnot858 = xor i1 %cmp856, true, !dbg !2181
  %lnot860 = xor i1 %lnot858, true, !dbg !2181
  %1134 = zext i1 %lnot860 to i64, !dbg !2181
  %cond = select i1 %lnot860, i32 1, i32 0, !dbg !2181
  store i32 %cond, ptr %is_all_caps, align 4, !dbg !2180
  tail call void @llvm.dbg.declare(metadata ptr %s862, metadata !2182, metadata !DIExpression()), !dbg !2183
  %1135 = load ptr, ptr %dictionary.addr, align 8, !dbg !2184
  %words863 = getelementptr inbounds %struct.BrotliEncoderDictionary, ptr %1135, i32 0, i32 0, !dbg !2186
  %1136 = load ptr, ptr %words863, align 8, !dbg !2186
  %1137 = load ptr, ptr %data.addr, align 8, !dbg !2187
  %1138 = load i64, ptr %max_length.addr, align 8, !dbg !2188
  %1139 = load i32, ptr %w, align 2, !dbg !2189
  store i32 %1139, ptr %w.i2875, align 2
  store ptr %1136, ptr %dictionary.addr.i2876, align 8
  tail call void @llvm.dbg.declare(metadata ptr %dictionary.addr.i2876, metadata !2190, metadata !DIExpression()), !dbg !2194
  tail call void @llvm.dbg.declare(metadata ptr %w.i2875, metadata !2196, metadata !DIExpression()), !dbg !2197
  store ptr %1137, ptr %data.addr.i2877, align 8
  tail call void @llvm.dbg.declare(metadata ptr %data.addr.i2877, metadata !2198, metadata !DIExpression()), !dbg !2199
  store i64 %1138, ptr %max_length.addr.i2878, align 8
  tail call void @llvm.dbg.declare(metadata ptr %max_length.addr.i2878, metadata !2200, metadata !DIExpression()), !dbg !2201
  %1140 = load i8, ptr %w.i2875, align 2, !dbg !2202
  %conv.i2882 = zext i8 %1140 to i64, !dbg !2204
  %1141 = load i64, ptr %max_length.addr.i2878, align 8, !dbg !2205
  %cmp.i2883 = icmp ugt i64 %conv.i2882, %1141, !dbg !2206
  br i1 %cmp.i2883, label %if.then.i2967, label %if.else.i2884, !dbg !2207

On RISCV64, it's like this:

if.else853:                                       ; preds = %while.body
  tail call void @llvm.dbg.declare(metadata ptr %is_all_caps, metadata !2372, metadata !DIExpression()), !dbg !2374
  %transform854 = getelementptr inbounds %struct.DictWord, ptr %w, i32 0, i32 1, !dbg !2375
  %1127 = load i8, ptr %transform854, align 1, !dbg !2375
  %conv855 = zext i8 %1127 to i32, !dbg !2375
  %cmp856 = icmp ne i32 %conv855, 10, !dbg !2375
  %lnot858 = xor i1 %cmp856, true, !dbg !2375
  %lnot860 = xor i1 %lnot858, true, !dbg !2375
  %1128 = zext i1 %lnot860 to i64, !dbg !2375
  %cond = select i1 %lnot860, i32 1, i32 0, !dbg !2375
  store i32 %cond, ptr %is_all_caps, align 4, !dbg !2374
  tail call void @llvm.dbg.declare(metadata ptr %s862, metadata !2376, metadata !DIExpression()), !dbg !2377
  %1129 = load ptr, ptr %dictionary.addr, align 8, !dbg !2378
  %words863 = getelementptr inbounds %struct.BrotliEncoderDictionary, ptr %1129, i32 0, i32 0, !dbg !2380
  %1130 = load ptr, ptr %words863, align 8, !dbg !2380
  %1131 = load ptr, ptr %data.addr, align 8, !dbg !2381
  %1132 = load i64, ptr %max_length.addr, align 8, !dbg !2382
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %w.coerce, ptr align 2 %w, i64 4, i1 false), !dbg !2383
  %1133 = load i64, ptr %w.coerce, align 8, !dbg !2383
  store i64 %1133, ptr %tmp.coerce.i2880, align 8
  call void @llvm.memcpy.p0.p0.i64(ptr align 2 %w.i2879, ptr align 8 %tmp.coerce.i2880, i64 4, i1 false)
  store ptr %1130, ptr %dictionary.addr.i2881, align 8
  tail call void @llvm.dbg.declare(metadata ptr %dictionary.addr.i2881, metadata !2384, metadata !DIExpression()), !dbg !2388
  tail call void @llvm.dbg.declare(metadata ptr %w.i2879, metadata !2390, metadata !DIExpression()), !dbg !2391
  store ptr %1131, ptr %data.addr.i2882, align 8
  tail call void @llvm.dbg.declare(metadata ptr %data.addr.i2882, metadata !2392, metadata !DIExpression()), !dbg !2393
  store i64 %1132, ptr %max_length.addr.i2883, align 8
  tail call void @llvm.dbg.declare(metadata ptr %max_length.addr.i2883, metadata !2394, metadata !DIExpression()), !dbg !2395
  %1134 = load i8, ptr %w.i2879, align 2, !dbg !2396
  %conv.i2887 = zext i8 %1134 to i64, !dbg !2398
  %1135 = load i64, ptr %max_length.addr.i2883, align 8, !dbg !2399
  %cmp.i2888 = icmp ugt i64 %conv.i2887, %1135, !dbg !2400
  br i1 %cmp.i2888, label %if.then.i2972, label %if.else.i2889, !dbg !2401

Most of the code is the same. The difference is in a small piece of load-and-store logic. On x86-64, I have:

  %1139 = load i32, ptr %w, align 2, !dbg !2189
  store i32 %1139, ptr %w.i2875, align 2

But on RISCV64, I have:

  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %w.coerce, ptr align 2 %w, i64 4, i1 false), !dbg !2383
  %1133 = load i64, ptr %w.coerce, align 8, !dbg !2383
  store i64 %1133, ptr %tmp.coerce.i2880, align 8
  call void @llvm.memcpy.p0.p0.i64(ptr align 2 %w.i2879, ptr align 8 %tmp.coerce.i2880, i64 4, i1 false)

Note that %w points to a 2-byte-aligned struct:

  %w = alloca %struct.DictWord, align 2

And %struct.DictWord is a 32-bit struct:

  %struct.DictWord = type { i8, i8, i16 }

These two snippets do the same thing: they load an i32 from %w and store it to another location. The question is, since this is an i32 value, why does clang on RISCV64 use two i64 temporaries and two memcpy calls for such a simple copy? Do you have any ideas?
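To make the comparison concrete, here is a rough C rendering of the two IR sequences. It is only a sketch: the field names `len`, `transform`, and `idx` and the function names are assumptions made for illustration, and the 64-bit temporary is zero-initialized here even though the IR leaves its upper four bytes undefined.

```c
#include <stdint.h>
#include <string.h>

/* Same layout as %struct.DictWord = type { i8, i8, i16 }.
   Field names are assumed for this sketch. */
typedef struct DictWord {
  uint8_t len;
  uint8_t transform;
  uint16_t idx;
} DictWord;

/* x86-64 shape: one i32 load followed by one i32 store. */
void copy_via_i32(DictWord *dst, const DictWord *src) {
  uint32_t v;
  memcpy(&v, src, sizeof v); /* %1139 = load i32, ptr %w */
  memcpy(dst, &v, sizeof v); /* store i32 %1139, ptr %w.i2875 */
}

/* RISCV64 shape: 4-byte memcpy into an i64 slot, i64 load/store,
   then a 4-byte memcpy out of a second i64 slot. */
void copy_via_i64(DictWord *dst, const DictWord *src) {
  uint64_t w_coerce = 0;       /* upper 4 bytes are undefined in the IR */
  uint64_t tmp_coerce;
  memcpy(&w_coerce, src, 4);   /* memcpy %w.coerce <- %w, 4 bytes */
  tmp_coerce = w_coerce;       /* load i64 / store i64 */
  memcpy(dst, &tmp_coerce, 4); /* memcpy %w.i2879 <- %tmp.coerce.i2880 */
}
```

Both functions move the same four bytes, so the destination struct ends up identical either way; only the width of the intermediate value differs.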

======================================================

I have uploaded the source code of the brotli-1.0.9 package: brotli-1.0.9.zip (https://github.com/llvm/llvm-project/files/15087235/brotli-1.0.9.zip)

The commands I used to generate the IR:

cd brotli-1.0.9
cd c
clang -g -fno-discard-value-names -S -emit-llvm ./enc/static_dict.c -I./include -o ~/x86.ll

I have also uploaded the IR generated on my machines, in case it helps: IR.zip (https://github.com/llvm/llvm-project/files/15087271/IR.zip)

I'm using commit 132bf4aedd678277b57d8e2bdabf9a1e9eb254c5 of LLVM.

If you need any other information, please let me know.

topperc commented 2 months ago

The x86.ll seems to be from brotli-1.0.9 and the riscv.ll is from brotli-1.1.0. The riscv.ll contains a function called BrotliFindAllStaticDictionaryMatchesFor that does not exist in x86.ll. The %1133 variable you show from riscv.ll is in this function, so I can't be sure the two .ll files are comparable.

guoyuqi020 commented 2 months ago

Sorry, that's my mistake.

I have tried Brotli 1.0.9 on RISCV. The same thing happened.

The basic block on RISCV with brotli 1.0.9:

if.else853:                                       ; preds = %while.body
  tail call void @llvm.dbg.declare(metadata ptr %is_all_caps, metadata !2182, metadata !DIExpression()), !dbg !2184
  %transform854 = getelementptr inbounds %struct.DictWord, ptr %w, i32 0, i32 1, !dbg !2185
  %1133 = load i8, ptr %transform854, align 1, !dbg !2185
  %conv855 = zext i8 %1133 to i32, !dbg !2185
  %cmp856 = icmp ne i32 %conv855, 10, !dbg !2185
  %lnot858 = xor i1 %cmp856, true, !dbg !2185
  %lnot860 = xor i1 %lnot858, true, !dbg !2185
  %1134 = zext i1 %lnot860 to i64, !dbg !2185
  %cond = select i1 %lnot860, i32 1, i32 0, !dbg !2185
  store i32 %cond, ptr %is_all_caps, align 4, !dbg !2184
  tail call void @llvm.dbg.declare(metadata ptr %s862, metadata !2186, metadata !DIExpression()), !dbg !2187
  %1135 = load ptr, ptr %dictionary.addr, align 8, !dbg !2188
  %words863 = getelementptr inbounds %struct.BrotliEncoderDictionary, ptr %1135, i32 0, i32 0, !dbg !2190
  %1136 = load ptr, ptr %words863, align 8, !dbg !2190
  %1137 = load ptr, ptr %data.addr, align 8, !dbg !2191
  %1138 = load i64, ptr %max_length.addr, align 8, !dbg !2192
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %w.coerce, ptr align 2 %w, i64 4, i1 false), !dbg !2193
  %1139 = load i64, ptr %w.coerce, align 8, !dbg !2193
  store i64 %1139, ptr %tmp.coerce.i2880, align 8
  call void @llvm.memcpy.p0.p0.i64(ptr align 2 %w.i2879, ptr align 8 %tmp.coerce.i2880, i64 4, i1 false)
  store ptr %1136, ptr %dictionary.addr.i2881, align 8
  tail call void @llvm.dbg.declare(metadata ptr %dictionary.addr.i2881, metadata !2194, metadata !DIExpression()), !dbg !2198
  tail call void @llvm.dbg.declare(metadata ptr %w.i2879, metadata !2200, metadata !DIExpression()), !dbg !2201
  store ptr %1137, ptr %data.addr.i2882, align 8
  tail call void @llvm.dbg.declare(metadata ptr %data.addr.i2882, metadata !2202, metadata !DIExpression()), !dbg !2203
  store i64 %1138, ptr %max_length.addr.i2883, align 8
  tail call void @llvm.dbg.declare(metadata ptr %max_length.addr.i2883, metadata !2204, metadata !DIExpression()), !dbg !2205
  %1140 = load i8, ptr %w.i2879, align 2, !dbg !2206
  %conv.i2887 = zext i8 %1140 to i64, !dbg !2208
  %1141 = load i64, ptr %max_length.addr.i2883, align 8, !dbg !2209
  %cmp.i2888 = icmp ugt i64 %conv.i2887, %1141, !dbg !2210
  br i1 %cmp.i2888, label %if.then.i2972, label %if.else.i2889, !dbg !2211

Here is the new IR; I have updated riscv.ll: IR.zip

topperc commented 2 months ago

I think what is happening is that clang is coercing the DictWord struct into an XLen sized integer type in RISCVABIInfo::classifyArgumentType. This is for the call to IsMatch in the brotli source.

Specifically this code

  // Aggregates which are <= 2*XLen will be passed in registers if possible,
  // so coerce to integers.
  if (Size <= 2 * XLen) {
    unsigned Alignment = getContext().getTypeAlign(Ty);

    // Use a single XLen int if possible, 2*XLen if 2*XLen alignment is
    // required, and a 2-element XLen array if only XLen alignment is required.
    if (Size <= XLen) {
      return ABIArgInfo::getDirect(
          llvm::IntegerType::get(getVMContext(), XLen));
    } else if (Alignment == 2 * XLen) {
      return ABIArgInfo::getDirect(
          llvm::IntegerType::get(getVMContext(), 2 * XLen));
    } else {
      return ABIArgInfo::getDirect(llvm::ArrayType::get(
          llvm::IntegerType::get(getVMContext(), XLen), 2));
    }
  }

This creates the memcpy. After that, the IsMatch function is inlined, but no other optimizations run to clean up the memcpy, since you compiled without optimizations.
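As a rough model of the branch structure in that snippet, the decision can be written as a small pure function in C. This is only a sketch: `CoerceKind` and `classify_aggregate` are names made up for illustration, the fall-through case is not shown in the quoted code, and sizes and alignments are taken to be in bits.

```c
#include <stdint.h>

/* Possible outcomes, mirroring the three branches of the quoted code plus
   the case it does not cover (aggregate larger than 2*XLen). */
typedef enum {
  COERCE_ONE_XLEN_INT,   /* a single iXLen */
  COERCE_ONE_2XLEN_INT,  /* a single i(2*XLen) */
  COERCE_TWO_XLEN_ARRAY, /* [2 x iXLen] */
  NOT_COERCED            /* handled elsewhere; not part of the snippet */
} CoerceKind;

/* size, alignment, and xlen are all in bits. */
CoerceKind classify_aggregate(uint64_t size, uint64_t alignment,
                              unsigned xlen) {
  if (size <= 2u * xlen) {
    if (size <= xlen)
      return COERCE_ONE_XLEN_INT;
    if (alignment == 2u * xlen)
      return COERCE_ONE_2XLEN_INT;
    return COERCE_TWO_XLEN_ARRAY;
  }
  return NOT_COERCED;
}
```

For DictWord on RISCV64 the inputs would be a size of 32 bits, an alignment of 16 bits, and XLen of 64, which lands in the first branch: the 4-byte struct is widened to a single 64-bit integer, and that widening is where the extra memcpy comes from.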

X86, on the other hand, passes the struct as an i32 value, which doesn't require the memcpy.
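For anyone who wants to look at this outside the full brotli tree, a reduced test case could look roughly like the following. It is a sketch under assumptions: the field name `idx` and the functions `word_too_long` and `check_word` are hypothetical stand-ins, not brotli's actual code; only the struct layout and the by-value call are taken from the discussion above.

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as %struct.DictWord = type { i8, i8, i16 }. */
typedef struct DictWord {
  uint8_t len;
  uint8_t transform;
  uint16_t idx; /* field name assumed for this sketch */
} DictWord;

_Static_assert(sizeof(DictWord) == 4, "expected a 32-bit struct");
_Static_assert(_Alignof(DictWord) == 2, "expected 2-byte alignment");

/* Hypothetical by-value callee: because the struct is passed by value,
   the frontend has to pick an integer type to carry it in. */
static int word_too_long(DictWord w, size_t max_length) {
  return (size_t)w.len > max_length;
}

/* Caller that copies the struct and passes it by value. */
int check_word(const DictWord *p, size_t max_length) {
  DictWord w = *p;
  return word_too_long(w, max_length);
}
```

Compiling this at -O0 with -S -emit-llvm for each target should make it possible to compare how the argument to word_too_long is coerced. The exact output for this reduced case has not been checked here.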

The handling of the ABI is weirdly divided between clang and the backend. The ABI says a small struct is passed packed in a single integer register. Since only clang knows that it is a C struct, clang is responsible for coercing the struct to an integer type. I'm not sure we need to use XLen for this integer type. Using i32 would probably still work with the backend.

@asb @jrtc27 @kito-cheng

llvmbot commented 2 months ago

@llvm/issue-subscribers-backend-risc-v

Author: Yuqi Guo (guoyuqi020)


jrtc27 commented 2 months ago

Does it really matter? This is the kind of thing that SROA can optimise.

guoyuqi020 commented 2 months ago

I'm not very familiar with LLVM's optimizations, so I have a few questions:

1. Is SROA always enabled?
2. The command I used to generate the IR is clang -g -fno-discard-value-names -S -emit-llvm <source-code-file> -o output.ll. How can I enable SROA?
3. I built a tool to detect inconsistencies between the IR for x86 and riscv. It scans the IR on each machine and generates some data; I then gather the data from both and use an algorithm to compare them. Since I don't know LLVM very well, I added my scanning code to ThreadSanitizer's instrumentation pass, so it collects the data while ThreadSanitizer instruments the IR. I found these odd memcpy calls in my data, so I guess SROA had not yet run at the time of ThreadSanitizer's instrumentation. Is it possible to run SROA (and any other necessary optimizations) before ThreadSanitizer's instrumentation?

jrtc27 commented 2 months ago

Any -On for n > 0 will enable those kinds of optimisations. You're just producing unoptimised IR, since the default is -O0, and unoptimised IR is deliberately very stupid.

llvm / llvm-project

weird IR generation on RISCV #89866