wasmerio / c-wasm-simd128-example

Example C++ repo emitting Wasm SIMD 128 instructions

Analysis of benchmark #1

Open kripken opened 5 years ago

kripken commented 5 years ago

Nice work!

I looked into this benchmark with @tlively now. It's interesting, but we're not sure what's going on here. The main issue is that the SIMD code emitted by the LLVM wasm backend is quite poor, e.g., here is horizontal_add:

 (func $horizontal_add\28float\20vector\5b4\5d\29 (; 28 ;) (type $7) (param $0 v128) (result f32)
  (f32.add
   (f32x4.extract_lane 0
    (local.tee $0
     (f32x4.add
      (local.get $0)
      (v8x16.shuffle 8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
       (local.get $0)
       (local.get $0)
      )
     )
    )
   )
   (f32x4.extract_lane 1
    (local.get $0)
   )
  )
 )

That looks quite painful, as really just 3 scalar adds are needed here :( But this seems like a limitation of the current wasm SIMD spec. Indeed, testing on V8, the SIMD version is significantly slower (80%) than the scalar version.
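
For comparison, a scalar version needs only three dependent adds (a minimal sketch, assuming the float4 ext_vector type used in the benchmark):

typedef float float4 __attribute__((ext_vector_type(4)));

// Three scalar adds, no shuffles or lane extracts.
float horizontal_add_scalar(float4 x) {
  return x.x + x.y + x.z + x.w;
}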

So it is puzzling that you see such a big speedup in the wasmer LLVM backend when running the SIMD wasm. We would guess that LLVM takes this suboptimal SIMD wasm code and autovectorizes it into something fast. That is slightly puzzling, though, since on simple code like this LLVM might be able to do the same for the non-SIMD wasm as well. In other words, it seems like the speedup you see is due to the SIMD wasm providing slightly better "hinting" that the wasmer LLVM backend takes advantage of?

cc @sunfishcode, who may have ideas on what's going on, and who may be interested in this example of a wasm VM doing very powerful backend optimizations that at least some other VMs do not expect.

penzn commented 5 years ago

Horizontal adds can be tricky. Currently the function does this:

float horizontal_add(float4 x) {
  auto partial_sum = x.xy + x.zw;
  return partial_sum.x + partial_sum.y;
}

This moves the low and high halves, adds them, then does a scalar add on the two lanes of the result. Naively that could be two shuffles, two lane extracts, a SIMD add, and a scalar add, but the backend is able to optimize some of it.

At -O0 clang produces

; Function Attrs: noinline nounwind optnone uwtable
define internal float @_ZN12_GLOBAL__N_114horizontal_addEDv4_f(<4 x float> %x) #6 {
entry:
  %x.addr = alloca <4 x float>, align 16
  %partial_sum = alloca <2 x float>, align 8
  store <4 x float> %x, <4 x float>* %x.addr, align 16
  %0 = load <4 x float>, <4 x float>* %x.addr, align 16
  %1 = shufflevector <4 x float> %0, <4 x float> undef, <2 x i32> <i32 0, i32 1>
  %2 = load <4 x float>, <4 x float>* %x.addr, align 16
  %3 = shufflevector <4 x float> %2, <4 x float> undef, <2 x i32> <i32 2, i32 3>
  %add = fadd <2 x float> %1, %3
  store <2 x float> %add, <2 x float>* %partial_sum, align 8
  %4 = load <2 x float>, <2 x float>* %partial_sum, align 8
  %5 = extractelement <2 x float> %4, i64 0
  %6 = load <2 x float>, <2 x float>* %partial_sum, align 8
  %7 = extractelement <2 x float> %6, i64 1
  %add1 = fadd float %5, %7
  ret float %add1
}

Thanks to @peterjensen for our discussion on horizontal add.

nlewycky commented 5 years ago

The nice thing about using shuffles for horizontal adds is that you can add N elements with log_2(N) add instructions. One issue I ran into is that it was surprisingly difficult to convince the LLVM Wasm backend to emit shuffles; the equivalent code for horizontally adding 16 uint8_t's gets scalarized, for example.
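
For example, a sketch of the log2(N) shuffle-reduction pattern for 4 floats using clang vector extensions (illustrative only, not the benchmark's code):

typedef float float4 __attribute__((ext_vector_type(4)));

// log2(4) = 2 vector adds plus a single lane extract.
float horizontal_add_shuffle(float4 x) {
  // Fold the high half onto the low half: lane 0 = x0+x2, lane 1 = x1+x3.
  float4 t = x + __builtin_shufflevector(x, x, 2, 3, -1, -1);
  // Fold lane 1 onto lane 0: lane 0 = x0+x1+x2+x3.
  t = t + __builtin_shufflevector(t, t, 1, -1, -1, -1);
  return t[0];
}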

The instructions I get on my system when building with clang -O2 -march=native -mtune=native are

        vpermilpd       $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
        vaddps  %xmm1, %xmm0, %xmm0
        vmovshdup       %xmm0, %xmm1    # xmm1 = xmm0[1,1,3,3]
        vaddss  %xmm1, %xmm0, %xmm0

which is sweet AVX, but I was hoping LLVM would instruction-select a horizontal add for this pattern. I haven't checked whether HADDPS would truly be faster.
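
For reference, a sketch of what the HADDPS version could look like with SSE3 intrinsics (untested and not benchmarked here; on many microarchitectures haddps decodes into shuffle plus add uops, so it is not obviously faster):

#include <pmmintrin.h>  // SSE3: _mm_hadd_ps

// Horizontal add of 4 floats via two HADDPS instructions.
float horizontal_add_haddps(__m128 x) {
  __m128 t = _mm_hadd_ps(x, x);  // {x0+x1, x2+x3, x0+x1, x2+x3}
  t = _mm_hadd_ps(t, t);         // lane 0 = x0+x1+x2+x3
  return _mm_cvtss_f32(t);
}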

The IR for horizontal add after our LLVM optimization pass pipeline looks like this:

define float @fn0(%ctx*, i128) personality i32 ()* @__gxx_personality_v0 {
entry:
  %2 = bitcast i128 %1 to <16 x i8>
  %s4 = shufflevector <16 x i8> %2, <16 x i8> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
  %3 = bitcast i128 %1 to <4 x float>
  %4 = bitcast <16 x i8> %s4 to <4 x float>
  %nan = fcmp uno <4 x float> %3, zeroinitializer
  %5 = select <4 x i1> %nan, <4 x float> <float 0x7FF8000000000000, float 0x7FF8000000000000, float 0x7FF8000000000000, float 0x7FF8000000000000>, <4 x float> %3
  %nan1 = fcmp uno <4 x float> %4, zeroinitializer
  %6 = select <4 x i1> %nan1, <4 x float> <float 0x7FF8000000000000, float 0x7FF8000000000000, float 0x7FF8000000000000, float 0x7FF8000000000000>, <4 x float> %4
  %s5 = fadd <4 x float> %5, %6
  %s6 = extractelement <4 x float> %s5, i32 0
  %s8 = extractelement <4 x float> %s5, i32 1
  %7 = insertelement <2 x float> undef, float %s6, i32 0
  %8 = insertelement <2 x float> %7, float %s8, i32 1
  %9 = fcmp uno <2 x float> %8, zeroinitializer
  %10 = select <2 x i1> %9, <2 x float> <float 0x7FF8000000000000, float 0x7FF8000000000000>, <2 x float> %8
  %11 = extractelement <2 x float> %10, i32 0
  %12 = extractelement <2 x float> %10, i32 1
  %s9 = fadd float %11, %12
  ret float %s9
}

mostly because we replace all NaNs with one canonical positive NaN before each operation. The resulting x86-64 is consequently more complicated:

        vmovq  %rdx,%xmm0
        vmovq  %rsi,%xmm1
        vbroadcastss 0x0(%rip),%xmm4
        vxorps %xmm2,%xmm2,%xmm2
        vpunpcklqdq %xmm0,%xmm1,%xmm0
        vpshufb 0x0(%rip),%xmm0,%xmm1
        vcmpunordps %xmm2,%xmm0,%xmm3
        vblendvps %xmm3,%xmm4,%xmm0,%xmm0
        vcmpunordps %xmm2,%xmm1,%xmm3
        vblendvps %xmm3,%xmm4,%xmm1,%xmm1
        vaddps %xmm1,%xmm0,%xmm0
        vcmpunordps %xmm2,%xmm0,%xmm1
        vblendvps %xmm1,%xmm4,%xmm0,%xmm0
        vmovshdup %xmm0,%xmm1
        vaddss %xmm1,%xmm0,%xmm0
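
For clarity, a minimal C++ sketch of what the per-operation NaN canonicalization amounts to (illustrative only; the real transform is applied to the LLVM IR as shown above, not written by hand):

#include <cmath>

typedef float float4 __attribute__((ext_vector_type(4)));

// Replace every NaN lane with one canonical positive NaN, mirroring the
// fcmp uno + select pairs in the IR above.
static float4 canonicalize(float4 v) {
  float4 r = v;
  for (int i = 0; i < 4; ++i)
    if (std::isnan(v[i])) r[i] = __builtin_nanf("");
  return r;
}

// An f32x4.add with canonicalized inputs: extra compare/select steps per
// operand compared to a plain vector add.
static float4 canonicalizing_add(float4 a, float4 b) {
  return canonicalize(a) + canonicalize(b);
}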

We do run the LLVM SLP autovectorizer pass, and the main difference it makes to the particle-repel example is that we get 3 vdivps instructions instead of 12 vdivss.

I'm happy to supply .wat, .o, or .s files if anyone wants to look at them. I've looked at the side-by-side diffs, but there are many small changes throughout and it's hard to quickly identify which ones represent a meaningful difference.

dtig commented 5 years ago

Playing around with this a little bit to see what code this generates, and just analyzing the HorizontalAdd, it doesn't look like the native codegen pattern-matches it that well either. This is compiling for x64 without AVX -

The snippet of the LLVM IR for the HorizontalAdd -

  %18 = shufflevector <4 x float> %17, <4 x float> undef, <2 x i32> <i32 0, i32 1>, !dbg !98
  %19 = shufflevector <4 x float> %17, <4 x float> undef, <2 x i32> <i32 2, i32 3>, !dbg !99
  %20 = fadd <2 x float> %18, %19, !dbg !100
  call void @llvm.dbg.value(metadata <2 x float> %20, metadata !92, metadata !DIExpression()), !dbg !96
  %21 = extractelement <2 x float> %20, i32 0, !dbg !101
  %22 = extractelement <2 x float> %20, i32 1, !dbg !102
  %23 = fadd float %21, %22, !dbg !103 

And the x64 codegen -

        movaps  xmm4, xmm3
        unpckhpd        xmm4, xmm3      # xmm4 = xmm4[1],xmm3[1]
        addps   xmm4, xmm3
        movaps  xmm3, xmm4
        shufps  xmm3, xmm4, 229         # xmm3 = xmm3[1,1],xmm4[2,3]
        addss   xmm3, xmm4

My naive read of this is that this is what the compiler has to do to preserve the semantics of the function the way it is written.

penzn commented 5 years ago

@nlewycky I'd like to take a look at the code. I am also curious about what happens when we go from the wasm generated for this example to assembly (in Wasmer codegen, or in a JS engine if we port it over).

@kripken, do you know what the bottleneck for V8 is?

The Wasm code for horizontal add seems to match the source: a vector add on the low and high halves, then a scalar add on the two lanes of the result. At least some of this is due to the implementation, rather than code generation. For example, we can try to reduce the number of extract_lane instructions by adding odd and even elements with shuffles in between (this is not necessarily faster):

$ cat horizontal-add.cc
typedef float float4 __attribute__((ext_vector_type(4)));

float horizontal_add(float4 x) {
  auto partial_sum = x.xy + x.zw;
  return partial_sum.x + partial_sum.y;
}

float horizontal_add_oddeven(float4 x) {
  float4 y = x.yxwz; // lanes 1 0 3 2
  float4 sum = x + y;
  return (sum + sum.zwxy).x;
}

$ clang++ --target=wasm32 horizontal-add.cc -O3 -msimd128 -S -emit-llvm -o -

; Function Attrs: norecurse nounwind readnone
define hidden float @_Z14horizontal_addDv4_f(<4 x float> %x) local_unnamed_addr #0 {
entry:
  %0 = shufflevector <4 x float> %x, <4 x float> undef, <2 x i32> <i32 0, i32 1>
  %1 = shufflevector <4 x float> %x, <4 x float> undef, <2 x i32> <i32 2, i32 3>
  %add = fadd <2 x float> %0, %1
  %2 = extractelement <2 x float> %add, i32 0
  %3 = extractelement <2 x float> %add, i32 1
  %add1 = fadd float %2, %3
  ret float %add1
}

; Function Attrs: norecurse nounwind readnone
define hidden float @_Z22horizontal_add_oddevenDv4_f(<4 x float> %x) local_unnamed_addr #0 {
entry:
  %0 = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
  %add = fadd <4 x float> %0, %x
  %1 = shufflevector <4 x float> %add, <4 x float> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 undef>
  %add1 = fadd <4 x float> %add, %1
  %2 = extractelement <4 x float> %add1, i32 0
  ret float %2
}

$ clang++ --target=wasm32 horizontal-add.cc -O3 -msimd128 -S -o -

_Z14horizontal_addDv4_f:                # @_Z14horizontal_addDv4_f                                      
        .functype       _Z14horizontal_addDv4_f (v128) -> (f32)                                         
# %bb.0:                                # %entry                                                        
        local.get       0                                                                               
        local.get       0                                                                               
        local.get       0                                                                               
        v8x16.shuffle   8, 9, 10, 11, 12, 13, 14, 15, 0, 0, 0, 0, 0, 0, 0, 0                            
        f32x4.add                                                                                       
        local.tee       0                                                                               
        f32x4.extract_lane      0                                                                       
        local.get       0                                                                               
        f32x4.extract_lane      1                                                                       
        f32.add                                                                                         
                                        # fallthrough-return-value                                      
        end_function                                                                                    

_Z22horizontal_add_oddevenDv4_f:        # @_Z22horizontal_add_oddevenDv4_f                              
        .functype       _Z22horizontal_add_oddevenDv4_f (v128) -> (f32)                                 
# %bb.0:                                # %entry                                                        
        local.get       0                                                                               
        local.get       0                                                                               
        v8x16.shuffle   4, 5, 6, 7, 0, 1, 2, 3, 12, 13, 14, 15, 8, 9, 10, 11                            
        local.get       0                                                                               
        f32x4.add                                                                                       
        local.tee       0                                                                               
        local.get       0                                                                               
        local.get       0                                                                               
        v8x16.shuffle   8, 9, 10, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0                                
        f32x4.add                                                                                       
        f32x4.extract_lane      0                                                                       
                                        # fallthrough-return-value                                      
        end_function                                                                                    

(some extra output removed)

Doing more shuffles does not reduce the bytecode instruction count, because shuffles take two operands on the stack, which results in two local.get instructions (even when only one is used). If there were a single-operand swizzle with a constant mask, the two-shuffle version would be shorter. On the other hand, I am curious what the assembly code in the runtime would look like; there is a chance that local.get would not result in an extra instruction if the local is already in a register.

kripken commented 5 years ago

@kripken, do you know what is the bottleneck for V8?

I don't think there's a specific v8 issue here - it's more that, as in the code sample I gave earlier, the SIMD code is larger than the corresponding scalar code. I'd expect all wasm VMs that run in browsers to do worse on the SIMD version, since they don't run long optimization pipelines the way LLVM does; they are tuned for JITing. LLVM may autovectorize etc., but fast wasm VMs tend to assume the toolchain has already optimized the code heavily. This benchmark shows a counterpoint, where heavy post-wasm optimization is needed.

nlewycky commented 5 years ago

@penzn What in particular would you like to see? Just horizontal_add, or the whole benchmark? As LLVM IR, or as native code? I can do things like emit IR before or after our LLVM optimization passes, and I can also change those passes to examine particular things, like disabling the autovectorizer.

@kripken For the full benchmark the number of Wasm opcodes is 13,355 with SIMD as opposed to 13,429 without SIMD. I also emitted object files with LLVM's autovectorizer enabled and disabled, and compared them, and the only meaningful difference I was able to spot was the vectorization of the divides. I'm not as sure as you that the performance difference is due to heavy optimization by the LLVM backend. (The obvious simple test -- turn off llvm optimizations in Wasmer -- probably wouldn't work because we emit correct-but-slow LLVM IR and rely on its optimizations to fix it up.)

kripken commented 5 years ago

@nlewycky

(The obvious simple test -- turn off llvm optimizations in Wasmer -- probably wouldn't work because we emit correct-but-slow LLVM IR and rely on its optimizations to fix it up.)

Ah too bad, that would have been a good test...

The reverse test that I did, where v8 runs the SIMD version 80% slower, is the main reason I am worried here; the other is the code example I pasted, which I think looks pretty obviously worse in the SIMD version. For those two reasons I still think heavy LLVM optimization is the likeliest explanation here - note, not necessarily autovectorization, but also general LLVM opts.

But I agree it would be good to get more data points. Not sure if there's another VM that can run wasm SIMD yet though.

penzn commented 5 years ago

@nlewycky whatever you think has changed - if horizontal add stays the same across optimizations, then it is probably not useful. Same for LLVM IR vs. native: if the IR is indicative of the improvement, then that would be an interesting thing to see. In short, whatever you think would be useful for somebody to see.

I've built wasmer, but I have not had time to experiment with this benchmark yet.

penzn commented 5 years ago

Ping @nlewycky

Should the results be reproducible with Wasmer 0.6.0 from get.wasmer.io? For some reason I can't get parity between Wasm and native on Ubuntu 18.04 (on a Core i7-7700K). I can see that the Wasi SDK (clang 8) uses the right flags and emits the right instructions, but I can't see inside Wasmer yet.

Is there a flag in the released Wasmer that would let me dump the LLVM IR it is working on?

nlewycky commented 5 years ago

I'm not sure exactly what revision or hardware the benchmarks were run on. Ping @syrusakbary who ran the benchmarks?

I've added flags to Wasmer as of git e29a89a3f81f3a4ddff27c06fab11f495f90bdb5 which should let you examine the LLVM IR and object file:

$ wasmer run --backend=llvm --disable-cache \
    --backend-llvm-object-file particle-repel.o \
    --backend-llvm-post-opt-ir particle-repel.post-opt.ll \
    --backend-llvm-pre-opt-ir particle-repel.pre-opt.ll \
    particle-repel.wasm

The --disable-cache flag ensures that the LLVM backend runs instead of looking up the precompiled result in the Wasmer cache.

We could get another data point after we add SIMD support to Wasmer's "singlepass" backend, which is a "slab JIT" (I don't think this term is standardized -- I mean a JIT that emits a prefab sequence of native instructions per Wasm instruction, and that prefab sequence must be prepared to handle the behaviour of the Wasm instruction in its full generality).
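
To illustrate the term, a hypothetical C++ sketch of the idea (not Wasmer's actual singlepass code; the template table and opcode handling here are made up):

#include <cstddef>
#include <cstdint>
#include <vector>

// One pre-assembled blob of native code per Wasm opcode; each blob must
// handle the opcode's behaviour in full generality.
struct NativeTemplate {
  const uint8_t* bytes;
  size_t len;
};

// Placeholder table: a real slab JIT would fill this with machine code.
static const NativeTemplate kTemplates[256] = {};

// Copy the prefab sequence for each opcode into the output buffer,
// with no optimization across instructions.
std::vector<uint8_t> compile(const std::vector<uint8_t>& opcodes) {
  std::vector<uint8_t> out;
  for (uint8_t op : opcodes) {
    const NativeTemplate& t = kTemplates[op];
    if (t.len) out.insert(out.end(), t.bytes, t.bytes + t.len);
  }
  return out;
}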

syrusakbary commented 5 years ago

I'm not sure exactly what revision or hardware the benchmarks were run on. Ping @syrusakbary who ran the benchmarks?

It seems there has been a large regression in speed (for both the SIMD and non-SIMD examples) between when I tested this repo with my previous version of wasmer-SIMD and now.

The published wasmer 0.6.0 is much slower than what I benchmarked and showcased in the blog post. After a lot of research, I've been able to reproduce my old timings by going back in time to a commit on the SIMD branch.

Speed Analysis

Old commit (fast version)

When using wasmer from the simd-fast branch (280 commits behind current master), the speeds are similar to what the article (and this repo) shows (almost native performance).

How to build wasmer:

# Build wasmer
$ git checkout simd-fast
$ make release-llvm

And then use the wasmer binary from the target/release folder.

Old-version SIMD ~ 5s

$ time ./target/release/wasmer run --backend=llvm particle-repel.wasm # We didn't add the flag --enable-simd yet

Beginning simulation.
5.5555e-21
1.0105e-21
7.70315e-20
1.0108e-17
1.87824e-17
3.62431e-16
2.24578e-14
1.24264e-13
2.40938e-13
3.14344e-13
4.14095e-13
6.33993e-13
4.68646e-10
6.39859e-10
1.15538e-09
2.35684e-10
1.05239e-09
4.05415e-10
7.90286e-10
7.75427e-10
/Users/syrusakbary/Development/wasmer/target/release/wasmer run --backend=llv  6.19s user 0.04s system 99% cpu 5.294 total

Old-version non-SIMD ~ 12s

$ time ./target/release/wasmer run --backend=llvm particle-repel.wasm

Beginning simulation.
5.5555e-21
1.0105e-21
7.70315e-20
1.0108e-17
1.87824e-17
3.62431e-16
2.24578e-14
1.24264e-13
2.40938e-13
3.14344e-13
4.14095e-13
6.33993e-13
4.68646e-10
6.39859e-10
1.15538e-09
2.35684e-10
1.05239e-09
4.05415e-10
7.90286e-10
7.75427e-10
/Users/syrusakbary/Development/wasmer/target/release/wasmer run --backend=llv  12.65s user 0.07s system 99% cpu 12.777 total

Master (slow version)

The current master SIMD version is somehow 2x slower than the old branch.

Master SIMD ~ 20s

$ time wasmer run --backend=llvm --enable-simd build/particle-repel-simd.wasm

Beginning simulation.
5.5555e-21
1.0105e-21
7.70315e-20
1.0108e-17
1.87824e-17
3.62431e-16
2.24578e-14
1.24264e-13
2.40938e-13
3.14344e-13
4.14095e-13
6.33993e-13
4.68646e-10
6.39859e-10
1.15538e-09
2.35684e-10
1.05239e-09
4.05415e-10
7.90286e-10
7.75427e-10
wasmer run --backend=llvm --enable-simd build/particle-repel-simd.wasm  20.75s user 0.11s system 99% cpu 20.991 total

Master non-SIMD ~ 40s

$ time wasmer run --backend=llvm build/particle-repel.wasm

Beginning simulation.
5.5555e-21
1.0105e-21
7.70315e-20
1.0108e-17
1.87824e-17
3.62431e-16
2.24578e-14
1.24264e-13
2.40938e-13
3.14344e-13
4.14095e-13
6.33993e-13
4.68646e-10
6.39859e-10
1.15538e-09
2.35684e-10
1.05239e-09
4.05415e-10
7.90286e-10
7.75427e-10
wasmer run --backend=llvm build/particle-repel.wasm  41.51s user 0.19s system 99% cpu 42.001 total

Native version ~ 5s

When running it locally with the native executable, these are the timings on my machine:

time ./particle-repel
Beginning simulation.
5.5555e-21
1.0105e-21
7.70315e-20
1.0108e-17
1.87824e-17
3.62431e-16
2.24578e-14
1.24264e-13
2.40938e-13
3.14344e-13
4.14095e-13
6.33993e-13
4.68646e-10
6.39859e-10
1.15538e-09
2.35684e-10
1.05239e-09
4.05415e-10
7.90286e-10
7.75427e-10
./particle-repel  5.10s user 0.02s system 99% cpu 5.136 total

syrusakbary commented 5 years ago

As a note, to help @nlewycky debug what's going on, here's the difference between 0.6.0 and the simd-fast branch:

https://github.com/wasmerio/wasmer/compare/simd-fast...0.6.0?expand=1

syrusakbary commented 5 years ago

The slowdown seems to be caused by NaN canonicalization (those changes landed just after the time I tested).

This can hit performance hard. Since we can afford to be non-deterministic on NaNs (thanks to the spec), I've created a PR that reverts the canonicalization in wasmer and brings its speed back up to the numbers in the article :)

https://github.com/wasmerio/wasmer/pull/651

penzn commented 5 years ago

Thank you both for quick responses!

@nlewycky thank you for the flags, that gives me something to work with. I was able to dump IR with this (built from master):

$ wasmer run --backend=llvm --disable-cache \
    --backend-llvm-object-file particle-repel.o \
    --llvm-post-opt-ir particle-repel.post-opt.ll \
    --llvm-pre-opt-ir particle-repel.pre-opt.ll \
    particle-repel.wasm

@syrusakbary that is in line with what I was seeing, thank you! I'll try to build with the patch from the PR that reverts canonicalization. What kind of system are you running this on? I have tried the branch in wasmerio/wasmer#651, but it seems to be on par with master on the machine I am on. However, a build from the simd-fast branch runs much faster (similar to what is described).