cmd/compile: mod operator is very slow on intel hardware with int64

golang / go

The Go programming language

https://go.dev

BSD 3-Clause "New" or "Revised" License

123.73k stars 17.62k forks source link

cmd/compile: mod operator is very slow on intel hardware with int64 #59089

Open rcsaquino opened 1 year ago

rcsaquino commented 1 year ago

What version of Go are you using (`go version`)?

$ go version
go1.20.2 windows/amd64

Does this issue reproduce with the latest release?

Yes

What did you do?

Mod operator is very slow on intel hardware when used on int type.

What did you expect to see?

Mod operator optimized to handle int types on intel hardwares

What did you see instead?

The mod operator is very slow when used on int types on intel hardware.

merykitty commented 1 year ago

This is expected, int64 division on Skylake results in 57 micro-ops and has a throughput of 24 cycles/operation, while the figures for int32 are only 10 micro-ops and 6 cycles/operation, change int64 to uint64 also helps reduce these to 36 micro-ops and 21 cycles/operation. As a result, on 64-bit hardware where int is equivalent to int64 the division throughput is a quarter of that of int32 division. Which explains the execution speed differences.

merykitty commented 1 year ago

I have also taken a look at the Rust compiled code, it seems they cleverly check if the upper 32-bits of both the dividend and the divisor are all zero, and fall to faster uint32 division if that is the case. This explains why the Rust implementation is as fast as the JS and V implementations.

.LBB3_10:
    mov     rax, rbx
    xor     edx, edx
    div     rdi.     // slow path, long division
.LBB3_11:
    inc     rdi
    test    rdx, rdx
    je      .LBB3_12
.LBB3_3:
    lea     rax, [r12 + rdi]
    cmp     rax, 2
    je      .LBB3_4
    mov     rax, rbx
    or      rax, rdi // or the dividend and the divisor
    shr     rax, 32  // get the upper 32-bit of the or
    jne     .LBB3_10 // shr set ZF so this is taken if any of the aforementioned 32 bits is non-zero
    mov     eax, ebx
    xor     edx, edx
    div     edi.     // fast path, integer division
    jmp     .LBB3_11

ianlancetaylor commented 1 year ago

CC @randall77

randall77 commented 1 year ago

Seems fine to put in a similar test + 32-bit path in our code.

My only question is how universal is this? Skylake was mentioned - how does this fare on other processor vendors / families?

merykitty commented 1 year ago

It is generally true on x86 that an uint32 division is cheaper than int64 and uint64 ones, but to add a branch to fastpath the latters into the former I think would need more consideration. I think LLVM does so on all intel chips while doing the division normally on AMD ones, which seems reasonable and reflects the throughput and latency differences on architectures. Thanks.