rcsaquino opened 1 year ago
This is expected: int64 division on Skylake results in 57 micro-ops and has a throughput of 24 cycles/operation, while the figures for int32 are only 10 micro-ops and 6 cycles/operation. Changing int64 to uint64 also helps, reducing these to 36 micro-ops and 21 cycles/operation. As a result, on 64-bit hardware where int is equivalent to int64, the division throughput is a quarter of that of int32 division, which explains the execution speed differences.
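The gap is easy to observe with a small timing sketch (the loop bound and dividend below are arbitrary choices of mine; on toolchains without a 32-bit fast path, the int loop runs markedly slower than the uint32 one on Intel hardware):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const n = 10_000_000

	// 64-bit signed modulo: on amd64, int is 64 bits wide, so each %
	// compiles to a full 64-bit DIV absent any fast-path optimization.
	var sum64 int
	start := time.Now()
	for i := 1; i <= n; i++ {
		sum64 += (1 << 20) % i
	}
	d64 := time.Since(start)

	// Same arithmetic with uint32 operands: a much cheaper 32-bit DIV.
	var sum32 uint32
	start = time.Now()
	for i := uint32(1); i <= n; i++ {
		sum32 += (1 << 20) % i
	}
	d32 := time.Since(start)

	// Both loops compute the same remainders (modulo 2^32).
	fmt.Println("checksums equal:", uint32(sum64) == sum32)
	fmt.Println("int:", d64, "uint32:", d32)
}
```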
I have also taken a look at the Rust-compiled code; it seems they cleverly check whether the upper 32 bits of both the dividend and the divisor are all zero, and fall back to the faster uint32 division if that is the case. This explains why the Rust implementation is as fast as the JS and V implementations.
.LBB3_10:
mov rax, rbx
xor edx, edx
div rdi // slow path, long division
.LBB3_11:
inc rdi
test rdx, rdx
je .LBB3_12
.LBB3_3:
lea rax, [r12 + rdi]
cmp rax, 2
je .LBB3_4
mov rax, rbx
or rax, rdi // or the dividend and the divisor
shr rax, 32 // get the upper 32-bit of the or
jne .LBB3_10 // shr sets ZF, so this branch is taken if any of the upper 32 bits is non-zero
mov eax, ebx
xor edx, edx
div edi // fast path, 32-bit division
jmp .LBB3_11
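In Go, the branch above can be sketched as follows (a hypothetical helper of mine, not the compiler's actual rewrite): OR the two operands together, and if the upper 32 bits of the result are zero, both values fit in 32 bits, so a cheap 32-bit divide yields the same quotient and remainder.

```go
package main

import "fmt"

// divmod64 mirrors the assembly above: if (a|b)>>32 == 0, both
// operands fit in 32 bits and a 32-bit divide gives identical results.
func divmod64(a, b uint64) (q, r uint64) {
	if (a|b)>>32 == 0 {
		// fast path: 32-bit division
		return uint64(uint32(a) / uint32(b)), uint64(uint32(a) % uint32(b))
	}
	// slow path: full 64-bit division
	return a / b, a % b
}

func main() {
	fmt.Println(divmod64(100, 7))        // fast path: both fit in 32 bits
	fmt.Println(divmod64(1<<40+5, 1000)) // slow path: dividend exceeds 32 bits
}
```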
CC @randall77
Seems fine to put in a similar test + 32-bit path in our code.
My only question is how universal is this? Skylake was mentioned - how does this fare on other processor vendors / families?
It is generally true on x86 that uint32 division is cheaper than int64 and uint64 division, but adding a branch to fast-path the latter into the former would, I think, need more consideration. I believe LLVM does so on all Intel chips while doing the division normally on AMD ones, which seems reasonable and reflects the throughput and latency differences across architectures. Thanks.
Change https://go.dev/cl/482658 mentions this issue: cmd/compile/internal/amd64: improve fix up code for signed division
Change https://go.dev/cl/482656 mentions this issue: cmd/compile: add math benchmarks
Change https://go.dev/cl/482657 mentions this issue: cmd/compile/internal/amd64: simplify code generation for signed division
Change https://go.dev/cl/482659 mentions this issue: cmd/compile: prefer 32 bit unsigned division on amd64
Change https://go.dev/cl/484438 mentions this issue: cmd/compile: prefer 32 bit signed division on amd64
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?
Yes
What did you do?
The mod operator is very slow on Intel hardware when used on the int type.
What did you expect to see?
The mod operator optimized to handle the int type on Intel hardware.
What did you see instead?
The mod operator is very slow when used on the int type on Intel hardware.