optimize u128/i128 popcounts further

sadlerap commented 1 year ago

Don't fall back on breaking apart the popcount operation if 128-bit integers are natively supported.

antoyo commented 1 year ago

Thanks!

Do you think it would be worth trying to make it work even for non-native integers by using the operations from the int module as I suggested earlier?

sadlerap commented 1 year ago

I'll give it a try - if it ends up working, I'll replace what I've posted here with whatever works. If not, I'll let you know what fails.

sadlerap commented 1 year ago

I think I got a bug in libgccjit.

When I compile with this patch, I get the following error:

Details

``` rustc_codegen_gcc is build for rustc 1.75.0-nightly (97c81e1b5 2023-10-07) but the default rustc version is rustc 1.75.0-nightly (49691b1f7 2023-10-16). Using rustc 1.75.0-nightly (97c81e1b5 2023-10-07). Compiling minmax_test v0.1.0 (/home/andy/repos/minmax_test) libgccjit.so: warning: command-line option ‘-x86-asm-syntax=intel’ is valid for Modula-2 but not for error: rustc interrupted by SIGSEGV, printing backtrace /home/andy/.rustup/toolchains/nightly-2023-10-08-x86_64-unknown-linux-gnu/lib/librustc_driver-3558b9258144fc36.so(+0x2aff526)[0x7f00466ff526] /usr/lib/libc.so.6(+0x3e710)[0x7f00438c6710] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x592255)[0x7f0038392255] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x592f78)[0x7f0038392f78] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x597d84)[0x7f0038397d84] ### cycle encountered after 5 frames with period 6 /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x592780)[0x7f0038392780] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x59e41d)[0x7f003839e41d] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x5930a4)[0x7f00383930a4] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x595f8d)[0x7f0038395f8d] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x595fc7)[0x7f0038395fc7] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x597e87)[0x7f0038397e87] ### recursed 41 times /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x592780)[0x7f0038392780] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x59e41d)[0x7f003839e41d] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x5930a4)[0x7f00383930a4] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x595f8d)[0x7f0038395f8d] /home/andy/repos/gcc/build/gcc/libgccjit.so.0(+0x595fc7)[0x7f0038395fc7] note: rustc unexpectedly overflowed its stack! this is a bug note: maximum backtrace depth reached, frames may have been lost note: we would appreciate a report at https://github.com/rust-lang/rust note: backtrace dumped due to SIGSEGV! resuming signal error: could not compile `minmax_test` (lib) Caused by: process didn't exit successfully: `/home/andy/.rustup/toolchains/nightly-2023-10-08-x86_64-unknown-linux-gnu/bin/rustc --crate-name minmax_test --edition=2021 src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --diagnostic-width=141 --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --emit asm -Ccodegen-units=1 -C llvm-args=-x86-asm-syntax=intel -Cdebuginfo=2 -C metadata=19b01f69af993630 -C extra-filename=-19b01f69af993630 --out-dir /home/andy/repos/minmax_test/target/release/deps -L dependency=/home/andy/repos/minmax_test/target/release/deps -Csymbol-mangling-version=v0 -Cdebuginfo=2 -Clto=off -Zcodegen-backend=/home/andy/repos/rustc_codegen_gcc/target/release/librustc_codegen_gcc.so --sysroot /home/andy/repos/rustc_codegen_gcc/build_sysroot/sysroot` (signal: 11, SIGSEGV: invalid memory reference) Cargo failed with exit status: 101 ```

antoyo commented 1 year ago

Indeed. Good find! You can see in the CI that it's because you use an array for a binary operation. I presume it's in the call to add_assignment_op(). libgccjit doesn't check that the values in this call are numeric. I'll fix this! Thanks!

sadlerap commented 1 year ago

hmm, I wonder if replacing the assignment ops with just plain assignments would work. I'll give it a try.

antoyo commented 1 year ago

I added the type check in this commit.

Yeah, that might work indeed.

sadlerap commented 1 year ago

Sadly, while subbing out assignment ops with pure assignments works around the bug, it also results in poor codegen when 128-bit integers aren't supported and the fallback is skipped.

Details

Before: ```asm popcount_128: xor eax, eax popcnt rax, rsi test rsi, rsi cmove rax, rsi xor ecx, ecx popcnt rcx, rdi mov rdx, rax add rax, rcx test rdi, rdi cmove rax, rdx ret ``` After: ```asm popcount_128: push r14 mov rax, rsi push r13 push r12 push rbp push rbx sub rsp, 32 or rax, rdi mov QWORD PTR 16[rsp], rdi mov QWORD PTR 24[rsp], rsi vmovdqa xmm3, XMMWORD PTR 16[rsp] vmovdqa XMMWORD PTR [rsp], xmm3 je .L67 mov rbp, rsi mov rbx, rdi xor r12d, r12d .L66: mov r14, QWORD PTR [rsp] mov r13, QWORD PTR 8[rsp] xor ecx, ecx mov edx, 1 add r12d, 1 mov rdi, r14 mov rsi, r13 call __rust_u128_sub@PLT xor ecx, ecx mov edx, 1 mov rdi, r14 mov rsi, r13 and rbx, rax mov QWORD PTR 16[rsp], rax call __rust_u128_sub@PLT vmovq xmm0, QWORD PTR 16[rsp] mov rax, rbx and rbp, rdx vpinsrq xmm0, xmm0, rdx, 1 vpand xmm1, xmm0, XMMWORD PTR [rsp] or rax, rbp vmovdqa XMMWORD PTR [rsp], xmm1 jne .L66 mov eax, r12d .L63: add rsp, 32 pop rbx pop rbp pop r12 pop r13 pop r14 ret .L67: xor eax, eax jmp .L63 ```

However, when 128-bit integers are supported, we do see the improvement as expected.

Details

Before: ```asm popcount_128: xor eax, eax popcnt rax, rsi test rsi, rsi cmove rax, rsi xor ecx, ecx popcnt rcx, rdi mov rdx, rax add rax, rcx test rdi, rdi cmove rax, rdx ret ``` After: ```asm popcount_128: mov rdx, rdi or rdx, rsi je .L65 xor ecx, ecx popcnt rsi, rsi popcnt rcx, rdi lea eax, [rsi+rcx] ret .L65: xor eax, eax ret ```

I think if we gate the fallback behind native support for 128-bit integers, we'll get the best of both worlds - good-ish codegen when 128-bit integers are supported, and a roughly as-good fallback when they aren't.

antoyo commented 1 year ago

Do you know (or did you do some benchmarks to check) if having one jump is indeed faster than the 2 conditional moves?

sadlerap commented 1 year ago

I wrote some microbenchmarks with criterion, but they seemed to get stuck on the anaylsis step (let it run for about 15 minutes and it still wasn't done, whereas the same code compiled with llvm completed in a few seconds). I'll try some other options, see if I get more luck there.

sadlerap commented 1 year ago

Alright, I think I have some solid numbers (at least for my machine).

My benchmarks run popcount 1000 times on the same number and add the results together. This is done to inflate iteration times a bit - with an operation as small as this, it's hard to get good numbers for how long it takes to complete.

Before:

test tests::bench_popcount_u128_0       ... bench:         668 ns/iter (+/- 57)
test tests::bench_popcount_u128_1       ... bench:         680 ns/iter (+/- 9)
test tests::bench_popcount_u128_10      ... bench:         683 ns/iter (+/- 12)
test tests::bench_popcount_u128_minus_1 ... bench:         705 ns/iter (+/- 16)

After:

test tests::bench_popcount_u128_0       ... bench:         580 ns/iter (+/- 5)
test tests::bench_popcount_u128_1       ... bench:         592 ns/iter (+/- 91)
test tests::bench_popcount_u128_10      ... bench:         595 ns/iter (+/- 186)
test tests::bench_popcount_u128_minus_1 ... bench:         593 ns/iter (+/- 12)

Source:

#![feature(test)]

pub fn popcount_128(a: u128) -> u32 {
    a.count_ones()
}

#[cfg(test)]
pub mod tests {
    use super::*;
    use test::{Bencher, black_box};

    #[bench]
    fn bench_popcount_u128_0(b: &mut Bencher) {
        let x = black_box(0u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_1(b: &mut Bencher) {
        let x = black_box(1u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_10(b: &mut Bencher) {
        let x = black_box(10u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_minus_1(b: &mut Bencher) {
        let x = black_box(-1i128 as u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

}

Not sure if other processor types will see similar results - I'm on a R7 5800X, but you might see different results on Intel processors/older AMD processors.

I think a lot of the branch/conditional moves will go away once #351 gets resolved, so this might not even matter in the long term.

antoyo commented 1 year ago

Thanks a lot for your work!

rust-lang / rustc_codegen_gcc

optimize u128/i128 popcounts further #354