rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Rust insufficiently optimizes loop { match { } } state machines #80630

Open tuxmark5 opened 3 years ago

tuxmark5 commented 3 years ago

This is how a typical state machine might look in Rust:

#[inline(never)]
fn print(s: &str) {
    println!("{}", s);
}

#[inline(never)]
fn exec(mut s: i32) {
    loop {
        match s {
            0 => { print("0"); s = 2; },
            1 => { print("1"); s = 3; },
            2 => { print("2"); s = 1; },
            _ => return
        }
    }
}

fn main() {
    exec(0);
}

One would expect the first jump from the function start into one of the states (0, 1, 2, or _) to be compiled as a jump table or a series of cmp/conditional-jump instructions, and that once execution is in one of the states, the remaining jumps would be direct (unconditional) jumps.

This is how GCC optimizes the equivalent code.

However, at the end of each state Rust/LLVM performs needless comparisons to determine which label to jump to next, even though the jump target can be computed statically. You can inspect the generated assembly in the Rust Playground.

This is probably more of an LLVM issue than a Rust one, but it is important to get this fixed/optimized, because right now it is impossible to write such state machines efficiently in Rust (there is no goto).
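Since the complaint centers on the lack of goto, here is a hedged sketch (not from the issue) of one workaround pattern: encoding each state as a function that returns the next state, so dispatch becomes an indirect call instead of re-entering a match. The names StateFn and s0..s3 and the trace vector are invented for illustration; this makes no claim about the resulting codegen.

```rust
// Each state is a function returning the next state (or None to stop).
// A newtype is needed because a fn type cannot directly return itself.
struct StateFn(fn(&mut Vec<&'static str>) -> Option<StateFn>);

fn s0(out: &mut Vec<&'static str>) -> Option<StateFn> { out.push("0"); Some(StateFn(s2)) }
fn s1(out: &mut Vec<&'static str>) -> Option<StateFn> { out.push("1"); Some(StateFn(s3)) }
fn s2(out: &mut Vec<&'static str>) -> Option<StateFn> { out.push("2"); Some(StateFn(s1)) }
fn s3(_: &mut Vec<&'static str>) -> Option<StateFn> { None }

fn exec(start: StateFn) -> Vec<&'static str> {
    let mut state = Some(start);
    let mut out = Vec::new();
    // The dispatch is an indirect call; no central switch is re-entered.
    while let Some(StateFn(f)) = state {
        state = f(&mut out);
    }
    out
}

fn main() {
    // Mirrors exec(0) in the original example: 0 -> 2 -> 1 -> stop.
    assert_eq!(exec(StateFn(s0)), ["0", "2", "1"]);
}
```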

Meta

Tested with all versions available in Rust playground:

the8472 commented 3 years ago

One would expect the first jump from the function start into one of the states (0, 1, 2, or _) to be compiled as a jump table or a series of cmp/conditional-jump instructions

At least that part happens once you add more states.
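To illustrate this observation, a hypothetical extension of the original example with more arms (the extra states and the trace vector are invented for testability; whether the initial dispatch actually becomes a jump table still has to be checked in the generated assembly):

```rust
// Same shape as the original exec(), but with six states instead of three,
// and a trace vector in place of print() so the behavior can be asserted.
#[inline(never)]
fn exec(mut s: i32, out: &mut Vec<i32>) {
    loop {
        match s {
            0 => { out.push(0); s = 2; }
            1 => { out.push(1); s = 3; }
            2 => { out.push(2); s = 1; }
            3 => { out.push(3); s = 4; }
            4 => { out.push(4); s = 5; }
            5 => { out.push(5); s = 6; }
            _ => return,
        }
    }
}

fn main() {
    let mut trace = Vec::new();
    exec(0, &mut trace);
    // 0 -> 2 -> 1 -> 3 -> 4 -> 5 -> stop
    assert_eq!(trace, [0, 2, 1, 3, 4, 5]);
}
```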

nagisa commented 3 years ago

I didn't look too deeply into it, but the IR fed to LLVM around the loop-match construct is not that different between rustc and clang. However, with clang it becomes a switch table at the LLVM IR level, while with Rust we rely on the codegen backend to do so: https://godbolt.org/z/9Erv94

FWIW from what I can tell the additional comparisons can be blamed on s = <new> state setting and the phi that results from it.

oli-obk commented 3 years ago

I thought we already had an optimization for this

    bb1: {
        switchInt(_1) -> [0_i32: bb3, 1_i32: bb5, 2_i32: bb7, otherwise: bb2]; // scope 0 at src/lib.rs:10:13: 10:14
    }
    bb4: {
        StorageDead(_3);                 // scope 0 at src/lib.rs:10:29: 10:30
        StorageDead(_4);                 // scope 0 at src/lib.rs:10:30: 10:31
        StorageDead(_2);                 // scope 0 at src/lib.rs:10:30: 10:31
        _1 = const 2_i32;                // scope 0 at src/lib.rs:10:32: 10:37
        goto -> bb1;                     // scope 0 at src/lib.rs:9:9: 14:10
    }

Just looking locally at this, the goto -> bb1 should become goto -> bb7. Basically, if a block assigns a constant to a local and then jumps to a block that does not touch that local and ends up conditionally branching on it, we can copy all statements from the branching block into the current block and make it an unconditional jump to the right target.
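The rewrite described above can be sketched on a toy block model. This is not rustc's real MIR representation; the types, the single implicit switch local, and the "no other statements" condition modeled by assigns_const are simplifications for illustration only.

```rust
// Toy CFG: a block may end with a trailing `_1 = const k` (modeled by
// assigns_const) before its terminator.
enum Terminator {
    Goto(usize),
    SwitchInt { targets: Vec<(i32, usize)>, otherwise: usize },
    Return,
}

struct Block {
    assigns_const: Option<i32>, // Some(k) models `_1 = const k`
    terminator: Terminator,
}

// If a block ends with `_1 = const k; goto -> target` and `target` switches
// on `_1` without touching it, redirect the goto straight to the arm for k.
fn thread_const_goto(blocks: &mut Vec<Block>) {
    for i in 0..blocks.len() {
        let (k, target) = match (&blocks[i].assigns_const, &blocks[i].terminator) {
            (Some(k), Terminator::Goto(t)) => (*k, *t),
            _ => continue,
        };
        if blocks[target].assigns_const.is_some() {
            continue; // target writes the local; cannot thread through it
        }
        let dest = match &blocks[target].terminator {
            Terminator::SwitchInt { targets, otherwise } => targets
                .iter()
                .find(|(v, _)| *v == k)
                .map(|&(_, t)| t)
                .unwrap_or(*otherwise),
            _ => continue,
        };
        blocks[i].terminator = Terminator::Goto(dest);
    }
}

fn main() {
    // Models the MIR above: bb0 = switch, bb1 = return, bb2..bb4 = arms.
    let mut blocks = vec![
        Block { assigns_const: None, terminator: Terminator::SwitchInt { targets: vec![(0, 2), (1, 3), (2, 4)], otherwise: 1 } },
        Block { assigns_const: None, terminator: Terminator::Return },
        Block { assigns_const: Some(2), terminator: Terminator::Goto(0) },
        Block { assigns_const: Some(3), terminator: Terminator::Goto(0) },
        Block { assigns_const: Some(1), terminator: Terminator::Goto(0) },
    ];
    thread_const_goto(&mut blocks);
    assert!(matches!(blocks[2].terminator, Terminator::Goto(4))); // s = 2 arm
    assert!(matches!(blocks[3].terminator, Terminator::Goto(1))); // s = 3 falls to otherwise
    assert!(matches!(blocks[4].terminator, Terminator::Goto(3))); // s = 1 arm
}
```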

simonvandel commented 3 years ago

AFAICT this is optimized with #80475

nikic commented 3 years ago

LLVM can optimize this with the -C llvm-args=-jump-threading-across-loop-headers option. By default LLVM does not perform jump threading across loop headers, because it cannot reliably determine profitability for this case right now. Doing this naively results in natural loops becoming irreducible, which will pessimize all following loop optimizations.

camelid commented 3 years ago

AFAICT this is optimized with #80475

That PR is merged now. Did that fix this issue?

tuxmark5 commented 3 years ago

Testing it with (2021-03-20 61edfd591cedff66fca6) on the playground shows no improvement.

oli-obk commented 3 years ago

Testing it with (2021-03-20 61edfd5) on the playground shows no improvement.

The playground does not actually run mir optimizations of level 3 or above, so you would need to run this locally with nightly and use -Zmir-opt-level=3
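For reference, a sketch of the local invocations these suggestions imply (the file name example.rs is hypothetical; the flags are the ones mentioned in this thread):

```shell
# Dump the optimized MIR at the opt level the playground does not use:
rustc +nightly -O -Zmir-opt-level=3 --emit=mir example.rs

# Let LLVM thread jumps across loop headers when producing assembly:
rustc +nightly -O -C llvm-args=-jump-threading-across-loop-headers \
    --emit=asm example.rs
```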

Doing this naively results in natural loops becoming irreducible, which will pessimize all following loop optimizations.

This is worrisome. Our optimization as written won't do anything fancy, so it will make loops more complex. The perf results on that optimization are mostly in the noise (ignoring the CTFE stress tests), though without a dedicated compile-time and runtime test, it's hard to tell. Considering that the compiler bootstrap did not regress, I don't think the opt messes with anything too badly.

camelid commented 3 years ago

Testing it with (2021-03-20 61edfd5) on the playground shows no improvement.

The playground does not actually run mir optimizations of level 3 or above, so you would need to run this locally with nightly and use -Zmir-opt-level=3

Or you can use Godbolt: https://rust.godbolt.org/z/doK6W9jPT

(I don't know what the assembly is supposed to look like, so I can't tell if this is fixed or not.)

tuxmark5 commented 3 years ago

Unfortunately it still doesn't seem fixed. Assembly from godbolt link:

example::exec:  << The start of exec() function in rust
        sub     rsp, 8
        mov     dword ptr [rsp + 4], edi
.LBB2_1: << The block that implements the 'switch' 
        mov     eax, dword ptr [rsp + 4]
        mov     dword ptr [rsp], eax
        test    eax, eax
        je      .LBB2_3
        jmp     .LBB2_9
.LBB2_9: 
        mov     eax, dword ptr [rsp]
        sub     eax, 1
        je      .LBB2_5
        jmp     .LBB2_10
.LBB2_10:
        mov     eax, dword ptr [rsp]
        sub     eax, 2
        je      .LBB2_7
        jmp     .LBB2_2
.LBB2_2:
        pop     rax
        ret
.LBB2_3: << case 0
        lea     rdi, [rip + .L__unnamed_2]
        mov     esi, 1
        call    qword ptr [rip + example::print@GOTPCREL]
        mov     dword ptr [rsp + 4], 2 << explicit store to stack: s = 2
        jmp     .LBB2_1 << BAD: jumps back to the 'switch', but s value is known
.LBB2_5: << case 1
        lea     rdi, [rip + .L__unnamed_3]
        mov     esi, 1
        call    qword ptr [rip + example::print@GOTPCREL]
        mov     dword ptr [rsp + 4], 3 << explicit store to stack: s = 3
        jmp     .LBB2_1 << BAD: jumps back to the 'switch' , but s value is known
.LBB2_7: << case 2
        lea     rdi, [rip + .L__unnamed_4]
        mov     esi, 1
        call    qword ptr [rip + example::print@GOTPCREL]
        mov     dword ptr [rsp + 4], 1 << explicit store to stack: s = 1
        jmp     .LBB2_1 << BAD: jumps back to the 'switch' , but s value is known

I've annotated it a bit with << to make it easier to understand. Basically all the match arms eventually jump back to the 'switch' block, where s is compared/tested.

oli-obk commented 3 years ago

Maybe LLVM "de-optimizes" our MIR again. Not sure, we'll have to look at the MIR before LLVM, at LLVM-IR without opts and at LLVM-IR with opts to be sure. I'm not sure how to coax godbolt into doing MIR dumps though.

the8472 commented 3 years ago

Add --emit=mir

tuxmark5 commented 3 years ago

The same can be seen in MIR too:

bb0: {
    switchInt(_1) -> [0_i32: bb2, 1_i32: bb4, 2_i32: bb6, otherwise: bb1]; // scope 0 at ./example.rs:10:13: 10:14
}

bb1: {
    return;                          // scope 0 at ./example.rs:16:2: 16:2
}

bb2: {
    _4 = const "0";                  // scope 0 at ./example.rs:10:26: 10:29
    _3 = _4;                         // scope 0 at ./example.rs:10:26: 10:29
    _2 = print(move _3) -> bb3;      // scope 0 at ./example.rs:10:20: 10:30
}

bb3: {
    _1 = const 2_i32;                // scope 0 at ./example.rs:10:32: 10:37
    goto -> bb0;                     // scope 0 at ./example.rs:9:9: 14:10
}

bb4: {
    _7 = const "1";                  // scope 0 at ./example.rs:11:26: 11:29
    _6 = _7;                         // scope 0 at ./example.rs:11:26: 11:29
    _5 = print(move _6) -> bb5;      // scope 0 at ./example.rs:11:20: 11:30
}

bb5: {
    _1 = const 3_i32;                // scope 0 at ./example.rs:11:32: 11:37
    goto -> bb0;                     // scope 0 at ./example.rs:9:9: 14:10
}

bb6: {
    _10 = const "2";                 // scope 0 at ./example.rs:12:26: 12:29
    _9 = _10;                        // scope 0 at ./example.rs:12:26: 12:29
    _8 = print(move _9) -> bb7;      // scope 0 at ./example.rs:12:20: 12:30
}

bb7: {
    _1 = const 1_i32;                // scope 0 at ./example.rs:12:32: 12:37
    goto -> bb0;                     // scope 0 at ./example.rs:9:9: 14:10
}

After seeing the MIR I thought maybe the issue was the extra "suffix" block after each print operation, so I swapped the assignment and the print (instead of doing print("0"); s = 2; in the match arms I tried s = 2; print("0");). This yielded simpler MIR, but the same issue was still present:

bb0: {
    switchInt(_1) -> [0_i32: bb2, 1_i32: bb3, 2_i32: bb4, otherwise: bb1]; // scope 0 at ./example.rs:10:13: 10:14
}

bb1: {
    return;                          // scope 0 at ./example.rs:16:2: 16:2
}

bb2: {
    _1 = const 2_i32;                // scope 0 at ./example.rs:10:20: 10:25
    _4 = const "0";                  // scope 0 at ./example.rs:10:33: 10:36
    _3 = _4;                         // scope 0 at ./example.rs:10:33: 10:36
    _2 = print(move _3) -> bb0;      // scope 0 at ./example.rs:10:27: 10:37
}

bb3: {
    _1 = const 3_i32;                // scope 0 at ./example.rs:11:20: 11:25
    _7 = const "1";                  // scope 0 at ./example.rs:11:33: 11:36
    _6 = _7;                         // scope 0 at ./example.rs:11:33: 11:36
    _5 = print(move _6) -> bb0;      // scope 0 at ./example.rs:11:27: 11:37
}

bb4: {
    _1 = const 1_i32;                // scope 0 at ./example.rs:12:20: 12:25
    _10 = const "2";                 // scope 0 at ./example.rs:12:33: 12:36
    _9 = _10;                        // scope 0 at ./example.rs:12:33: 12:36
    _8 = print(move _9) -> bb0;      // scope 0 at ./example.rs:12:27: 12:37
}
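For reference, a reconstruction of the reordered source described above (assignment before the call). The trace vector and the record helper are stand-ins for println!/print so the state sequence can be asserted; record plays the role of the opaque non-inlined call.

```rust
// Opaque call, analogous to print() in the original example.
#[inline(never)]
fn record(out: &mut Vec<&'static str>, s: &'static str) {
    out.push(s);
}

// State update happens before the call, matching the second MIR dump.
#[inline(never)]
fn exec(mut s: i32) -> Vec<&'static str> {
    let mut out = Vec::new();
    loop {
        match s {
            0 => { s = 2; record(&mut out, "0"); }
            1 => { s = 3; record(&mut out, "1"); }
            2 => { s = 1; record(&mut out, "2"); }
            _ => return out,
        }
    }
}

fn main() {
    // Same visit order as the original: 0 -> 2 -> 1 -> stop.
    assert_eq!(exec(0), ["0", "2", "1"]);
}
```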
oli-obk commented 3 years ago

I just looked at the optimization itself. I do not think the issue is related to the print. In fact the order as written is the only one the optimization targets right now. So I'm wondering if the MIR that the optimization sees has more statements that get filtered out later. In order to debug this I believe we need some new mir-opt tests and potentially some tracing statements in the optimization

cynecx commented 3 years ago

A new optimization pass is currently in review upstream: https://reviews.llvm.org/D99205

nikic commented 3 years ago

With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E

tuxmark5 commented 3 years ago

With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E

Optimization breaks if you mark exec as pub.

EDIT: It only seems to work when LLVM can infer the initial value/state of the state machine, which doesn't happen often in practice when dealing with state machines. Marking exec as pub means it can be called from other compilation units, so there is no way to know the initial value of s.
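The dependence on a known initial state can be reproduced without pub by hiding the argument behind std::hint::black_box (a hypothetical demonstration, not from the thread; the trace vector again replaces print). The observable behavior is unchanged; only the optimizer's knowledge of the initial value is affected.

```rust
use std::hint::black_box;

#[inline(never)]
fn exec(mut s: i32, out: &mut Vec<i32>) {
    loop {
        match s {
            0 => { out.push(0); s = 2; }
            1 => { out.push(1); s = 3; }
            2 => { out.push(2); s = 1; }
            _ => return,
        }
    }
}

fn main() {
    let mut trace = Vec::new();
    // black_box makes the initial state opaque, much like a caller from
    // another compilation unit would.
    exec(black_box(0), &mut trace);
    assert_eq!(trace, [0, 2, 1]);
}
```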

amehsan commented 2 years ago

With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E

Optimization breaks if you mark exec as pub.

EDIT: It only seems to work when LLVM can infer the initial value/state of the state machine, which doesn't happen often in practice when dealing with state machines. Marking exec as pub means it can be called from other compilation units, so there is no way to know the initial value of s.

I have not looked into this test case (just saw this issue today), but the limitation that you have mentioned here is not hard to fix. We will look into it and get back to you

bryanpkc commented 1 year ago

@tuxmark5 Does D124394 solve this issue for you?

tuxmark5 commented 1 year ago

@bryanpkc This looks like exactly what I need.