Open tuxmark5 opened 3 years ago
One would expect that the first jump from function start into one of the states (0, 1, 2 or _) would be compiled as a jumptable or a series of cmp/conditional jump instructions
Atleast that part happens once you add more states.
Didn't look too deeply into it but the IR fed to LLVM around the loop-match construct is not that different between rustc and clang. However with clang it becomes a switchtable at llvm-ir level, and we rely on the codegen backend to do so with rust https://godbolt.org/z/9Erv94
FWIW from what I can tell the additional comparisons can be blamed on s = <new>
state setting and the phi that results from it.
I thought we already had an optimization for this
bb1: {
switchInt(_1) -> [0_i32: bb3, 1_i32: bb5, 2_i32: bb7, otherwise: bb2]; // scope 0 at src/lib.rs:10:13: 10:14
}
bb4: {
StorageDead(_3); // scope 0 at src/lib.rs:10:29: 10:30
StorageDead(_4); // scope 0 at src/lib.rs:10:30: 10:31
StorageDead(_2); // scope 0 at src/lib.rs:10:30: 10:31
_1 = const 2_i32; // scope 0 at src/lib.rs:10:32: 10:37
goto -> bb1; // scope 0 at src/lib.rs:9:9: 14:10
}
just looking locally at this, the goto -> bb1
should become goto -> bb7
. Basically if a block assigns a constant to a local and then jumps to a block which does not touch that local and ends up conditionally branching on the local, then we can copy all statements from the branching block to the current block and make it an unconditional jump to the right target.
AFAICT this is optimized with #80475
LLVM can optimize this with the -C llvm-args=-jump-threading-across-loop-headers
option. By default LLVM does not perform jump threading across loop headers, because it cannot reliably determine profitability for this case right now. Doing this naively results in natural loops becoming irreducible, which will pessimize all following loop optimizations.
AFAICT this is optimized with #80475
That PR is merged now. Did that fix this issue?
Testing it with (2021-03-20 61edfd591cedff66fca6) on the playground shows no improvement.
Testing it with (2021-03-20 61edfd5) on the playground shows no improvement.
The playground does not actually run mir optimizations of level 3 or above, so you would need to run this locally with nightly and use -Zmir-opt-level=3
Doing this naively results in natural loops becoming irreducible, which will pessimize all following loop optimizations.
This is worrysome. Our optimization as written won't do anything fancy, so it will make loops more complex. The perf results on that optimization are mostly in the noise (ignoring the CTFE stress tests), though without a dedicated compile-time and runtime test, it's hard to tell. Considering that the compiler bootstrap did not regress, I don't think the opt messes with anything too badly.
Testing it with (2021-03-20 61edfd5) on the playground shows no improvement.
The playground does not actually run mir optimizations of level 3 or above, so you would need to run this locally with nightly and use -Zmir-opt-level=3
Or you can use Godbolt: https://rust.godbolt.org/z/doK6W9jPT
(I don't what the Assembly is supposed to look like so I can't tell if this is fixed or not.)
Unfortunately it still doesn't seem fixed. Assembly from godbolt link:
example::exec: << The start of exec() function in rust
sub rsp, 8
mov dword ptr [rsp + 4], edi
.LBB2_1: << The block that implements the 'switch'
mov eax, dword ptr [rsp + 4]
mov dword ptr [rsp], eax
test eax, eax
je .LBB2_3
jmp .LBB2_9
.LBB2_9:
mov eax, dword ptr [rsp]
sub eax, 1
je .LBB2_5
jmp .LBB2_10
.LBB2_10:
mov eax, dword ptr [rsp]
sub eax, 2
je .LBB2_7
jmp .LBB2_2
.LBB2_2:
pop rax
ret
.LBB2_3: << case 0
lea rdi, [rip + .L__unnamed_2]
mov esi, 1
call qword ptr [rip + example::print@GOTPCREL]
mov dword ptr [rsp + 4], 2 << explicit store to stack: s = 2
jmp .LBB2_1 << BAD: jumps back to the 'switch', but s value is known
.LBB2_5: << case 1
lea rdi, [rip + .L__unnamed_3]
mov esi, 1
call qword ptr [rip + example::print@GOTPCREL]
mov dword ptr [rsp + 4], 3 << explicit store to stack: s = 3
jmp .LBB2_1 << BAD: jumps back to the 'switch' , but s value is known
.LBB2_7: << case 2
lea rdi, [rip + .L__unnamed_4]
mov esi, 1
call qword ptr [rip + example::print@GOTPCREL]
mov dword ptr [rsp + 4], 1 << explicit store to stack: s = 1
jmp .LBB2_1 << BAD: jumps back to the 'switch' , but s value is known
I've annotated it a bit with <<
to make it easier to understand. Basically all the match arms eventually jump back to the 'switch' block, where s
is compared/tested.
Maybe LLVM "de-optimizes" our MIR again. Not sure, we'll have to look at the MIR before LLVM, at LLVM-IR without opts and at LLVM-IR with opts to be sure. I'm not sure how to coax godbolt into doing MIR dumps though.
Add --emit=mir
The same can be seen in MIR too:
bb0: {
switchInt(_1) -> [0_i32: bb2, 1_i32: bb4, 2_i32: bb6, otherwise: bb1]; // scope 0 at ./example.rs:10:13: 10:14
}
bb1: {
return; // scope 0 at ./example.rs:16:2: 16:2
}
bb2: {
_4 = const "0"; // scope 0 at ./example.rs:10:26: 10:29
_3 = _4; // scope 0 at ./example.rs:10:26: 10:29
_2 = print(move _3) -> bb3; // scope 0 at ./example.rs:10:20: 10:30
}
bb3: {
_1 = const 2_i32; // scope 0 at ./example.rs:10:32: 10:37
goto -> bb0; // scope 0 at ./example.rs:9:9: 14:10
}
bb4: {
_7 = const "1"; // scope 0 at ./example.rs:11:26: 11:29
_6 = _7; // scope 0 at ./example.rs:11:26: 11:29
_5 = print(move _6) -> bb5; // scope 0 at ./example.rs:11:20: 11:30
}
bb5: {
_1 = const 3_i32; // scope 0 at ./example.rs:11:32: 11:37
goto -> bb0; // scope 0 at ./example.rs:9:9: 14:10
}
bb6: {
_10 = const "2"; // scope 0 at ./example.rs:12:26: 12:29
_9 = _10; // scope 0 at ./example.rs:12:26: 12:29
_8 = print(move _9) -> bb7; // scope 0 at ./example.rs:12:20: 12:30
}
bb7: {
_1 = const 1_i32; // scope 0 at ./example.rs:12:32: 12:37
goto -> bb0; // scope 0 at ./example.rs:9:9: 14:10
}
After seeing the MIR I thought maybe the issue was the extra "suffix" block after each print operation, so i switcher around assignment and print (instead of doing print("0"); s = 2;
in match arms I tried s = 2; print("0");
). This yielded simpler MIR, but still the same issue was present:
bb0: {
switchInt(_1) -> [0_i32: bb2, 1_i32: bb3, 2_i32: bb4, otherwise: bb1]; // scope 0 at ./example.rs:10:13: 10:14
}
bb1: {
return; // scope 0 at ./example.rs:16:2: 16:2
}
bb2: {
_1 = const 2_i32; // scope 0 at ./example.rs:10:20: 10:25
_4 = const "0"; // scope 0 at ./example.rs:10:33: 10:36
_3 = _4; // scope 0 at ./example.rs:10:33: 10:36
_2 = print(move _3) -> bb0; // scope 0 at ./example.rs:10:27: 10:37
}
bb3: {
_1 = const 3_i32; // scope 0 at ./example.rs:11:20: 11:25
_7 = const "1"; // scope 0 at ./example.rs:11:33: 11:36
_6 = _7; // scope 0 at ./example.rs:11:33: 11:36
_5 = print(move _6) -> bb0; // scope 0 at ./example.rs:11:27: 11:37
}
bb4: {
_1 = const 1_i32; // scope 0 at ./example.rs:12:20: 12:25
_10 = const "2"; // scope 0 at ./example.rs:12:33: 12:36
_9 = _10; // scope 0 at ./example.rs:12:33: 12:36
_8 = print(move _9) -> bb0; // scope 0 at ./example.rs:12:27: 12:37
}
I just looked at the optimization itself. I do not think the issue is related to the print
. In fact the order as written is the only one the optimization targets right now. So I'm wondering if the MIR that the optimization sees has more statements that get filtered out later. In order to debug this I believe we need some new mir-opt tests and potentially some tracing statements in the optimization
A new optimization pass is currently in review upstream: https://reviews.llvm.org/D99205
With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E
With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E
Optimization breaks if you mark exec
as pub
.
EDIT: It only seems to work, when LLVM can infer the initial value/state of the state machine, which doesn't happen often in practice when dealing with state machines. Marking exec
as pub
means that it can be called from other compilation units, hence there is no way to know the initial value of the s
.
With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E
Optimization breaks if you mark
exec
aspub
.EDIT: It only seems to work, when LLVM can infer the initial value/state of the state machine, which doesn't happen often in practice when dealing with state machines. Marking
exec
aspub
means that it can be called from other compilation units, hence there is no way to know the initial value of thes
.
I have not looked into this test case (just saw this issue today), but the limitation that you have mentioned here is not hard to fix. We will look into it and get back to you
@bryanpkc This looks like exactly what I need.
This is how a typical state machine might look in Rust:
One would expect that the first jump from function start into one of the states (0, 1, 2 or _) would be compiled as a jumptable or a series of cmp/conditional jump instructions and once the execution is in one of the states the remaining jumps would be deterministic (non-conditional) jumps.
And this is how equivalent code in gcc is optimized.
However Rust/LLVM at the end of each state performs needless comparisons to determine to which label to jump to next, even though the jump target can be computed statically. You can inspect the generated assembly in Rust Playground.
This is probably more of a LLVM issue rather than Rust directly, but it's important to get this fixed/optimized, because now it's impossible to efficiently write such state machines in Rust (because there is no
goto
)Meta
Tested with all versions available in Rust playground: