Open brycekahle opened 2 weeks ago
Thank you for the report. I reduced example to the following:
$ cat reduced.ll
target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
target triple = "bpf"
define i32 @foo() {
entry:
%0 = load i8, ptr null, align 1
%.fr = freeze i8 %0
%cmp = icmp eq i8 %.fr, 1
%cmp3 = icmp sgt i8 %.fr, 0
%or.cond = or i1 %cmp, %cmp3
br i1 %or.cond, label %cleanup, label %switch.early.test
switch.early.test: ; preds = %entry
%switch.selectcmp.case1 = icmp eq i8 0, 0
br label %cleanup
cleanup: ; preds = %switch.early.test, %entry
ret i32 0
}
$ llc --mtriple=bpf -mcpu=v1 -o - reduced.ll
.text
.file "reduced.ll"
.globl foo # -- Begin function foo
.p2align 3
.type foo,@function
foo: # @foo
.cfi_startproc
# %bb.0: # %entry
r1 = 0
r1 = *(u8 *)(r1 + 0)
r2 = r1
r2 &= 255
if r2 == 1 goto LBB0_2
# %bb.1: # %entry
r1 <<= 56
r1 s>>= 56
if r1 s> 0 goto LBB0_2
LBB0_2: # %cleanup
r0 = 0
exit
.Lfunc_end0:
.size foo, .Lfunc_end0-foo
.cfi_endproc
# -- End function
And llvm-reduce
refuses to simplify it any further. Will take a look.
@yonghong-song , fyi
An update so far. The culprit is the freeze
instruction. Had to dig up this paper to figure out what it is and why is it lowered from selection dag to machine instructions as a register copy (in SelectionDAGISel::Select_FREEZE()
).
If I remove freeze
from the ll example above:
- %0 = load i8, ptr null, align 1
- %.fr = freeze i8 %0
+ %.fr = load i8, ptr null, align 1
Note the transformation chain:
Initial selection DAG:
t3: i8,ch = load<(load (s8) from `ptr null`)> t0, Constant:i64<0>, undef:i64
...
t11: i64 = sign_extend t3
t13: ch = CopyToReg t0, Register:i64 %0, t11
Optimized lowered selection DAG:
t18: i64,ch = load<(load (s8) from `ptr null`), sext from i8> t0, Constant:i64<0>, undef:i64
^^^^
t13: ch = CopyToReg t0, Register:i64 %0, t18
Legalized selection DAG:
t24: i64,ch = load<(load (s8) from `ptr null`), anyext from i8> t0, Constant:i64<0>, undef:i64
^^^^^^
t28: i64 = shl t24, Constant:i64<56>
t29: i64 = sra t28, Constant:i64<56>
t13: ch = CopyToReg t0, Register:i64 %0, t29
Essentially, this matches expression (sign_extend (load ...))
, replaces it with (load sext ...)
and finally replaces that with (load anyext ...)
.
However, in presence of freeze the expression looks as (sign_extend (freeze (load ...)))
and DAG optimization pattern (sign_extend (load ...)) -> (load sext ...)
no longer applies. Hence, sign_extend
operation remains in place and is lowered to machine instructions as &= 255
. As discussed here, it is not correct to convert (sign_extend (freeze (load ...))) -> (freeze (load sext ...))
in a general case:
Freeze followed by sign-extend guarantees all the extended bits are equal. A sign-extending load followed by a freeze does not guarantee that.
On the other hand, such transformation is actually correct for BPF target, because jits do guarantee that sign extending loads would produce same extended bits.
So, two ways forward:
(sign_extend (freeze (load ...))) -> (freeze (load sext ...))
;freeze
is introduced when C example shared in the issue description is optimized.Here is a minimized C repro:
$ cat t.c
int foo(char *buf) {
char current_char;
current_char = *buf;
if (current_char == 13) {
return 1;
} else if (current_char >= 64 || current_char == 32 || current_char == 46) {
return 2;
}
return 0;
}
$ clang -O2 --target=bpf -S test5.c -o -
...
# %bb.0: # %entry
w0 = 1
w1 = *(u8 *)(r1 + 0)
w2 = w1
w2 &= 255 ; <---- not really necessary
...
I'll spend some time on (2).
For the above comments:
Freeze followed by sign-extend guarantees all the extended bits are equal.
. A sign-extending load followed by a freeze does not guarantee that. On the other hand, such transformation is actually correct for BPF target, because jits do guarantee . that sign extending loads would produce same extended bits.
I checked llvm asm reference at https://llvm.org/docs/LangRef.html#freeze-instruction IIUC, 'freeze' intends to stop/freeze undefined/poisoned values. So 'freeze (load nullptr)' will return a value generated by the compiler and 'load (freeze nullptr)' will be equals to 'load nullptr'.
In BPF program, do we really care 'freeze'? Maybe we simply ignore freeze during lowering? Based replace '%.fr = freeze i8 %0' with '%.fr = %0'?
It is still good to know why llvm generate 'freeze' in you above C example. AFAIK, in the above example, 'freeze' is introduced by SimplifyCFG.
The BPF backend seems to be adding bitmasks (
&= 255
in this case, foru8
) after loads, even though we get zero extension for free from loads.See line 9 of the output for 18.1.0 in https://godbolt.org/z/K5M6MhWxs
I
git bisect
ed the introduction of the problem to this commit: https://github.com/llvm/llvm-project/commit/651e644595b72c22fd22f51358cf083146790ed4@eddyz87