BPF backend adding unnecessary bitmasks after loads

brycekahle commented 2 weeks ago

The BPF backend seems to be adding bitmasks (&= 255 in this case, for u8) after loads, even though we get zero extension for free from loads.

See line 9 of the output for 18.1.0 in https://godbolt.org/z/K5M6MhWxs

I git bisected the introduction of the problem to this commit: https://github.com/llvm/llvm-project/commit/651e644595b72c22fd22f51358cf083146790ed4

@eddyz87

eddyz87 commented 2 weeks ago

Thank you for the report. I reduced example to the following:

$ cat reduced.ll
target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
target triple = "bpf"

define i32 @foo() {
entry:
  %0 = load i8, ptr null, align 1
  %.fr = freeze i8 %0
  %cmp = icmp eq i8 %.fr, 1
  %cmp3 = icmp sgt i8 %.fr, 0
  %or.cond = or i1 %cmp, %cmp3
  br i1 %or.cond, label %cleanup, label %switch.early.test

switch.early.test:                                ; preds = %entry
  %switch.selectcmp.case1 = icmp eq i8 0, 0
  br label %cleanup

cleanup:                                          ; preds = %switch.early.test, %entry
  ret i32 0
}

$ llc --mtriple=bpf -mcpu=v1 -o - reduced.ll
    .text
    .file   "reduced.ll"
    .globl  foo                             # -- Begin function foo
    .p2align    3
    .type   foo,@function
foo:                                    # @foo
    .cfi_startproc
# %bb.0:                                # %entry
    r1 = 0
    r1 = *(u8 *)(r1 + 0)
    r2 = r1
    r2 &= 255
    if r2 == 1 goto LBB0_2
# %bb.1:                                # %entry
    r1 <<= 56
    r1 s>>= 56
    if r1 s> 0 goto LBB0_2
LBB0_2:                                 # %cleanup
    r0 = 0
    exit
.Lfunc_end0:
    .size   foo, .Lfunc_end0-foo
    .cfi_endproc
                                        # -- End function

And llvm-reduce refuses to simplify it any further. Will take a look.

eddyz87 commented 2 weeks ago

@yonghong-song , fyi

eddyz87 commented 1 week ago

An update so far. The culprit is the freeze instruction. Had to dig up this paper to figure out what it is and why is it lowered from selection dag to machine instructions as a register copy (in SelectionDAGISel::Select_FREEZE()).

If I remove freeze from the ll example above:

-  %0 = load i8, ptr null, align 1
-  %.fr = freeze i8 %0
+  %.fr = load i8, ptr null, align 1

Instruction selection log looks as follows.

``` $ llc -debug-only=isel --mtriple=bpf -mcpu=v2 -o - < ~/work/tmp/unnecessary-bitmask.ll Initial selection DAG: %bb.0 'foo:entry' SelectionDAG has 18 nodes: t0: ch,glue = EntryToken t3: i8,ch = load<(load (s8) from `ptr null`)> t0, Constant:i64<0>, undef:i64 t6: i1 = setcc t3, Constant:i8<1>, seteq:ch t9: i1 = setcc t3, Constant:i8<0>, setgt:ch t10: i1 = or t6, t9 t11: i64 = sign_extend t3 t13: ch = CopyToReg t0, Register:i64 %0, t11 t15: ch = brcond t13, t6, BasicBlock:ch t17: ch = br t15, BasicBlock:ch Optimized lowered selection DAG: %bb.0 'foo:entry' SelectionDAG has 12 nodes: t0: ch,glue = EntryToken t18: i64,ch = load<(load (s8) from `ptr null`), sext from i8> t0, Constant:i64<0>, undef:i64 t13: ch = CopyToReg t0, Register:i64 %0, t18 t21: ch = br_cc t13, seteq:ch, t18, Constant:i64<1>, BasicBlock:ch t17: ch = br t21, BasicBlock:ch Type-legalized selection DAG: %bb.0 'foo:entry' SelectionDAG has 12 nodes: t0: ch,glue = EntryToken t18: i64,ch = load<(load (s8) from `ptr null`), sext from i8> t0, Constant:i64<0>, undef:i64 t13: ch = CopyToReg t0, Register:i64 %0, t18 t21: ch = br_cc t13, seteq:ch, t18, Constant:i64<1>, BasicBlock:ch t17: ch = br t21, BasicBlock:ch Legalized selection DAG: %bb.0 'foo:entry' SelectionDAG has 15 nodes: t0: ch,glue = EntryToken t24: i64,ch = load<(load (s8) from `ptr null`), anyext from i8> t0, Constant:i64<0>, undef:i64 t28: i64 = shl t24, Constant:i64<56> t29: i64 = sra t28, Constant:i64<56> t13: ch = CopyToReg t0, Register:i64 %0, t29 t23: ch = BPFISD::BR_CC t13, t29, Constant:i64<1>, Constant:i64<17>, BasicBlock:ch t17: ch = br t23, BasicBlock:ch ```

Note the transformation chain:

Initial selection DAG:
t3: i8,ch = load<(load (s8) from `ptr null`)> t0, Constant:i64<0>, undef:i64
...
t11: i64 = sign_extend t3
t13: ch = CopyToReg t0, Register:i64 %0, t11

Optimized lowered selection DAG:
  t18: i64,ch = load<(load (s8) from `ptr null`), sext from i8> t0, Constant:i64<0>, undef:i64
                                                  ^^^^
      t13: ch = CopyToReg t0, Register:i64 %0, t18

Legalized selection DAG:
      t24: i64,ch = load<(load (s8) from `ptr null`), anyext from i8> t0, Constant:i64<0>, undef:i64
                                                      ^^^^^^
    t28: i64 = shl t24, Constant:i64<56>
  t29: i64 = sra t28, Constant:i64<56>
      t13: ch = CopyToReg t0, Register:i64 %0, t29

Essentially, this matches expression (sign_extend (load ...)), replaces it with (load sext ...) and finally replaces that with (load anyext ...).

However, in presence of freeze the expression looks as (sign_extend (freeze (load ...))) and DAG optimization pattern (sign_extend (load ...)) -> (load sext ...) no longer applies. Hence, sign_extend operation remains in place and is lowered to machine instructions as &= 255. As discussed here, it is not correct to convert (sign_extend (freeze (load ...))) -> (freeze (load sext ...)) in a general case:

Freeze followed by sign-extend guarantees all the extended bits are equal. A sign-extending load followed by a freeze does not guarantee that.

On the other hand, such transformation is actually correct for BPF target, because jits do guarantee that sign extending loads would produce same extended bits.

So, two ways forward:

introduce BPF specific DAG optimization (sign_extend (freeze (load ...))) -> (freeze (load sext ...));
figure out why freeze is introduced when C example shared in the issue description is optimized.

Here is a minimized C repro:

$ cat t.c
int foo(char *buf) {
    char current_char;

    current_char = *buf;
    if (current_char == 13) {
      return 1;
    } else if (current_char >= 64 || current_char == 32 || current_char == 46) {
      return 2;
    }
    return 0;
}
$ clang -O2 --target=bpf -S test5.c -o - 
  ...
# %bb.0:                                # %entry
    w0 = 1
    w1 = *(u8 *)(r1 + 0)
    w2 = w1
    w2 &= 255 ; <---- not really necessary
  ...

I'll spend some time on (2).

yonghong-song commented 5 days ago

For the above comments:

Freeze followed by sign-extend guarantees all the extended bits are equal. 
. A sign-extending load followed by a freeze does not guarantee that. On the other hand, such transformation is actually correct for BPF target, because jits do guarantee . that sign extending loads would produce same extended bits.

I checked llvm asm reference at https://llvm.org/docs/LangRef.html#freeze-instruction IIUC, 'freeze' intends to stop/freeze undefined/poisoned values. So 'freeze (load nullptr)' will return a value generated by the compiler and 'load (freeze nullptr)' will be equals to 'load nullptr'.

In BPF program, do we really care 'freeze'? Maybe we simply ignore freeze during lowering? Based replace '%.fr = freeze i8 %0' with '%.fr = %0'?

It is still good to know why llvm generate 'freeze' in you above C example. AFAIK, in the above example, 'freeze' is introduced by SimplifyCFG.

llvm / llvm-project

BPF backend adding unnecessary bitmasks after loads #110159