`frem` should tail-call `fmodf`

Eisenwave commented 3 months ago

float f(float x, float y) {
    return __builtin_fmodf(x, y);
}

With -O3, this emits

f(float, float):                                 # @f(float, float)
        jmp     fmodf@PLT                       # TAILCALL

define dso_local noundef float @f(float, float)(float noundef %x, float noundef %y) local_unnamed_addr {
entry:
  %call = tail call float @fmodf(float noundef %x, float noundef %y) #3
  ret float %call
}

With -O3 -ffast-math:

f(float, float):                                 # @f(float, float)
        push    rax
        call    fmodf@PLT
        pop     rax
        ret

define dso_local noundef nofpclass(nan inf) float @f(float, float)(float noundef nofpclass(nan inf) %x, float noundef nofpclass(nan inf) %y) local_unnamed_addr {
entry:
  %fmod = frem fast float %x, %y
  ret float %fmod
}

Is this intentional? Seems like a missed optimization. I suspect the problem is that the codegen for the frem instruction never results in a tail-call. The irony is that the codegen is just call fmodf, so the "naive" library call to fmodf is better than the one resulting from frem codegen.

The same code with __builtin_remainderf always tail-calls:

define dso_local noundef nofpclass(nan inf) float @f(float, float)(float noundef nofpclass(nan inf) %x, float noundef nofpclass(nan inf) %y) local_unnamed_addr {
entry:
  %call = tail call fast nofpclass(nan inf) float @remainderf(float noundef nofpclass(nan inf) %x, float noundef nofpclass(nan inf) %y) #3
  ret float %call
}

vfdff commented 3 months ago

the arm64 also has such issue https://godbolt.org/z/fsfYeeshx

llvmbot commented 3 months ago

@llvm/issue-subscribers-backend-x86

Author: Jan Schultke (Eisenwave)

https://godbolt.org/z/83cTWb7T6 ```cpp float f(float x, float y) { return __builtin_fmodf(x, y); } ``` With `-O3`, this emits ```asm f(float, float): # @f(float, float) jmp fmodf@PLT # TAILCALL ``` ```llvm define dso_local noundef float @f(float, float)(float noundef %x, float noundef %y) local_unnamed_addr { entry: %call = tail call float @fmodf(float noundef %x, float noundef %y) #3 ret float %call } ``` With `-O3 -ffast-math`: ```asm f(float, float): # @f(float, float) push rax call fmodf@PLT pop rax ret ``` ```llvm define dso_local noundef nofpclass(nan inf) float @f(float, float)(float noundef nofpclass(nan inf) %x, float noundef nofpclass(nan inf) %y) local_unnamed_addr { entry: %fmod = frem fast float %x, %y ret float %fmod } ``` Is this intentional? Seems like a missed optimization. I suspect the problem is that the codegen for the `frem` instruction never results in a tail-call. The irony is that the codegen is just `call fmodf`, so the "naive" library call to `fmodf` is better than the one resulting from `frem` codegen. The same code with `__builtin_remainderf` always tail-calls: ```llvm define dso_local noundef nofpclass(nan inf) float @f(float, float)(float noundef nofpclass(nan inf) %x, float noundef nofpclass(nan inf) %y) local_unnamed_addr { entry: %call = tail call fast nofpclass(nan inf) float @remainderf(float noundef nofpclass(nan inf) %x, float noundef nofpclass(nan inf) %y) #3 ret float %call } ```

llvmbot commented 3 months ago

@llvm/issue-subscribers-backend-aarch64

Author: Jan Schultke (Eisenwave)

https://godbolt.org/z/83cTWb7T6 ```cpp float f(float x, float y) { return __builtin_fmodf(x, y); } ``` With `-O3`, this emits ```asm f(float, float): # @f(float, float) jmp fmodf@PLT # TAILCALL ``` ```llvm define dso_local noundef float @f(float, float)(float noundef %x, float noundef %y) local_unnamed_addr { entry: %call = tail call float @fmodf(float noundef %x, float noundef %y) #3 ret float %call } ``` With `-O3 -ffast-math`: ```asm f(float, float): # @f(float, float) push rax call fmodf@PLT pop rax ret ``` ```llvm define dso_local noundef nofpclass(nan inf) float @f(float, float)(float noundef nofpclass(nan inf) %x, float noundef nofpclass(nan inf) %y) local_unnamed_addr { entry: %fmod = frem fast float %x, %y ret float %fmod } ``` Is this intentional? Seems like a missed optimization. I suspect the problem is that the codegen for the `frem` instruction never results in a tail-call. The irony is that the codegen is just `call fmodf`, so the "naive" library call to `fmodf` is better than the one resulting from `frem` codegen. The same code with `__builtin_remainderf` always tail-calls: ```llvm define dso_local noundef nofpclass(nan inf) float @f(float, float)(float noundef nofpclass(nan inf) %x, float noundef nofpclass(nan inf) %y) local_unnamed_addr { entry: %call = tail call fast nofpclass(nan inf) float @remainderf(float noundef nofpclass(nan inf) %x, float noundef nofpclass(nan inf) %y) #3 ret float %call } ```

jhuber6 commented 3 months ago

As far as I know, fmod can set errno on a domain error which means that transforming fmod to frem is illegal. -ffast-math suppresses errno which means we can use the frem instruction. (However this is confusing since I'd assume the presence of the builtin would imply ignoring errno as well.)

The frem instruction is then lowered directly by the X86 backend, which doesn't include any of the tail call optimizations we get from the LLVM-IR passes. See the pipeline in https://godbolt.org/z/vK6EEqc4E. Likely someone just needs to improve the lowering.

RKSimon commented 3 months ago

CC @arsenm

arsenm commented 3 months ago

We currently have 2 paths for emitting libcalls, for some reason. TargetLibraryInfo::makeLibCall looks like it's not attempting to make tail calls when the call is routed through FREM. The Legalizer in SelectionDAGLegalize::ExpandLibCall does

llvm / llvm-project

`frem` should tail-call `fmodf` #92515