llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

AMDGPU GlobalISel i64 multiply overuses v_mad_u64_u32 #100444

Open arsenm opened 3 months ago

arsenm commented 3 months ago

64-bit and wider multiply over-uses v_mad_u64_u32.

; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -o - %s | FileCheck -check-prefix=GISEL %s
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -o - %s | FileCheck -check-prefix=SDAG %s

; GISEL: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GISEL: v_mov_b32_e32 v4, v0
; GISEL: v_mov_b32_e32 v5, v1
; GISEL: v_mad_u64_u32 v[0:1], s[4:5], v4, v2, 0
; GISEL: v_mad_u64_u32 v[3:4], s[4:5], v4, v3, v[1:2]
; GISEL: v_mad_u64_u32 v[1:2], s[4:5], v5, v2, v[3:4]
; GISEL: s_setpc_b64 s[30:31]

; SDAG: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; SDAG: v_mul_lo_u32 v4, v1, v2
; SDAG: v_mul_lo_u32 v3, v0, v3
; SDAG: v_mad_u64_u32 v[0:1], s[4:5], v0, v2, 0
; SDAG: v_add3_u32 v1, v1, v3, v4
; SDAG: s_setpc_b64 s[30:31]
define i64 @mul64(i64 %a, i64 %b) {
  %mul = mul i64 %a, %b
  ret i64 %mul
}

There should only be one v_mad_u64_u32. The first one, with the 0 add input, is simply a mul_lo.

We have a custom lowering of multiply that tries to use v_mad_u64_u32, which is different from how the DAG handles this. The DAG relies on the default expansion and then reassembles the v_mad_u64_u32 later (and only custom lowers to select the scalar version when applicable). I was working on improving the default expansion in #97194, but that needs more work.
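For reference, here is a minimal C sketch (not from the issue, just an illustration; the name mul64_expanded is hypothetical) of the 32-bit decomposition the SDAG sequence above corresponds to: one full 32x32->64 multiply of the low halves (the single v_mad_u64_u32 with a 0 addend), two 32-bit low multiplies for the cross terms, and a three-way add into the high word.

```
#include <stdint.h>

/* Illustrative only: a 64-bit multiply split into 32-bit pieces, matching the
   SDAG sequence above (one v_mad_u64_u32, two v_mul_lo_u32, one v_add3_u32). */
static uint64_t mul64_expanded(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    /* Full 32x32->64 product of the low halves: the one v_mad_u64_u32 ..., 0 */
    uint64_t lo_full = (uint64_t)a_lo * b_lo;

    /* Cross terms only contribute their low 32 bits: two v_mul_lo_u32 */
    uint32_t cross0 = a_lo * b_hi;
    uint32_t cross1 = a_hi * b_lo;

    /* Fold both cross terms into the high word: v_add3_u32 */
    uint32_t hi = (uint32_t)(lo_full >> 32) + cross0 + cross1;

    return ((uint64_t)hi << 32) | (uint32_t)lo_full;
}
```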

llvmbot commented 3 months ago

@llvm/issue-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

jayfoad commented 3 months ago

Given that v_mad_u64_u32 is as fast as v_mul_lo_u32 (at least on GFX10+), it's not clear which sequence is better. The GISel sequence uses fewer instructions (ignoring the v_movs to get the inputs in the right place) but probably has longer latency, because each mad depends on the previous one.

arsenm commented 3 months ago

> Given that v_mad_u64_u32 is as fast as v_mul_lo_u32 (at least on GFX10+), it's not clear which sequence is better. The GISel sequence uses fewer instructions (ignoring the v_movs to get the inputs in the right place) but probably has longer latency, because each mad depends on the previous one.

It's also using wider registers and a bigger encoding

jayfoad commented 3 months ago

> and a bigger encoding

Huh? v_mul_lo_u32 is VOP3, just like v_mad_u64_u32.

arsenm commented 3 months ago

> > and a bigger encoding
>
> Huh? v_mul_lo_u32 is VOP3, just like v_mad_u64_u32.

Weird, it could/should be VOP2.