Open nirvedhmeshram opened 2 weeks ago
Based on some initial investigation, the issue comes down to the truncation instructions we generate, see here.
taf does the truncation using
v_pk_mul_f32 v[0:1], v[20:21], v[0:1]
s_nop 0
v_cvt_f16_f32_e32 v12, v1
v_cvt_f16_f32_e32 v13, v0
because the input from MLIR consists of vector LLVM ops, whereas the other two, which produce scalar LLVM ops, use
v_fma_mixlo_f16 v8, v8, v9, 0
Based on the numerics, the taf output matches the ground truth. In the other two, a first use of v_fma_mixlo_f16 also seems to match the ground truth, but when a second v_fma_mixlo_f16 is used with the same destination register we observe the numeric issue.
I can confirm that the issue is with the zeroing semantics of v_fma_mixlo_f16, which is basically acknowledged here: https://github.com/llvm/llvm-project/blob/ac0f64f06d67a93817ccd9a3c529ad40920115c9/llvm/lib/Target/AMDGPU/SIInstructions.td#L2835-L2843
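To make the register-level difference concrete, here is a minimal Python sketch of a 32-bit VGPR under two assumptions that are mine, not taken from the linked LLVM source: that v_fma_mixlo_f16 writes only the low 16 bits of its destination and leaves the high 16 bits untouched, and that v_cvt_f16_f32 leaves the high 16 bits zero on this target. The helper names are hypothetical and rounding details of the real instructions are not modeled.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE-754 binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Interpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def f16_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE-754 binary16."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def mul_then_cvt(reg_a: int, reg_b: int) -> int:
    """Model of v_mul_f32 followed by v_cvt_f16_f32.

    Assumption: the conversion leaves the upper 16 bits of the result zero.
    """
    prod = bits_f32(f32_bits(bits_f32(reg_a) * bits_f32(reg_b)))
    return f16_bits(prod)


def fma_mixlo(dst: int, reg_a: int, reg_b: int) -> int:
    """Model of v_fma_mixlo_f16 dst, a, b, 0.

    Assumption: only the low 16 bits of dst are written; the high 16 bits
    keep whatever dst held before the instruction.
    """
    prod = bits_f32(reg_a) * bits_f32(reg_b)
    return (dst & 0xFFFF0000) | f16_bits(prod)


# v8 doubles as source and destination, as in `v_fma_mixlo_f16 v8, v8, v9, 0`.
v8 = f32_bits(1.5)
v9 = f32_bits(2.0)

print(f"{mul_then_cvt(v8, v9):#010x}")   # 0x00004200: f16(3.0), upper half zero
print(f"{fma_mixlo(v8, v8, v9):#010x}")  # 0x3fc04200: f16(3.0), stale f32 upper half
```

In this model the low halves agree, so any consumer that reads only 16 bits sees the right value. A consumer that assumes the upper half is zero would read stale bits from the previous contents of the destination register.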
Since this behavior is not stable, we can disable the use of mixed-precision FMA instructions in IREE with
features += "-fma-mix-insts";
This way we get the following instructions instead
v_mul_f32_e32 v8, v8, v9
v_cvt_f16_f32_e32 v8, v8
which is correct for MI300.
Ideally the backend would generate an extra zeroing instruction after v_fma_mixlo_f16 v8, v8, v9, 0, but I don't think it's much of a performance hit to just use v_mul_f32_e32 v8, v8, v9 instead.
For this elementwise + pad dispatch, there were numeric issues between all three of these pipelines/lowering configs. We will refer to these as vec, taf, and taf_single, respectively, going forward in the issue and the provided artifacts.
This gist provides annotated IRs (with lowering/translation info) that you can then compile with the following commands to get vmfbs for the numeric comparisons
Next, you can generate your own input files and the corresponding ground truth with these Python scripts
Next, you can use these run commands
The splat inputs show no issues; you can use this comparison script to see the issues with non-splat inputs
You can see the numeric discrepancies I saw here
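For readers without access to the linked comparison script, the following is a minimal sketch of what such an elementwise check could look like. It is not the script referenced above: the function name `compare` and the tolerance defaults are hypothetical, and it works on plain Python sequences rather than on the actual output files.

```python
import math


def compare(actual, expected, atol=1e-3, rtol=1e-3):
    """Return (index, actual, expected) for each element outside tolerance.

    The tolerances are illustrative defaults, not values from the issue.
    """
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    mismatches = []
    for i, (a, e) in enumerate(zip(actual, expected)):
        if math.isnan(a) or math.isnan(e):
            # A NaN only matches another NaN.
            if not (math.isnan(a) and math.isnan(e)):
                mismatches.append((i, a, e))
            continue
        if abs(a - e) > atol + rtol * abs(e):
            mismatches.append((i, a, e))
    return mismatches


# Example with made-up values: only index 2 is outside tolerance.
print(compare([1.0, 2.0, 3.0], [1.0, 2.0, 3.5]))
```

Reporting the mismatching indices, rather than a single pass/fail flag, makes it easier to see whether discrepancies cluster in particular elements.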