Open eaecbb65-864a-4d9b-ab01-204400607207 opened 6 years ago
Wow... so AVX with regcall/vectorcall is SERIOUSLY messed up on all 4 platforms that support them. AVX512 types are being sent/returned in registers on all 4, nothing is being properly checked, and constraints are hardcoded everywhere.
All 4 platforms are in desperate need of cleanup for the ABInfo classes.
I checked into this, and it seems that (at least on Linux-x86-64) the ABI limitation of 512 bits is being enforced in clang in lib/CodeGen/TargetInfo.cpp. I'm not sure if a similar problem is going to happen on the other 3 platforms (Win x86/Win64/Lin x86), but this issue will likely need to be fixed on those as well.
Additionally, the vector size will likely need to be checked against the AVX level. For example, the argument of __attribute__((regcall)) void f_512(Vector2_512 x); should likely be passed by pointer in AVX/AVX2 modes (where 512-bit registers aren't available), but in registers in AVX512F mode.
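To make the suggested check concrete, here is a rough sketch of the decision. It is not Clang's actual ABIInfo code: `AVXLevel`, `ArgPassing`, `maxVectorBits`, and `classifyVectorArg` are hypothetical names made up for illustration, and the "indirect" fallback for too-wide vectors is only the behavior proposed above, not something verified against ICC.

```cpp
// Hypothetical feature levels (not Clang's real enum), ordered by capability.
enum class AVXLevel { None, AVX, AVX2, AVX512F };

// Hypothetical result of classifying one vector argument.
enum class ArgPassing { InRegisters, Indirect };

// Widest vector register available at each level:
// XMM (128 bits) without AVX, YMM (256 bits) with AVX/AVX2, ZMM (512 bits) with AVX512F.
constexpr unsigned maxVectorBits(AVXLevel level) {
    return level == AVXLevel::AVX512F ? 512u
         : (level == AVXLevel::AVX || level == AVXLevel::AVX2) ? 256u
         : 128u;
}

// Assumption: a vector that does not fit the widest available register
// is passed indirectly (by pointer) instead of in registers.
constexpr ArgPassing classifyVectorArg(unsigned vectorBits, AVXLevel level) {
    return vectorBits <= maxVectorBits(level) ? ArgPassing::InRegisters
                                              : ArgPassing::Indirect;
}

// The __m512 members of Vector2_512 only qualify for registers with AVX512F.
static_assert(classifyVectorArg(512, AVXLevel::AVX512F) == ArgPassing::InRegisters,
              "512-bit vectors fit ZMM registers");
static_assert(classifyVectorArg(512, AVXLevel::AVX2) == ArgPassing::Indirect,
              "no 512-bit registers in AVX2 mode");
static_assert(classifyVectorArg(256, AVXLevel::AVX) == ArgPassing::InRegisters,
              "256-bit vectors fit YMM registers");
```

A real fix would also have to account for how many vector registers are still free for the call, which this sketch ignores.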
It is on my personal backlog, but I'll leave it as 'new' unless someone wants to grab it from me.
Extended Description
Dear LLVM/Clang developers,
the 'regcall' calling convention currently does not match the behavior of the Intel compiler when working with values that use AVX512-style ZMM registers. This is problematic for two reasons:
1. 'regcall' functions compiled with Clang and ICPC are not ABI-compatible.
2. The 'regcall' calling convention is used to improve performance in vectorized code involving function calls. It currently seems impossible to obtain this benefit when using Clang & AVX512.
How to reproduce:
Consider the following simple snippet, which has 256-bit and 512-bit versions of a function that just forwards its arguments.
/// ---------------------
#include <immintrin.h>

struct Vector2_256 { __m256 x[2]; };
struct Vector2_512 { __m512 x[2]; };

__attribute__((regcall)) void f_256(Vector2_256 x);
__attribute__((regcall)) void f_512(Vector2_512 x);

__attribute__((regcall)) void call_f_256(Vector2_256 x) { return f_256(x); }
__attribute__((regcall)) void call_f_512(Vector2_512 x) { return f_512(x); }
/// ---------------------
With Clang trunk, this compiles to
$ clang++ test.cpp -march=skx -S -o - -O3 -fomit-frame-pointer
(with minor cleanups)
_Z22__regcall3__call_f_25611Vector2_256:
    jmp _Z17__regcall3__f_25611Vector2_256

_Z22__regcall3__call_f_51211Vector2_512:
    pushq %rbp
    movq %rsp, %rbp
    pushq %rsp
    andq $-64, %rsp
    subq $192, %rsp
    vmovaps 16(%rbp), %zmm0
    vmovaps 80(%rbp), %zmm1
    vmovaps %zmm1, 64(%rsp)
    vmovaps %zmm0, (%rsp)
    vzeroupper
    callq _Z17__regcall3__f_51211Vector2_512
    leaq -8(%rbp), %rsp
    popq %rsp
    popq %rbp
    retq
In other words, the 256-bit version is correct, while the 512-bit version receives its arguments on the stack and copies them back onto the stack for the call, instead of keeping them in ZMM registers (which it shouldn't do).
On ICPC, I get
_Z22__regcall2__call_f_25611Vector2_256:
    jmp _Z17__regcall2__f_25611Vector2_256

_Z22__regcall2__call_f_51211Vector2_512:
    jmp _Z17__regcall2__f_51211Vector2_512