Open anmparal opened 4 years ago
The min_vector_width attribute is currently only used to indicate that the compiler should honor 512-bit intrinsics. There's a drop in CPU frequency on some CPUs when using those instructions so the default behavior is for the auto vectorizers to avoid them on those CPUs. The attribute disables this behavior.
Can you clarify why these loads being split is problematic beyond not matching the documentation? You used the word "atomic" in the bug title, but neither Intel nor AMD guaranteed atomic memory access for anything larger than 8 bytes except for cmpxchg16b.
assigned to @anmparal
Extended Description
Given the test:
```c
 1  #include <stdint.h>
 2  #include <emmintrin.h>
 3  #include
 4
 5  uint32_t read_128b(__m128i *ptr)
 6  {
 7      __m128i val = _mm_load_si128(ptr);
 8      return ((uint32_t *) &val)[0] |
 9             ((uint32_t *) &val)[1] |
10             ((uint32_t *) &val)[2] |
11             ((uint32_t *) &val)[3];
12  }
```
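For reference, a self-contained version of the reported test case that compiles and runs (assuming an x86-64 target with SSE2; the `lanes` helper variable and the constants are mine, not part of the original report):

```c
#include <stdint.h>
#include <emmintrin.h>

/* Same reduction as the reported test case: perform one 128-bit aligned
   load, then OR the four 32-bit lanes together. */
static uint32_t read_128b(const __m128i *ptr)
{
    __m128i val = _mm_load_si128(ptr);          /* documented as movdqa */
    const uint32_t *lanes = (const uint32_t *)&val;
    return lanes[0] | lanes[1] | lanes[2] | lanes[3];
}
```

A `__m128i` local or argument is 16-byte aligned per the x86-64 psABI, so passing its address to `read_128b` keeps the aligned load well-defined.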
With clang version 12.0.0 (https://github.com/llvm/llvm-project.git 4eef14f9780d9fc9a88096a3cabd669bcfa02bbc, 09/04/2020), the `_mm_load_si128()` at `-O2 -msse2` is translated into two 64-bit scalar loads rather than a single 128-bit load.
This is not in accordance with Ref. [0], which specifies:

> **Synopsis:** `__m128i _mm_load_si128 (__m128i const* mem_addr)`
> `#include <emmintrin.h>`
> **Instruction:** `movdqa xmm, m128`
> **CPUID Flags:** SSE2
(Note: gcc 10.1.0 and icc 16.0.5.027b both generate a `movdqa` as expected.)

The accesses at lines 8 through 11 cause the problematic 64-bit loads; modifying the code (see the `<<<` markers) so that:
```c
 1  #include <stdint.h>
 2  #include <emmintrin.h>
 3  #include
 4
 5  uint32_t read_128b(__m128i *ptr, uint8_t index)    <<<
 6  {
 7      __m128i val = _mm_load_si128(ptr);
 8      return ((uint32_t *) &val)[index];             <<<
 9  }
```
we see that the `_mm_load_si128()` is translated to `movaps (%rdi), %xmm0`, as expected. (Note: per Ref. [1], `movaps` and `movdqa` are interchangeable.)
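A runnable sketch of this variable-index variant (the `read_lane` name and the index mask are mine): because `index` is not a compile-time constant, InstCombine has no lane-wise folds to apply, and the 128-bit load survives, matching the `movaps` observed above.

```c
#include <stdint.h>
#include <emmintrin.h>

/* Variable lane index: the 128-bit load stays whole. */
static uint32_t read_lane(const __m128i *ptr, uint8_t index)
{
    __m128i val = _mm_load_si128(ptr);
    return ((const uint32_t *)&val)[index & 3];  /* mask keeps the access in bounds */
}
```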
The `_mm_load_si128()` builtin is defined in `clang/lib/Headers/emmintrin.h` with the attribute `min_vector_width(128)`:

```c
#define __DEFAULT_FN_ATTRS                                                  \
  __attribute__((__always_inline__, __nodebug__, __target__("sse2"),       \
                 __min_vector_width__(128)))
...
/// This intrinsic corresponds to the <c> VMOVDQA / MOVDQA </c> instruction.
...
static __inline__ __m128i __DEFAULT_FN_ATTRS
_mm_load_si128(__m128i const *__p)
{
  return *__p;
}
```
Per Ref. [2], "This attribute may be attached to a function and informs the backend that this function desires vectors of at least this width to be generated. ... This attribute is meant to be a hint to control target heuristics that may generate narrower vectors than what the target hardware supports." So, it is reasonable to expect that the vector pointed to by '__p' is always treated in its 128-bit entirety.
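To make the quoted semantics concrete, here is a sketch of applying the attribute in user code (the guard, macro name, and `load_whole` function are mine; `min_vector_width` is a Clang extension). Note that, per the quote, the attribute only steers target heuristics: nothing in it forbids a later IR pass from splitting the load, which is exactly the behavior reported here.

```c
#include <stdint.h>
#include <emmintrin.h>

#if defined(__clang__)
#define MIN_VW_128 __attribute__((min_vector_width(128)))
#else
#define MIN_VW_128 /* Clang-only attribute; omitted on other compilers */
#endif

/* Mirrors the shape of the intrinsic's definition in emmintrin.h. */
static inline __m128i MIN_VW_128 load_whole(__m128i const *p)
{
    return *p;
}
```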
`_mm_load_si128()` is converted to the following optimal LLVM IR:

```llvm
; Function Attrs: alwaysinline norecurse nounwind readonly uwtable
define internal fastcc <2 x i64> @_mm_load_si128(<2 x i64>* nocapture readonly %__p) unnamed_addr #2 {
entry:
  %0 = load <2 x i64>, <2 x i64>* %__p, align 16, !tbaa !2
  ret <2 x i64> %0
}
```
The Function Integration/Inlining pass inlines this `_mm_load_si128()` body into `read_128b()`:

```llvm
%0 = load <2 x i64>, <2 x i64>* %ptr, align 16, !tbaa !2
```
However (owing to the 32-bit accesses in the subsequent `|`-expression), the Combine Redundant Instructions pass converts this load to:

```llvm
%1 = load i128, i128* %0, align 16, !tbaa !2
```

which the X86 DAG->DAG Instruction Selection pass converts to:

```
%1:gr64 = MOV64rm %0:gr64, 1, $noreg, 0, $noreg :: (load 8 from %ir.0, align 16, !tbaa !2)
%2:gr64 = MOV64rm %0:gr64, 1, $noreg, 8, $noreg :: (load 8 from %ir.0 + 8, align 16, !tbaa !2)
```

the problematic 64-bit loads.
Per Ref. [3]/Rationale: "Platforms may rely on volatile loads and stores of natively supported data width to be executed as single instruction. For example, in C this holds for an l-value of volatile primitive type with native hardware support, but not necessarily for aggregate types. The frontend upholds these expectations, which are intentionally unspecified in the IR. The rules above ensure that IR transformations do not violate the frontend’s contract with the language."
Thus, the LLVM IR generated for loads and stores in a function carrying `__attribute__((min_vector_width(width)))` that operate on vectors `width` bits wide should satisfy two properties:

a. be at least `width` bits wide;
b. be marked `volatile` (to prevent any subsequent pass from splitting them up).

Assuming that property (a) is correctly maintained by the front end, the problem reduces to ensuring that property (b) holds.
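At the source level, property (b) can be mimicked by loading through a volatile-qualified pointer; per the LangRef rationale quoted below, a volatile load may not be split. This is only a user-side workaround sketch (the function name is mine), not the front-end fix under discussion:

```c
#include <stdint.h>
#include <emmintrin.h>

/* Volatile 128-bit load: InstCombine must leave it whole, so it lowers
   to a single movdqa/movaps rather than two 64-bit scalar loads. */
static uint32_t read_128b_volatile(const __m128i *ptr)
{
    __m128i val = *(__m128i const volatile *)ptr;
    const uint32_t *lanes = (const uint32_t *)&val;
    return lanes[0] | lanes[1] | lanes[2] | lanes[3];
}
```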
Hand-modifying the generated LLVM IR:

```diff
  define internal <2 x i64> @_mm_load_si128(<2 x i64>* %__p) #2 {
  entry:
-   %0 = load <2 x i64>, <2 x i64>* %__p, align 16
+   %0 = load volatile <2 x i64>, <2 x i64>* %__p, align 16
    ret <2 x i64> %0
  }
```

we see that the `load volatile <2 x i64>` does get converted to a `movdqa`, as expected.
PS: the same issue is also seen with `__m256i`, `__m512i`, and with `_mm_store_si128()`, etc.
I need your input on which of the following directions to take to fix this issue:

1. Marking the loads/stores in the intrinsic as `volatile` during LLVM IR generation.
2. (Under an option) prohibiting the Combine Redundant Instructions pass from rewriting the vector load `load <2 x i64>` into `load i128`.
3. Making X86 DAG->DAG Instruction Selection generate a `VMOVDQArm` instead of two `MOV64rm`s for a `load i128` on SSE targets.
References: