Open aswaterman opened 2 years ago
The specification requires the memory access to be the "immediately subsequent instruction". This is hard/impossible to guarantee by an intrinsic that does not include the memory access operation. Therefore I would include the memory access in the API.
I see two possible solutions.
Proposal one: provide NTL loads and stores (similar to the atomics builtins):
type __riscv_ntl_load (type *ptr, enum ntl_domain domain);
void __riscv_ntl_store (type *ptr, type value, enum ntl_domain domain);
However, this will probably only work reasonably well for single-GPR/FPR-memory transfers. E.g. vector memory accesses probably need a _ntl variant of the vector load/store intrinsics.
Proposal two: no intrinsics, but a NTL function attribute that does not block inlining
// all memory accesses in this function emit an NTL hint before the memory access instruction
__attribute__((target("ntl_domain=DOMAIN")))
static inline void
my_read_buf (uint8_t* buf, size_t n_bytes, uint8_t *src)
{
__builtin_memcpy(buf, src, n_bytes);
}
I was also envisioning the intrinsic would emit the load or store in addition to the HINT. This matches how the non-temporal store intrinsics work on x86: the intrinsic actually performs the store, rather than annotating a separate assignment.
Proposal one: provide NTL loads and stores (similar to the atomics builtins):
That sounds good to me.
Proposal two: no intrinsics, but a NTL function attribute that does not block inlining
I don't like the function-attribute approach, but it inspires another possible solution: a variable attribute:
uint8_t* buf __attribute__ ((ntl_domain(DOMAIN)));
and any load or store through a pointer with this attribute will emit the hint instruction.
I like the pointer attribute approach, if it's feasible to implement.
And of course the x86-style intrinsic can be implemented using the pointer attribute approach with a simple wrapper function.
x86 has the MOVNTI instruction (part of SSE2) for non-temporal stores from a GPR, and MOVNTDQA for non-temporal vector loads.
I don't like the function-attribute approach, but it inspires another possible solution: a variable attribute:
uint8_t* buf __attribute__ ((ntl_domain(DOMAIN)));
and any load or store through a pointer with this attribute will emit the hint instruction.
Yes, that's a better idea than a function attribute!
I like the pointer attribute approach, if it's feasible to implement.
We need to gauge the implementation effort in both compilers for the variable attribute. Load/store in LLVM IR can already encode non-temporal, but we might need to extend that to be able to express the different domains, so I think we need to introduce new intrinsics for NTL load/store as a first stage.
https://llvm.org/docs/LangRef.html#load-instruction
<result> = load [volatile] <ty>, ptr <pointer>[, align <alignment>][, !nontemporal !<nontemp_node>][, !invariant.load !<empty_node>][, !invariant.group !<empty_node>][, !nonnull !<empty_node>][, !dereferenceable !<deref_bytes_node>][, !dereferenceable_or_null !<deref_bytes_node>][, !align !<align_node>][, !noundef !<empty_node>]
<result> = load atomic [volatile] <ty>, ptr <pointer> [syncscope("<target-scope>")] <ordering>, align <alignment> [, !invariant.group !<empty_node>]
!<nontemp_node> = !{ i32 1 }
!<empty_node> = !{}
!<deref_bytes_node> = !{ i64 <dereferenceable_bytes> }
!<align_node> = !{ i64 <value_alignment> }
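As a concrete instance of the syntax above, a non-temporal load today looks like the following; note that the metadata node can only hold the single value i32 1, which is why expressing multiple NTL domains would need either extended metadata or a new intrinsic:

```llvm
%v = load i64, ptr %p, align 8, !nontemporal !0

!0 = !{i32 1}
```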
In any case, it seems we have a path to some solution.
SiFive folks are implementing the builtin now.
Looks like we should also wire this up to the storent optab.
Here's the equivalent patterns for x86:
; Expand patterns for non-temporal stores. At the moment, only those
; that directly map to insns are defined; it would be possible to
; define patterns for other modes that would expand to several insns.
;; Modes handled by storent patterns.
(define_mode_iterator STORENT_MODE
[(DI "TARGET_SSE2 && TARGET_64BIT") (SI "TARGET_SSE2")
(SF "TARGET_SSE4A") (DF "TARGET_SSE4A")
(V8DI "TARGET_AVX512F") (V4DI "TARGET_AVX") (V2DI "TARGET_SSE2")
(V16SF "TARGET_AVX512F") (V8SF "TARGET_AVX") V4SF
(V8DF "TARGET_AVX512F") (V4DF "TARGET_AVX") (V2DF "TARGET_SSE2")])
(define_expand "storent<mode>"
[(set (match_operand:STORENT_MODE 0 "memory_operand")
(unspec:STORENT_MODE
[(match_operand:STORENT_MODE 1 "register_operand")]
UNSPEC_MOVNT))]
"TARGET_SSE")
Proposal for the intrinsic: https://github.com/riscv-non-isa/riscv-c-api-doc/pull/47
The Zihintntl extension has recently passed AR. The spec is here: https://github.com/riscv/riscv-isa-manual/blob/10eea63205f371ed649355f4cf7a80716335958f/src/zihintntl.tex
During the AR, we wanted to raise the issue of whether and how the extension would be exposed in the RISC-V C API. Can y'all ponder the following and opine?
In x86, for example, _mm_stream_pi (https://github.com/gcc-mirror/gcc/blob/e75da2ace6b6f634237259ef62cfb2d3d34adb10/gcc/config/i386/xmmintrin.h#L1279-L1291) is roughly equivalent to c.ntl.all; sd in RISC-V. (ARMv8 has LDNP/STNP instructions, but I couldn't find an intrinsics mapping for them.)
Zihintntl is more general than x86's solution in a few dimensions (x86 only has the equivalent of ntl.all, AFAIK).
With this in mind, the questions for the RISC-V C API folks are: how do we expose this facility in the RISC-V C API? How much of its generality do we expose? Do you foresee any impediments?
cc @kito-cheng @ptomsich