riscv-non-isa / riscv-c-api-doc

Documentation of the RISC-V C API
https://lf-riscv.atlassian.net/browse/RVG-4
Creative Commons Attribution 4.0 International

Intrinsics support for Zihintntl extension #30

Open aswaterman opened 2 years ago

aswaterman commented 2 years ago

The Zihintntl extension has recently passed AR. The spec is here: https://github.com/riscv/riscv-isa-manual/blob/10eea63205f371ed649355f4cf7a80716335958f/src/zihintntl.tex

During the AR, we wanted to raise the issue of whether and how the extension would be exposed in the RISC-V C API. Can y'all ponder the following and opine?

In x86, for example, _mm_stream_pi (https://github.com/gcc-mirror/gcc/blob/e75da2ace6b6f634237259ef62cfb2d3d34adb10/gcc/config/i386/xmmintrin.h#L1279-L1291) is roughly equivalent to c.ntl.all; sd in RISC-V.

(ARMv8 has LDNP/STNP instructions, but I couldn't find an intrinsics mapping for them.)

Zihintntl is more general than x86's solution in a few dimensions:

With this in mind, the questions for the RISC-V C API folks are: how do we expose this facility in the RISC-V C API? How much of its generality do we expose? Do you foresee any impediments?

cc @kito-cheng @ptomsich

cmuellner commented 2 years ago

The specification requires the memory access to be the "immediately subsequent instruction". This is hard/impossible to guarantee by an intrinsic that does not include the memory access operation. Therefore I would include the memory access in the API.

I see two possible solutions.

Proposal one: provide NTL loads and stores (similar to the atomics builtins):

type __riscv_ntl_load (type *ptr, enum ntl_domain domain);
void __riscv_ntl_store (type *ptr, type val, enum ntl_domain domain);

However, this will probably only work reasonably well for single-GPR/FPR-memory transfers. E.g. vector memory accesses probably need a _ntl variant of the vector load/store intrinsics.
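To illustrate why the memory access belongs inside the intrinsic, here is a minimal sketch of one fixed-width instance of proposal one. The names `sketch_ntl_store_u64` and `sketch_ntl_load_u64` are hypothetical and not part of any header. The sketch assumes the `__riscv_zihintntl` macro is defined when the extension is enabled, hard-codes the `ntl.all` hint instead of taking a domain argument, and falls back to plain accesses on other targets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not an existing API. */
static inline void sketch_ntl_store_u64(uint64_t *ptr, uint64_t val)
{
    assert(ptr != 0);
#if defined(__riscv_zihintntl) && defined(__riscv_xlen) && (__riscv_xlen == 64)
    /* Hint and store live in one asm statement, so the compiler cannot
       schedule another instruction between them. */
    __asm__ volatile("ntl.all\n\tsd %1, 0(%0)"
                     : /* no outputs */
                     : "r"(ptr), "r"(val)
                     : "memory");
#else
    *ptr = val; /* fallback: ordinary store, no hint */
#endif
}

static inline uint64_t sketch_ntl_load_u64(uint64_t *ptr)
{
    uint64_t val;
    assert(ptr != 0);
#if defined(__riscv_zihintntl) && defined(__riscv_xlen) && (__riscv_xlen == 64)
    /* Same idea for loads: the hint immediately precedes the access. */
    __asm__ volatile("ntl.all\n\tld %0, 0(%1)"
                     : "=r"(val)
                     : "r"(ptr)
                     : "memory");
#else
    val = *ptr; /* fallback: ordinary load, no hint */
#endif
    return val;
}
```

A real builtin would presumably be type-generic and let the compiler pick the addressing mode; the inline assembly here only demonstrates the adjacency requirement.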

Proposal two: no intrinsics, but an NTL function attribute that does not block inlining:

// all memory accesses in this function emit an NTL hint before the memory access instruction
__attribute__((target("ntl_domain=DOMAIN")))
static inline void
my_read_buf (uint8_t* buf, size_t n_bytes, uint8_t *src)
{
    __builtin_memcpy(buf, src, n_bytes);
}

aswaterman commented 2 years ago

I was also envisioning the intrinsic would emit the load or store in addition to the HINT. This matches how the non-temporal store intrinsics work on x86: the intrinsic actually performs the store, rather than annotating a separate assignment.

kito-cheng commented 2 years ago

Proposal one: provide NTL loads and stores (similar to the atomics builtins):

That sounds good to me.

Proposal two: no intrinsics, but an NTL function attribute that does not block inlining

I don't like the function attribute approach, but it inspired another possible solution, a variable attribute:

uint8_t* buf __attribute__ ((ntl_domain=DOMAIN));

Any load or store through a pointer with this attribute would then be preceded by a hint instruction.

aswaterman commented 2 years ago

I like the pointer attribute approach, if it's feasible to implement.

And of course the x86-style intrinsic can be implemented using the pointer attribute approach with a simple wrapper function.
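As a rough sketch of that wrapper idea: the attribute exists only as a proposal in this thread, so the hypothetical macro `SKETCH_NTL_PTR` below expands to nothing and the function compiles to an ordinary store. The function name `sketch_stream_u32` and the `ALL` domain token are likewise made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: placeholder for the proposed pointer attribute. No compiler is
   known to implement it, so it expands to nothing here. */
#define SKETCH_NTL_PTR(domain)

/* x86-style "streaming store" wrapper built on the (hypothetical) attribute. */
static inline void sketch_stream_u32(uint32_t *dst, uint32_t val)
{
    /* With the proposed attribute, each access through `hinted` would be
       preceded by an NTL hint chosen by the domain argument. */
    uint32_t *hinted SKETCH_NTL_PTR(ALL) = dst;
    assert(dst != 0);
    *hinted = val;
}
```

The design point is that callers see a function that performs the store, as with the x86 intrinsics, while the compiler only needs to understand the attribute.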

topperc commented 2 years ago

x86 has the MOVNTI instruction for non-temporal stores from a GPR, as part of SSE2. x86 also has MOVNTDQA for non-temporal vector loads.
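For reference, MOVNTI is reachable from C through the SSE2 intrinsic `_mm_stream_si32`. The sketch below wraps it in a helper whose name, `stream_store_i32`, is made up for this example; on targets without SSE2 it falls back to a plain store so that it still compiles.

```c
#include <assert.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_stream_si32; also pulls in _mm_sfence */
#endif

/* Illustrative helper: non-temporal 32-bit store on x86, plain store elsewhere. */
static inline void stream_store_i32(int *dst, int val)
{
    assert(dst != 0);
#if defined(__SSE2__)
    _mm_stream_si32(dst, val); /* the intrinsic itself performs the store (MOVNTI) */
    _mm_sfence();              /* non-temporal stores are weakly ordered */
#else
    *dst = val;
#endif
}
```

As with the proposals in this thread, the intrinsic performs the store itself instead of annotating a separate assignment.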

cmuellner commented 2 years ago

I don't like the function attribute approach, but it inspired another possible solution, a variable attribute:

uint8_t* buf __attribute__ ((ntl_domain=DOMAIN));

Any load or store through a pointer with this attribute would then be preceded by a hint instruction.

Yes, that's a better idea than a function attribute!

kito-cheng commented 2 years ago

I like the pointer attribute approach, if it's feasible to implement.

We need to check the implementation effort for a variable attribute in both compilers. I saw that load/store in LLVM IR can already encode non-temporal accesses, but we might need to extend that to express the different domains, so I think we should introduce new intrinsics for NTL load/store as a first stage.

https://llvm.org/docs/LangRef.html#load-instruction

<result> = load [volatile] <ty>, ptr <pointer>[, align <alignment>][, !nontemporal !<nontemp_node>][, !invariant.load !<empty_node>][, !invariant.group !<empty_node>][, !nonnull !<empty_node>][, !dereferenceable !<deref_bytes_node>][, !dereferenceable_or_null !<deref_bytes_node>][, !align !<align_node>][, !noundef !<empty_node>]
<result> = load atomic [volatile] <ty>, ptr <pointer> [syncscope("<target-scope>")] <ordering>, align <alignment> [, !invariant.group !<empty_node>]
!<nontemp_node> = !{ i32 1 }
!<empty_node> = !{}
!<deref_bytes_node> = !{ i64 <dereferenceable_bytes> }
!<align_node> = !{ i64 <value_alignment> }
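For context, Clang already exposes this metadata at the C level through `__builtin_nontemporal_store` and `__builtin_nontemporal_load`. The sketch below uses them when available and falls back to plain accesses otherwise; the wrapper names `nt_store_int` and `nt_load_int` are made up for this example. Note that neither builtin takes a domain argument, which matches the limitation described above.

```c
#include <assert.h>

/* Compilers without __has_builtin are treated as lacking the builtins. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

static inline void nt_store_int(int *dst, int val)
{
    assert(dst != 0);
#if __has_builtin(__builtin_nontemporal_store)
    __builtin_nontemporal_store(val, dst); /* store tagged with !nontemporal */
#else
    *dst = val;                            /* fallback: ordinary store */
#endif
}

static inline int nt_load_int(int *src)
{
    assert(src != 0);
#if __has_builtin(__builtin_nontemporal_load)
    return __builtin_nontemporal_load(src); /* load tagged with !nontemporal */
#else
    return *src;                            /* fallback: ordinary load */
#endif
}
```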
aswaterman commented 2 years ago

In any case, it seems we have a path to some solution.

kito-cheng commented 2 years ago

SiFive folks are implementing the builtin now.

ptomsich commented 1 year ago

Looks like we should also wire this up to the storent-optab.

Here are the equivalent patterns for x86:

; Expand patterns for non-temporal stores.  At the moment, only those
; that directly map to insns are defined; it would be possible to
; define patterns for other modes that would expand to several insns.

;; Modes handled by storent patterns.
(define_mode_iterator STORENT_MODE
  [(DI "TARGET_SSE2 && TARGET_64BIT") (SI "TARGET_SSE2")
   (SF "TARGET_SSE4A") (DF "TARGET_SSE4A")
   (V8DI "TARGET_AVX512F") (V4DI "TARGET_AVX") (V2DI "TARGET_SSE2")
   (V16SF "TARGET_AVX512F") (V8SF "TARGET_AVX") V4SF
   (V8DF "TARGET_AVX512F") (V4DF "TARGET_AVX") (V2DF "TARGET_SSE2")])

(define_expand "storent<mode>"
  [(set (match_operand:STORENT_MODE 0 "memory_operand")
        (unspec:STORENT_MODE
          [(match_operand:STORENT_MODE 1 "register_operand")]
          UNSPEC_MOVNT))]
  "TARGET_SSE")

kito-cheng commented 1 year ago

Proposal for the intrinsic: https://github.com/riscv-non-isa/riscv-c-api-doc/pull/47