Any way to specify how the registers are store internally?

DenisYaroshevskiy commented 1 year ago

The proposal seems to be geared towards a seamless interraction between intrinsics and std::simd, so that you can fall back to intrinsics when the standard does not provide the tools you want.

This is awesome and I full heartedly support it.

However it is no way specified how exactly the values are represented, specificall non standard sizes. Can this be done as a note? Would be nice if this intrinsic code was portable between compilers, even if not in strictly standard way then at least in practice.

What can be doe here?

FYI: in eve,

less than smallest native (16 bytes x86, 8 bytes arm neon, vls sve - native size, ppc - 16). If you store the register as array, the first elements in that array correspond to first elements in the register.
= native - exactly native with the same order as in previous point
bigger than native - we do recursive data structure: of 2 equal halfs

Note1 - we do not support arbitrary sizes only powers of 2 Note2 - If I'm not mistaken our ppc tests are for a big endian infrastructure. I believe still works the same. Sorry - very rarely touch ppc, I can find out if helpful.

danieltowner commented 1 year ago

I don't think we really want to expose exactly how the registers are stored since that could easily vary from one implementation to another. In the Intel implementation we build on top of the compiler's own vector extensions (e.g., https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html), so we just hand off all the storage to gcc or clang without caring how it handles it. GCC requires power-of-2, but clang is happy to use any size by simply using a multiple of full-sized registers, plus some remainder.

Maybe we should think about this the other way around. Rather than exposing the internal storage so that the more unusual intrinsics may be used on it, we should instead allow small intrinsic building blocks to be given to std::simd to be applied to the internal storage. This is how Intel's implementation works and it allows us to create new operations from intrinsics very easily. @mattkretz IIRC you have proposed something similar, if not identical, to what I set out below too, but I can't find the reference anywhere, so my apologies if I have restated something you have already said.

This has turned into something bigger than I originally set out to say, but I think it is useful to say it anyway.

Firstly, std::simd already has ways to convert to and from the implementation types for calling intrinsics. From https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p1928r4.pdf:

constexpr explicit operator implementation-defined() const; constexpr explicit simd(const implementation-defined& init);

using V = simd<float, simd_abi::__simd128>;
V addsub(V a, V b) {
  return static_cast<V>(_vec4f_addsub(static_cast<__vec4f>(a), static_cast<__vec4f>(b)));
}

For simd values which fit a native register, this allows intrinsics to be called directly. I actually take this a little further and provide named functions:

register-type simd::to_register() const; explicit simd::from_register(register-type)

I do this so that it makes it very clear what the intent of the code is, and also to ensure that those operators can statically assert that the data fits a native register. In all cases - my version and std::simd - the functions are always full native registers which may be partially filled. What happens in the extra elements that aren't used is undefined. My version always uses the smallest possible register too, so fixed_size_simd<float, 3>::to_register would used __m128.

Converting to and from a register works for small simd values, but simd values which span multiple native registers need a different approach. Here is an example of how we do this, starting with a lambda which handles one register-sized piece:

auto do_native_addsub = [](auto lhs, auto rhs, auto unused_idx) {
   auto t = _mm256_addsub_ps(lhs.to_register(), rhs.to_register());
   return decltype(lhs)::from_register(t);
};

and we then call it like this:

fixed_size_simd<float, 20> lhs, rhs;
...
auto r = apply_to_pieces(do_native_addsub, lhs, rhs);

The apply_to_pieces function breaks down its arguments lhs and rhs into pieces of native size, and then calls do_native_addsub on each set of respective pieces. The results of all the individual calls are then glued back together into the end result. The output will have the same size as the inputs, and the same type as the elements returned by the lambda. Some nuances of this:

If an argument to apply_to_pieces is a simd/simd_mask, it gets broken down into pieces.
If an argument is something other than a simd/simd_mask then every call to the invocable gets a copy of that thing (e.g., you could pass in a single pointer value which is passed through unaltered).
The invocable accepts an optional constant integral index argument which is used to determine the position of the piece in the original. This is the index of the first element in each piece (e.g., given a simd of size 20, and a native size of 8, then the indexes would be 0, 8, 16).
Each simd/simd_mask argument must have the same number of elements.
The invocable will normally be called with a simd/simd_mask which is the size of each piece, except for the remainder piece which will be different. The invocable thus knows how big the remainder is so that it can avoid side-effects from applying an intrinsic to undefined data.
Normally apply_to_pieces will break the problem down into pieces which are the native size of the first simd argument, but this can be overriden: apply_to_pieces<4>(do_something, values) works in pieces of 4 elements. This is useful as it allows the user to break the problem down into whatever size is most useful.
If the invocable returns a simd/simd_mask, all the results are glued together. If the invocable returns nothing, then apply_to_pieces returns nothing. I haven't thought about what happens in the invocable returns something else.
I haven't thought about whether it makes sense to allow an invocable such as a local lambda, to have state, and to expect to be called in a particular order. That could be useful sometimes, but not sure if it is useful enough to make it a requirement.
The invocable can be anything that can be called: lambda, overloaded templated function, etc.

My example above is therefore something equivalent to doing this:

auto r0 = do_native_addsub(extract<0, 8>(lhs), extract<0, 8>(rhs), 0);
auto r1 = do_native_addsub(extract<8, 16>(lhs), extract<8, 16>(rhs), 8);
auto r2 = do_native_addsub(extract<16, 20>(lhs), extract<16, 20>(rhs), 16);
auto result = concat(r0, r1, r2);

and the generated code might look like:

Originally I made the invocable have implementation-defined parameters (e.g., __m512i) but these lack information about what exactly is being passed in (i.e., 8 x uint64, or 64 x uint8_t), and whether it is a full or partial register, so I went with breaking it down into simd pieces instead. Users have found that this mechanism is useful in breaking other problems down into smaller pieces which can be processed separately, not just for calling intrinsics.

Thoughts and comments welcome.

DenisYaroshevskiy commented 1 year ago

There is a lot. Let's do in pieces.

If it's of a natively supported size: 16, 32, 64 on x86, I should be able to bist_cast to and from safely, correct?

if x is 16 chars

std::bit_cast<__m128i>(x);

is free and gives me a correct result, right?

And the element number 0 stays consistent

danieltowner commented 1 year ago

std::bit_cast<__m128i>(x); is free and gives me a correct result, right?

Agreed.

DenisYaroshevskiy commented 1 year ago

I'm thinking, I 100% will need to cast to native intrinsics to do things. If specifying what internals is something we don't want, can we specify a cast function?

simd_cast_to_native<intrinsic_register_type>(x);

And encourage the implementations to provide:

simd_cast_to_native<m128(i)>, simd_cast_to_native<__m256(i)>, simd_cast_to_native<m512(i)>

and similar to arm neon?

The implementation is encouraged to provide these if sizeof(element_type) * number_of_elements <= sizeof(intrinsic_register_type). The elements in a resulting register should be in the order as if you did a load of a native register.

Otherwise, I know I will be writing these functions with a bunch of platform dependent macros.

danieltowner commented 1 year ago

As I noted in my previous response, P1928 already provides a conversion operator and a constructor for working with implementation defined data. Also, I think an extension which provides to_register and from_register is useful for explicitly working with the compiler's vector types. Leaving aside naming, does that combination of functions satisfy your needs, or do you want something different?

DenisYaroshevskiy commented 1 year ago

As I noted in my previous response, P1928 already provides a conversion operator and a constructor for working with implementation defined data

It does but those do not tell me what they are at all. Which means in order to actually use it, I'd have to have a function that I imagine looks like

template <std::integral T, std::size_t N>
   requires (sizeof(T) * N <= 16)
__m128i cast_to(std::simd<T, std::simd_abi::fixed_ size<N>>, as<__m128i>) {
#if defined(__MSVC__) && !defined(AVX)
// convert msvc implementation
#elif defined(__GCC__)
// convert the gcc implementation
#endif
}

The implementation knows how to do this better than I do - I would like them to do it.

mattkretz commented 1 year ago

Hi, sorry for being late to the discussion. I admit I have not read everything in depth yet.

It does but those do not tell me what they are at all.

The standard cannot tell you. What are you asking for then? A member type? How would that help? And how would you do it for multiple types (e.g. simd<char, "SSE"> supports conversion to/from __m128i and [[gnu::vector_size(16)]] char)?

The implementation knows how to do this better than I do

It still doesn't necessarily know what you want. Again, the integer SSE types: the intrinsic type or the GCC vector type?

But in general I'm currently tending rather towards removing than adding stuff in this area. That's because std::simd and std::simd_mask will be unconditionally trivially copyable. So you can, independent of implementation-defined features, do:

simd<float> addsub(simd<float> lhs, simd<float> rhs) 
{
  if constexpr (sizeof(lhs) == sizeof(__m128))
    return std::bit_cast<simd<float>>(_mm_addsub_ps(std::bit_cast<__m128>(lhs),
                                                    std::bit_cast<__m128>(rhs)));
  else if constexpr (sizeof(lhs) == sizeof(__m256))
    return std::bit_cast<simd<float>>(_mm256_addsub_ps(std::bit_cast<__m256>(lhs),
                                                       std::bit_cast<__m256>(rhs)));
  // ...
}

What might be interesting though, is to provide a better "split (or up-size) to register-sized chunks" function. See P1928R5 Section 5.5 for a start.

mattkretz commented 1 year ago

Oh, and another point I wanted to add to your topic question:

The ABI tag specifies "how the registers are stored internally". The implementation can (and really should) add implementation-defined ABI tags. This implies the implementation has to document them. E.g. GCC/libstdc++ documents (well I should make it so, I suspect) simd_abi::_VecBuiltin<16> to store a simd as [[gnu::vector_size(16)]] internally, using the corresponding integer vector type for simd_mask.

danieltowner commented 1 year ago

Although there are potentially several types which can be used to define a register, I think std::simd itself should specify one register type which it thinks is the most appropriate type to use to call an intrinsic. In my implementation I currently do this as a member type in simd itself:

template<typename T, typename ABI>
class simd {
  // Could be __m512i, __m256h, __m128d, __v8sf, __v16qi, or whatever else is valid
  // in calls to an intrinsic on this target and compiler. It should be the smallest
  // option (e.g., __m128 for 16B or less, __m256 for 32B, etc.). For partial registers
  // (e.g., simd<float, 3>) it will give a whole register (e.g., __m128, not [[gnu::vector_size(12)]]).
  using register_type = /* impl-type */; 
};

I don't think that the member register type should be permitted for simd<> objects which are too big to fit a register, as that implies the data fits one register when the implementation might choose to use several discontiguous or mixed-size registers.

I can see that using std::bit_cast is a neat way to avoid having to put ctors or conversion operators into std::simd itself, but should we this for programmers to make it as easy as possible to call an intrinsic when they need to? I think we should keep the implementation defined constructor, but tweaked to:

explicit constexpr simd::simd(register_type r);

and to retrieve a value I always like named accessors like this:

simd::register_type to_register() const;

Providing an accessor ensures that the value is correctly retrieved out of the simd into a valid register for the target, it makes the intent of the code clear, and it ensures the best register type is chosen to interact with intrinsics. It then allows overloading to be used to select from several different options without complicated if-else conditionals being needed, which makes the code simpler:

__m128 intel_addsub(__m128 lhs, __m128 rhs) { return _mm_addsub_ps(lhs, rhs); }
__m256 intel_addsub(__m256 lhs, __m256 rhs) { return _mm256_addsub_ps(lhs, rhs); }
__m512 intel_addsub(__m512 lhs, __m512 rhs) { return _mm512_addsub_ps(lhs, rhs); }
__m128d intel_addsub(__m128d lhs, __m128d rhs) { return _mm_addsub_pd(lhs, rhs); }
__m256d intel_addsub(__m256d lhs, __m256d rhs) { return _mm256_addsub_pd(lhs, rhs); }
__m512d intel_addsub(__m512d lhs, __m512d rhs) { return _mm512_addsub_pd(lhs, rhs); }

typename <std::floating_point FP>
simd<FP> addsub(simd<FP> lhs, simd<FP> rhs) { 
  return simd<FP>(intel_addsub(lhs.to_register(), rhs.to_register()); 
}

Once we have agreed on what the minimal support is needed to invoke intrinsics with small simd objects, then we can separately address how to break big simd objects up in ways which allow the small calls to be invoked.

mattkretz commented 1 year ago

One reason why I tried to be very conservative is that experience with "native handle" functions in standard library types has been a source of problems and has been mentioned again and again as a "don't repeat that mistake again":

it's not going to fly easily through WG21; there will be push back
even though it doesn't return a reference it still bakes a certain type into the ABI; implementations can't easily innovate anymore

I'm sure it's not much work to write a user-defined non-member to_intrin(simd) function:

typename <std::floating_point FP>
simd<FP> addsub(simd<FP> lhs, simd<FP> rhs) { 
  return to_simd(intel_addsub(to_intrin(lhs), to_intrin(rhs)); 
}

By keeping this part of the standard vague the type can much better stick to being an abstraction of a data-parallel type, rather than a manifestation of a CPU register. :wink:

DenisYaroshevskiy commented 1 year ago

I'm sure it's not much work to write a user-defined non-member to_intrin(simd) function:

I think based on the platform it will be a lot of work, like a lot. Especially for non native register sizes.

One random way we can specify it is:

Function behaving as is

template <typename To, typename T, std::size_t N>
  requires std::trivially_default_contructible<T> && std::trivially_copiable<T> &&
                 (sizeof(T) * N <= sizeof(To))
 To simd_cast(std::simd<T, std::simd_abi::fixed<N>> x) {
    std::array<T, N> buf;
    x.copy_to(buf.data());
    To res; /*unspecified*/
    std::memcpy(&res, buf.data(), N);
    return res;
 }

(There is probably a UB somewhere but you get my point).

And then we encourage a specialized implementations for native intrinsic types that people might want to use.

danieltowner commented 1 year ago

One reason why I tried to be very conservative is that experience with "native handle" functions in standard library types has been a source of problems and has been mentioned again and again as a "don't repeat that mistake again".

I can see this is tricky, and having something abstract is cleaner and easier to deal with. But I think we also need to be pragmatic and accept that intrinsics and target interaction are inevitable, and we should make that as easy as possible. But if we can't get that view past the committee then we don't have any choice but to accept it.

At the moment the audience of Intel's simd implementation is experienced programmers who would normally use intrinsics, and are using std::simd for the easier syntax, but don't want to lose control over the power of the more unusual intrinsics that targets often have. It's the question that they raise again and again - how do they call the favourite intrinsic without inventing their own mechanism?

I agree that it is simple to have user-defined to_simd and to_intrin functions, but if we make the end user do that themselves we end up with copies of this utility function across many different code bases, and potentially with subtle issues in them. Or we allow the implementation to define these in which case we get a standard by stealth, where everyone knows that you can call those functions for a particular compiler, but they aren't actually in std::simd and their interaction with std::simd is undefined. That almost seems worse to me.

I like your point about innovation, and in an abstract way I agree that tying down the mechanism could be bad. But practically, is there any simd target which wouldn't have at least some basic level of being able to define a container (or register) for interacting with its intrinsics?

DenisYaroshevskiy commented 1 year ago

Maybe even

template <typename To>
  requires std::trivially_default_contructible<To> && std::trivially_copiable<To> &&
                 (sizeof(T) * N <= sizeof(To))
To simd_cast(std::simd<T, std::simd_abi::fixed<N>> x) {
  To res;
  x.copy_to((T*)&res);
  return res;
}

DenisYaroshevskiy commented 1 year ago

Another suggestion:

What if we just add a Note with suggested conversions? Like we cannot specify them formally but we can help people provide a good interface.

Smth like:

NOTE:

If simd::size() < sizeof(OutputType) we recommend to store the result as if:
```
std::array<T, std::max(N, sizeofOutputType / sizeof(T)> buf;
x.copy_to(buf.data());
return std::bit_cast<OutputType>(buf);
```
(what is in the remainging bits of the buf is unspecified and can be different each time). This helps users to know where the elements they need in the register
For x64 we recommend: if simd::size() <= 16 and std::isintegral -> convertible to and from m128i if simd::size() <= 16 and std::same_as<float, simd::value_type> -> convertible to and from m128 if simd::size() <= 16 and std::same_as<double, simd::value_type> -> convertible to and from __m128d

if current platfrom supports AVX and simd::size() * sizeof(simd::value_type) == 32 if std::isintegral -> convertible to and from m256i if std:same_as<float, simd::value_type> -> convertible to and from m256 if std::same_as<double, simd::value_type> -> convertible to and from __m256

So basically it's the same approach as now in the paper but we nodge them to do the thing that'd be useful.

mattkretz / std-simd-feedback

Any way to specify how the registers are store internally? #77