mattkretz / std-simd-feedback

Feedback from simd in the Parallelism TS 2
0 stars 0 forks source link

allow native<T> to be fixed_size<T, N>? #59

Closed mattkretz closed 1 year ago

mattkretz commented 1 year ago

@danieltowner After Issaquah fixed_size<T, simd<T>::size()> is equivalent to simd<T> in every aspect. But they are not the same type. It was necessary in the TS to be different, because they had different behavior wrt. conversions and ABI. I have a feeling we should allow implementations to implement fixed_size<T, 1> as scalar and have native<T> and fixed_size<T, simd<T>::size()> be the same type. But if we only allow it there may be portability issues between implementations. So mandate them to be the same? What do you think?

mattkretz commented 1 year ago

Note that on AVX-512 I still have (at least) two possible ABI tags for <float, 8>: using bit masks or vector masks. And <float, 16> could be zmm or 2 ymm registers in addition to different masks. So <T, N> isn't necessarily distinct even without changing compiler flags.

mattkretz commented 1 year ago

Ah, in other words, I was probably asking for renaming simd_abi::deduce_t<T, N> to simd_abi::fixed_size<T, N>. It may be strange for users to not see any mention of "fixed" in the resulting fixed_size_simd<T, N> type. There's another consideration, can T and N be deduced in the following declaration of f?

template <typename T, std::size_t N>
void f(std::fixed_size_simd<T, N> v);

With the TS it is required to. Whereas

template <typename T, std::size_t N>
void f(stdx::simd<T, stdx::simd_abi::deduce_t<T, N>> v);

will not deduce N.

mattkretz commented 1 year ago

Sorry, for my string of thoughts here. So why did we have this requirement in the TS for fixed_size? If fixed_size is an alias for all kinds of "native" / implementation-defined ABI tags, then why would anyone want to limit the set of ABI tags to fixed_size_simd? Makes no sense. If you want "any simd", deduce the ABI tag instead of the width. So let's also drop the requirement that fixed_size can be deduced.

danieltowner commented 1 year ago

Note that on AVX-512 I still have (at least) two possible ABI tags for <float, 8>: using bit masks or vector masks. And <float, 16> could be zmm or 2 ymm registers in addition to different masks. So <T, N> isn't necessarily distinct even without changing compiler flags.

We differ in our implementations then. When AVX-512 is enabled we only ever have exactly one implementation - using zmm registers with predicate registers.

Do you have multiple ABIs so that you can compile several translation units with different targets, but make them use the same ABI to allow cross translation-unit calls?

For high performance code - what I think of as the intended target of std::simd - we typically compile the entire software product with a single target, so every translation unit gets the same ABI and any function in the system can call any other function without caring about managing cross-translation-unit boundaries sharing a common ABI.

When we really do want a multi-target executable/library (e.g., fat binary), then we define the interfaces in terms of something that won't change based on different types of target simd support (e.g., pass memory pointers).

I can see that allowing multiple ABIs could allow greater flexibility, but does it make things too complicated?

danieltowner commented 1 year ago

@danieltowner After Issaquah fixed_size<T, simd<T>::size()> is equivalent to simd<T> in every aspect. But they are not the same type. It was necessary in the TS to be different, because they had different behavior wrt. conversions and ABI. I have a feeling we should allow implementations to implement fixed_size<T, 1> as scalar and have native<T> and fixed_size<T, simd<T>::size()> be the same type. But if we only allow it there may be portability issues between implementations. So mandate them to be the same? What do you think?

In the Intel implementation we think of everything as being a fixed size anyway. There is an underlying implementation class that manages a simd of a given size. When that size matches a native register size it will use a single register, otherwise it will use as many full or partial registers as needed. So for Intel's library there is no advantage to making a distinction between a native simd and a fixed_size<simd<T>::size> since they are identical underneath anyway and generate identical code. Making them explicitly the same type seems okay to me. Do you see any advantage to an implementation in having a native and a fixed_size of the native size be different types?

In theory I guess that means scalar is effectively an alias for fixed_size<T, 1>, but in practice scalar is broken in our implementation anyway and no one has ever complained about that omission.

mattkretz commented 1 year ago

Do you have multiple ABIs so that you can compile several translation units with different targets, but make them use the same ABI to allow cross translation-unit calls?

That is one reason to have them. The other is that there are performance trade-offs and it's not an obvious choice which one is the best. But using those ABI tags requires the use of implementation-defined types (which can be done in one central place of the user code).

I can see that allowing multiple ABIs could allow greater flexibility, but does it make things too complicated?

Using the standard API, users will basically get the experience that you implemented. This would really be an expert-level tuning knob.

In the Intel implementation we think of everything as being a fixed size anyway. There is an underlying implementation class that manages a simd of a given size. When that size matches a native register size it will use a single register, otherwise it will use as many full or partial registers as needed. [...]

That sounds like the names of your ABI tags don't necessarily reflect all possible differences in ABI? (Example simd_mask<float, _Abi<float, 8>> will produce the same name in all translation units even if compiled without AVX, or with AVX-512, where it has very different representation.) But I may be reading more into your answer than you meant to say.

Specifically for my implementation (not the TS implementation):

danieltowner commented 1 year ago

Note that on AVX-512 I still have (at least) two possible ABI tags for <float, 8>: using bit masks or vector masks. And <float, 16> could be zmm or 2 ymm registers in addition to different masks. So <T, N> isn't necessarily distinct even without changing compiler flags.

I was re-reading R4 draft and I still have trouble with this statement. If you had AVX-512, why would you want to restrict to ymm and avoid using the predicate register? If you wanted ymm only, compile for AVX or AVX2 instead.

mattkretz commented 1 year ago

The AVX-512 instruction set is more complete than AVX2, so if, for whatever reason, you want to avoid use of zmm registers, AVX-512 is still interesting. One reason to avoid zmm can be the down-clocking happening on Skylake (and less so on later Intel CPUs). If your SIMD code is interleaved with scalar code, the latter also becomes slower because the clock is lower. Same reason why compilers auto-vectorize to ymm even with AVX-512 available, I assume.

Avoiding the mask registers can be more efficient in the few cases where you need to turn the mask back into a value. This might be interesting for squeezing out the last cycle of an inner loop. Not a thing that's interesting in general.

Thing is, it simply falls out of the way I designed my ABIs: The name of the class template identifies vector mask vs. bit mask. The template parameter identifies how many Bytes the object needs. This is interesting because of the <float, 8> example above. Depending on compiler flags it's one or the other ABI. And they have to be different names according to the TS.

danieltowner commented 1 year ago

I think that I would treat the avoidance of zmm as a compiler command line option, rather than something that the programmer deals with in their source code.

clang is already good at replacing what it considers isolated poor use of mask registers with a vector equivalent, so I'm nor sure that you would want to do it across an entire code base.

mattkretz commented 1 year ago

Yes, and if you do that via compiler command line option the ABI tag must have a different name. So the two different ABI tags exist anyway. By not "hiding" any of the ABI tags that theoretically can work for a given set of compiler flags the developer has more freedom to do something special, if necessary.

:+1: for clang. GCC sometimes isn't as smart (yet). You also wouldn't have to use it across an entire code base. Stupid example:

simd<float> x = ...
#ifdef __GLIBCXX__
const simd<float, simd_abi::_VecBuiltin<sizeof(x)>> y = x;
sum += y < 0;
#else
sum += x < 0;
#endif
// continue on with x

Think of it as an intermediate step before resorting to intrinsics or inline asm. Close to nobody will do this/think of this.

danieltowner commented 1 year ago

I'm okay with the target being encoded into the simd name, so as you note the implementation can map to different type encodings which capture that information.

The ABIs you choose for your implementation are going to be different to the ones that I choose for Intel's implementation, so as soon as the user starts writing code that uses those special ABI names you've lost portability. I'm not sure that allowing the programmer to assume the existence of, or use, what are (to me at least) implementation details is useful.

The intel implementation is more restrictive than yours because our implementation splits the responsibilities: the programmer chooses the size of simd they want (where native size is an alias for the compiler's preference), and the compiler figures out how to deliver that size for the given target. We don't give the programmer the option to override the compiler's choice of ABI for its implementation, other than by changing the command line target options (e.g., obviously through -march, but also potentially through tweaks like the mprefer-vector-width command line option), so within a single translation unit all that can change is size and type. No one in either our internal or external user base has asked for more control than this, even from programmers who have been developing simd code for decades.

If we said that the implementation of a given size of simd is decided entirely by the compiler based upon the target and any additional hints, then that means that within a translation unit, the only thing the user and library care about is type and size. Any use of a simd type within a single translation unit will be consistent, and we don't have to worry about what happens when we try to change from one ABI to another within the same code. This in turn means we don't need the complications of resize_t, rebind, deduce, and so on since everything is just a fixed_size_simd<T, N>.

mattkretz commented 1 year ago

The ABIs you choose for your implementation are going to be different to the ones that I choose for Intel's implementation

Sure. "implementation-defined" however also means that implementations must document these. So users can make informed choices. There's also a possibility that implementations agree on the names of ABI tags. Similar to many other GCC & Clang extensions.

We don't give the programmer the option to override the compiler's choice of ABI for its implementation

Do I understand correctly, that you do have different names for different -m flags, but hide the definitions/declarations of them depending on -m flags? Because otherwise your implementation effectively says "mixing TUs compiled with different -m flags is UB" (which it is by the letter of the standard, but implementations typically try to make it work anyway).

danieltowner commented 1 year ago

Actually the implementation doesn't currently encode the ABI, but for performance code we recommend you compile all TU's with the best possible target option anyway, so their ABIs will all match anyway. It's no big deal to add target specific ABI flag though.

Other than the example you already gave (which you say is unlikely to ever be used anyway), why else would you want to mix and match ABIs within a single TU?

What I'm really suggesting is that if we restricted each TU to have only one ABI then it simplifies things.

mattkretz commented 1 year ago

I don't care so much for "mix and match ABIs within a single TU", but for linking TUs compiled with different -m flags. Which mixes and matches at the Linker level. With the goal of supporting fat binaries that dispatch after reading cpuid. If you don't keep this in mind while designing your ABI tags (and also all your inline functions, which the compiler might turn into a function call) there's a good chance the linker will happily link and throw away duplicate symbols which were compiled for different ISAs and potentially also link interfaces that don't really fit.

So, my point is that it's not enough to think of a single TU. You have to think of shared & static libraries, and many TUs all compiled at different places with different flags linked together and not blowing up. Once the simd type is common vocabulary for all C++ programs there's a good chance it will appear all over the place. And there will be binary distribution...

danieltowner commented 1 year ago

So I added ABI tags, with one tag per TU chosen according to the target ISA, and encoded into the simd class. In the godbolt below each TU gets its own target-specific name for the same function. Is that sufficient to solve the linking issue you are concerned with?

image

mattkretz commented 1 year ago

Yes :+1:

danieltowner commented 1 year ago

So deduce goes, because it can be completely replaced by fixed_size_simd. What about resize_t and rebind - do they go too?

mattkretz commented 1 year ago

I still find rebind_simd_t<int, floatv> easier to read and write than fixed_size_simd<int, floatv::size()>. But that's maybe because I still need to unlearn all the extra meaning fixed_size carried in the TS.

Do you think fixed_size_simd is still the best choice of name if LEWG agrees with the change in direction? What about sized_simd<int, floatv::size()>? Or we go all in with the size interface and rename the simd class template to basic_simd and the fixed_size_simd alias to simd.

mattkretz commented 1 year ago

A consequence of renaming simd to basic_simd would be that

template <class T, class Abi>
void f(simd<T, Abi> x) { ... }

will not become

template <class T, size_t N>
void f(simd<T, N> x) { ... }

(because N is not deducible) but rather have to be

template <class T, class Abi>
void f(basic_simd<T, Abi> x) { ... }

Which probably isn't all too terrible with the basic_string etc. precedents we have in the stdlib.

danieltowner commented 1 year ago

Why can't it be <T,size>? std::array does that, so what is different about it? Frankly I don't know enough about why deducibility is needed here to understand the issue.

If I had a clean slate, I'd have:

simd<float> Native size for target simd<float, 3> Specified size

I think with rebind_simd_t it mostly (for me anyway) shows up contexts where I only have simd<T, Abi> to hand, so I have to write:

rebind_simd_t<newType, simd<T, Abi>>

so it turns out to be simpler to say:

simd<newType, Abi>

instead. That feels closer to what I often want - given a local simd type, change its value type but not its size.

mattkretz commented 1 year ago

The way array<T, N> is passed to functions never changes with any compiler flags. For simd<T, N> that's not the case. https://compiler-explorer.com/z/eq46hWM96. If there was no additional ABI tag template parameter, simd<T, N> would be a much bigger footgun than the current simd<T, Abi> already is.

The other reason why I actually prefer <T, N> to not be easily accessible is that users immediately start writing code that doesn't scale with the SIMD width. What is missing is a partial CTAD like:

template <typename T, typename Abi>
void f(simd<T, Abi> x) {
  simd<int> i = x;  // deduce Abi for i, like rebind_simd_t<int, simd<T, Abi>>

Which is why I recommend the pattern at the top of https://en.cppreference.com/w/cpp/experimental/simd.

mattkretz commented 1 year ago

Oh, and if you consider SandyBridge, simply reusing the Abi tag of a simd<float, Abi> for a simd<int, Abi> will break. You need the rebind_simd_t for portability.

danieltowner commented 1 year ago

Oh, and if you consider SandyBridge, simply reusing the Abi tag of a simd<float, Abi> for a simd<int, Abi> will break. You need the rebind_simd_t for portability.

Sorry, maybe I'm missing something - why does it break?

mattkretz commented 1 year ago

OK, it doesn't have to break. In this case it's my implementation hiding the existence of an ABI tag just because there are no integer instructions for ymm registers. I should reconsider that choice.

danieltowner commented 1 year ago

The way array<T, N> is passed to functions never changes with any compiler flags. For simd<T, N> that's not the case. https://compiler-explorer.com/z/eq46hWM96. If there was no additional ABI tag template parameter, simd<T, N> would be a much bigger footgun than the current simd<T, Abi> already is.

The other reason why I actually prefer <T, N> to not be easily accessible is that users immediately start writing code that doesn't scale with the SIMD width. What is missing is a partial CTAD like:

template <typename T, typename Abi>
void f(simd<T, Abi> x) {
  simd<int> i = x;  // deduce Abi for i, like rebind_simd_t<int, simd<T, Abi>>

Which is why I recommend the pattern at the top of https://en.cppreference.com/w/cpp/experimental/simd.

Okay, I see.

Telecomms systems actually have lots of places where data is processed in blocks of specific sizes, so fixed_size_simd allows us to write code which doesn't need to care about how that size correlates with the target's capabilities. So, I don't want that mechanism to be too inaccessible.

mattkretz commented 1 year ago

Is there any chance I could see code like it somewhere? I'd really be interested to understand the code patterns you're dealing with.

mattkretz commented 1 year ago

FWIW, I just worked on the GNU Radio 4.0 core redesign with some colleagues at work. Isn't that kind of "signal processing" work very similar?

danieltowner commented 1 year ago

I can't share any of our code with you, sorry. The sorts of problems we see:

danieltowner commented 1 year ago

OK, it doesn't have to break. In this case it's my implementation hiding the existence of an ABI tag just because there are no integer instructions for ymm registers. I should reconsider that choice.

Will rebind_simd_t<NewType, simd<OldTypeAbi>> ever give something different to simd<NewType, OldTypeAbi>?

mattkretz commented 1 year ago

I can't share any of our code with you, sorry.

NP. I didn't expect you to.

I'm coming from an HPC, process-lots-of-data-in-a-very-similar-way perspective on SIMD. That surely influenced what I considered important use cases. (Section 2 of https://wg21.link/p0851 documents my design guidelines.) So if there's any chance that I'm missing important uses cases that require different API/design guidelines I'd really like to learn.

Will rebind_simd_t<NewType, simd<OldTypeAbi>> ever give something different to simd<NewType, OldTypeAbi>?

Yes, https://compiler-explorer.com/z/czM1veh35

danieltowner commented 1 year ago

I can see that the underlying ABI changes in some circumstances giving different types for rebind and fixed_size: https://compiler-explorer.com/z/5dbqGzxTh

Why does it need to do that? If the ABI changes, does that means that there is a behaviour change or difference in the generated code? Does that make it confusing for the user, since two types, both representing the same number of short values, are somehow different, and potentially generate difference code?

I looked through the latest document and can't find anything which uses rebind_simd_t anyway,

mattkretz commented 1 year ago

I gave you a link to the TS implementation. So the TS spec is relevant, which asks (well, it can't be normative) for the fixed_size ABI to turn the simd type into an ABI stable type (i.e. can be passed over ABI boundaries). So basically fixed_size in the TS was a second universe in parallel to the implementation-defined ABI tags.

What I did for re(bind|size)_simd in the TS was to prefer an implementation-defined ABI tag, or, if none is defined, fall back to fixed_size. If I had more time (or man-power) you wouldn't get to see fixed_size on rebind/resize at all. But you'd get to see something very similar to what the TS fixed_size does. At this point I'm leaving the TS implementation in maintenance mode, to focus on C++26...

danieltowner commented 1 year ago

I'm coming from an HPC, process-lots-of-data-in-a-very-similar-way perspective on SIMD. That surely influenced what I considered important use cases. (Section 2 of https://wg21.link/p0851 documents my design guidelines.) So if there's any chance that I'm missing important uses cases that require different API/design guidelines I'd really like to learn.

I've read through that. I found the convention of using horizontal/vertical in the opposite way to what I typically see quite confusing. Maybe you can use the terms map/reduce instead to make it clearer?

A few of the design and implementation guidelines that I used in the Intel implementation:

mattkretz commented 1 year ago

Or we go all in with the size interface and rename the simd class template to basic_simd and the fixed_size_simd alias to simd.

That's what we did in LEWG.