google / highway

Performance-portable, length-agnostic SIMD with runtime dispatch
Apache License 2.0

RFC: can we replace HWY_SCALAR with the new HWY_EMU128? #635

Closed · jan-wassenberg closed this issue 2 years ago

jan-wassenberg commented 2 years ago

Hi @boxerab @veluca93 and anyone else interested: Requesting your input on a subtle potential change of behavior. HWY_SCALAR was not able to support all ops (see #590). HWY_EMU128 (#634) is a new and improved fallback path that can arguably replace it.

Advantages: it supports all ops, and it guarantees the availability of 128-bit vectors, like every target other than HWY_SCALAR.

Replacing HWY_SCALAR with HWY_EMU128 would mostly be transparent; both are implemented in normal C++. The main difference is that HWY_SCALAR supports only a single lane, whereas HWY_EMU128 behaves like a 128-bit vector.
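
To make the difference concrete, here is a minimal sketch using the public tag API (the function name is illustrative only):

```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

void InspectLaneCount() {
  const hn::ScalableTag<float> d;
  // HWY_SCALAR: Lanes(d) == 1. HWY_EMU128, like every SIMD target,
  // guarantees at least 128 bits, i.e. Lanes(d) == 4 for float.
  const size_t N = hn::Lanes(d);
  (void)N;
}
```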

We'd like to (before v1.0) remove HWY_SCALAR, and thus all the "except HWY_SCALAR..." caveats in the documentation and the #if HWY_TARGET != HWY_SCALAR checks in code. HWY_COMPILE_ONLY_SCALAR would still select the non-SIMD target (now HWY_EMU128 instead of HWY_SCALAR).

Any concerns with this change, or suggestions on how to go about it?

veluca93 commented 2 years ago

My main concern is how to deal with the tail ends of arrays, e.g. if you want to process a vector of 4001 floats.

Edit: to clarify, as of today you can just templatize over the D/vector type and process the tail end one element at a time; I'm not sure whether you could still do this.

boxerab commented 2 years ago

@jan-wassenberg thanks for looping me in. I have code to disable targets for one routine because it gives different results depending on whether FMA (fused multiply-add) is enabled on the hardware.

hwy::DisableTargets(uint32_t(~HWY_SCALAR));  // disables every target except HWY_SCALAR

Would I just replace HWY_SCALAR with HWY_EMU128 here?

jan-wassenberg commented 2 years ago

Thanks for having a look!

> as of today you can just templatize over the D/vector type and process the tail end one element at a time; I'm not sure whether you could still do this.

Yes indeed, that's still possible with hwy::CappedTag<T, 1>, which is supported by all targets. Does that resolve your concern? There are also some suggestions for tail handling.
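
For instance, a remainder loop along these lines (a sketch; Double is just an illustrative kernel):

```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Doubles `count` floats in place: full vectors for the bulk, then
// single-lane vectors (supported by every target) for the tail.
void Double(float* HWY_RESTRICT p, const size_t count) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  size_t i = 0;
  for (; i + N <= count; i += N) {
    const auto v = hn::Load(d, p + i);
    hn::Store(hn::Add(v, v), d, p + i);
  }
  const hn::CappedTag<float, 1> d1;  // one lane at a time for the tail
  for (; i < count; ++i) {
    const auto v = hn::Load(d1, p + i);
    hn::Store(hn::Add(v, v), d1, p + i);
  }
}
```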

> I have code to disable targets for one routine because it gives different results depending on whether FMA is enabled on the hardware.

Ah, how about we add a HWY_NATIVE_FMA macro for that? That would be less brittle, because older ARMv7 CPUs also lack FMA. But yes, you could replace HWY_SCALAR with HWY_EMU128 in the short term.
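
For illustration, assuming the proposed macro follows Highway's per-target #ifdef convention (a sketch, not the final API; the tolerance values are made up):

```cpp
// Inside per-target code: pick a test tolerance based on whether this
// target has hardware FMA (single rounding) or emulates it (two).
float MaxTestError() {
#ifdef HWY_NATIVE_FMA
  return 0.0f;   // fused multiply-add: bit-exact vs. FMA golden data
#else
  return 1E-6f;  // separate mul+add: allow a small epsilon
#endif
}
```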

In case someone else is similarly affected: what do you think of actually replacing the meaning of SCALAR with that of EMU128, i.e. just replacing the scalar implementation with the new one and keeping the name HWY_SCALAR for the new behavior?

veluca93 commented 2 years ago

> Yes indeed, that's still possible with hwy::CappedTag<T, 1>, which is supported by all targets. Does that resolve your concern?

Sounds good to me (although I guess Capped<1> would have the same problem of not working with many ops)


jan-wassenberg commented 2 years ago

> Sounds good to me (although I guess Capped<1> would have the same problem of not working with many ops)

OK :) FWIW, Capped does support more (almost all) ops, because it tends to use the same full-width instructions except where there are side effects: for example, writes to memory are masked or use smaller-width stores.
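
A sketch of that behavior (assuming the usual hn = hwy::HWY_NAMESPACE alias):

```cpp
// Capped tags compute with full-width instructions, but memory side
// effects touch only Lanes(d) elements.
void SqrtTwo(const float* HWY_RESTRICT in, float* HWY_RESTRICT out) {
  const hn::CappedTag<float, 2> d2;  // at most 2 lanes
  const auto v = hn::Load(d2, in);   // reads exactly 2 floats
  hn::Store(hn::Sqrt(v), d2, out);   // writes exactly 2 floats
}
```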

jan-wassenberg commented 2 years ago

@boxerab the HWY_NATIVE_FMA flag is in.

Looks like we can move forward with #634, which will replace HWY_SCALAR with HWY_EMU128 unless HWY_COMPILE_ONLY_SCALAR is defined. After that, I think we can and should retire HWY_SCALAR and have HWY_COMPILE_ONLY_SCALAR request HWY_EMU128.

boxerab commented 2 years ago

@jan-wassenberg thanks, HWY_NATIVE_FMA will be useful. In my case, I disable SIMD because my unit tests sometimes fail: the VM running the tests may have a different instruction set from the machine that generated the test files, so if the VM doesn't use FMA, the output is flagged as an error. I just need to add a special unit-test command-line arg to disable FMA, while keeping it in my actual working codec.
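
Something along these lines could work for the test harness (a sketch; the function name and the choice of targets are assumptions):

```cpp
#include "hwy/targets.h"

// With a --no-fma test flag, keep runtime dispatch away from the
// FMA-capable x86 targets so the VM matches the machine that
// generated the golden files.
void ConfigureTargetsForTests(const bool no_fma) {
  if (no_fma) {
    hwy::DisableTargets(HWY_AVX2 | HWY_AVX3);  // both use FMA by default
  }
}
```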

jan-wassenberg commented 2 years ago

YW :) Got it. I suppose it's hard to make the conformance test tolerate small epsilons? JPEG XL does that, as did the original JPEG, for that matter.

FWIW there is a HWY_DISABLE_BMI2_FMA flag, but it only affects AVX2. If you never run AVX-512 (or also disable it), that might be sufficient?
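
For reference, a sketch of how that flag is used (the define must be visible before Highway's headers are compiled):

```cpp
// Build the AVX2 target without BMI2/FMA; AVX-512 would still need to
// be avoided or disabled at runtime (e.g. via hwy::DisableTargets).
#define HWY_DISABLE_BMI2_FMA
#include "hwy/highway.h"
```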

boxerab commented 2 years ago

thanks - there is some tolerance in the lossy conformance tests. I have to dig deeper into that, but there are too many other things at the top of the todo list :)

jan-wassenberg commented 2 years ago

:)

FYI the change is currently rolled back because it exposes a bug in libjxl (using x86-specific instructions when it shouldn't). We'll patch that first, then get the EMU128 change in again.

yzazik commented 2 years ago

@jan-wassenberg If EMU128 won't be using any intrinsics (at least explicitly), then why not emulate a less restrictive architecture like RVV (i.e. an EMURVV)?

jan-wassenberg commented 2 years ago

@yzazik our main goal was to uphold the guarantee that all vectors are at least 128 bits and all ops are supported. Are you proposing that the "emu128" target set HWY_HAVE_SCALABLE to 1 and choose some configurable size at compile time? That could indeed be done. I suppose we should rename it to HWY_EMU then? ;) Or any better suggestion for the name?

Is there any other restriction that you'd like to see lifted? I think EMU128 already supports everything except hardware FMA (which we could actually emulate, but probably at a steep cost).

yzazik commented 2 years ago

@jan-wassenberg Here are my biggest philosophical issues with HWY. First, the high-cost policy: cost is relative. There are some things (operations) that simply have to be done, there is just no other way; they will either be done no matter the cost, or not at all. And "not done at all" just means SIMD is not faster than SCALAR below a certain vector size, but it could still be done for future wider vectors. Second, the small API set: a small, fast API does not mean the whole program runs fast. Sure, choosing only fast ops means the API is fast, but emulating the slower functionality on top of it still runs slowly. So if there is no slower API, there will be slower external implementations, and more external implementation means less value from HWY. Look at the STL: the API is huge, but you learn it and still use only the part that is fast for the task at hand.

Let's say I need to shift all lanes of the vector by one lane (either left or right). In hardware (SIMD) it's all great on 128-bit, but not on 256+ (I mean the multiblock mess on AVX2 and AVX-512).
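
For concreteness, a sketch of the per-block behavior being described (assuming the usual hn alias; the lane diagram is for AVX2 with float):

```cpp
template <class D>
hn::Vec<D> ShiftRowLeft(D d, hn::Vec<D> v) {
  // On a 256-bit target this shifts within each 128-bit block:
  //   0 1 2 3 | 4 5 6 7  ->  0 0 1 2 | 0 4 5 6
  // The lane at each block boundary is zeroed, not carried across.
  return hn::ShiftLeftLanes<1>(d, v);
}
```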

In other words, if I have SIMD256-polished HWY code (with all the workarounds to make that shift seamless across blocks), EMU128 won't help, since the permute workarounds may not work at half the EMU128 size (would they?). Even an EMU256 that behaved like EMU128 (i.e. seamless across blocks) wouldn't help, because the workarounds expect the non-seamless behavior. Only an EMU256 that works like SIMD256 (with the multiblock mess) would, which means either implementing a SIMD256-specific EMU256 or reimplementing my SIMD256 HWY code to work with SIMD128. And I want to keep SIMD256, because even with all the workarounds it runs maybe 30-40% faster than SIMD128. Implementing a multiblock-messy EMUxxx that behaves the same as each SIMDxxx is doable, but I guess it would require as many EMUs as SIMDs. I'm not sure HWY_HAVE_SCALABLE=1 would help without multiblock-seamless abstractions in the code. Implementing multiblock-seamless SIMD for all vector sizes would only need one EMU, which would have been wonderful, but that seems like a different philosophy (a different library?).

Now, let's say I have SIMD128 HWY code for some image processing, with a loop over the number of 128-bit vectors in a line (i.e. over the image width) that does the processing. Then, say, I want the same code to run on EMU128 as scalar, or with a chance of auto-vectorization. It would be the same loop, plus one more loop internally over the number of lanes in the vector (that's how EMU128 works, right?), and two nested loops might just break the auto-vectorizer's pattern recognition and possibly be slower as scalar code too. But if my SIMD128 code were written as RVV (or something else with a variable vector size), with the vector size set to the line width, I would have only one inner loop (assuming the optimizer swallows the outer loop of one iteration per line); that might be better for auto-vectorization, and even for scalar.
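
A sketch of the loop shape in question (width is assumed to be a multiple of the vector size; the kernel is illustrative):

```cpp
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// Outer loop: one vector at a time across the row. Under EMU128, each
// op then loops over its 4 float lanes internally - the nested loop
// being discussed.
void ScaleRow(float* HWY_RESTRICT row, const size_t width) {
  const hn::ScalableTag<float> d;
  const auto gain = hn::Set(d, 1.5f);  // arbitrary example coefficient
  for (size_t x = 0; x < width; x += hn::Lanes(d)) {
    hn::Store(hn::Mul(hn::Load(d, row + x), gain), d, row + x);
  }
}
```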

I have already gotten used to the idea that having the original (non-HWY) scalar code under HWY_SCALAR is better: for historical reasons, as a working description of the algorithm, and as a fallback that everyone knows what to do with. All in all, as I see it, EMU128 is better than the deficient HWY_SCALAR, but only useful if you plan to have a SIMD128 implementation. Meaning, if I develop for PC, AVX probably already takes SIMD128 out of the picture, so it's NEON for Android only? HWY_HAVE_SCALABLE=1 with EMU<256> won't make my SIMD256 code work, so I'm not really sure I'm proposing that (unless it would).

jan-wassenberg commented 2 years ago

Let's see if I understand your points correctly:

"A relatively small API means users have to implement some things themselves". I agree with this and we are constantly expanding the set of ops, even higher-level STL algorithms (contrib/algo). The guiding policy we have is to minimize performance cliffs (i.e. things that are unavoidably slow on several platforms, or ops that have faster user-level alternatives).

> There are some things (operations) that simply have to be done, there is just no other way; they will either be done no matter the cost, or not at all.

Do you have examples of such operations?

> I mean the multiblock mess on AVX2 and AVX-512

We've chosen to include "independent blocks" semantics in the API because it's pretty easy to match that on SVE/RVV, but hard to do the opposite on x86. This means that even an EMU256 would have two independent 128-bit blocks.

I think you're saying such an EMU256 is actually desirable for your use-case?

> Implementing a multiblock-messy EMUxxx that behaves the same as each SIMDxxx is doable, but I guess it would require as many EMUs as SIMDs

I think it could all be done in one implementation, using either nested loops (per-block, then per-lane) or a loop that figures out which block a lane belongs to. Now that I think about it more, it's probably good to keep HWY_HAVE_SCALABLE=0, because then we know the size of the EMU target at compile time and users can rely on that.
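
A sketch of that idea (hypothetical code, not Highway internals): one emulated shift-by-one-lane that honors independent 128-bit blocks for any vector size, via nested per-block / per-lane loops.

```cpp
template <typename T, size_t kNumLanes>
void EmuShiftLeftLanes1(const T* HWY_RESTRICT in, T* HWY_RESTRICT out) {
  constexpr size_t kPerBlock = 16 / sizeof(T);  // lanes per 128-bit block
  for (size_t block = 0; block < kNumLanes / kPerBlock; ++block) {
    const size_t base = block * kPerBlock;
    out[base] = T{0};  // zero shifted in at each block boundary
    for (size_t i = 1; i < kPerBlock; ++i) {
      out[base + i] = in[base + i - 1];  // never crosses the block
    }
  }
}
```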

> two nested loops might just break the auto-vectorizer's pattern recognition

Unfortunately the auto-vectorizer is unreliable anyway - see the trainwreck even for a simple use case in #264.

> I have already gotten used to the idea that having the original (non-HWY) scalar code under HWY_SCALAR is better...

Ah, you are using normal C++ code (not even Highway's scalar ops) under #if HWY_TARGET == HWY_SCALAR? That would be a reason to at least keep HWY_SCALAR, but perhaps we could still remove the ops in scalar-inl.h.
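
The pattern in question, as I understand it (a sketch; Transform and the kernel are illustrative, and n is assumed to be a multiple of the vector size in the SIMD branch):

```cpp
#if HWY_TARGET == HWY_SCALAR
// Original plain-C++ code, kept as the reference implementation.
void Transform(float* p, size_t n) {
  for (size_t i = 0; i < n; ++i) p[i] *= 2.0f;
}
#else
// Highway ops for all SIMD targets (hn = hwy::HWY_NAMESPACE).
void Transform(float* HWY_RESTRICT p, size_t n) {
  const hn::ScalableTag<float> d;
  for (size_t i = 0; i < n; i += hn::Lanes(d)) {
    const auto v = hn::Load(d, p + i);
    hn::Store(hn::Add(v, v), d, p + i);
  }
}
#endif
```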

> EMU128 is better than the deficient HWY_SCALAR, but only useful if you plan to have a SIMD128 implementation

I agree with this. Generally, we'd want to use one of the actual instruction sets (via intrinsics), but just in case there is some compiler bug, or some other temporary reason to disable SIMD, we'd still have something working/testable (though slower).

Jamaika1 commented 2 years ago

> FYI the change is currently rolled back because it exposes a bug in libjxl (using x86-specific instructions when it shouldn't). We'll patch that first, then get the EMU128 change in again.

https://github.com/libjxl/libjxl/commit/98d25826edc1d93284ba36928b96c0f13968e940

malaterre commented 2 years ago

FYI, EMU128 does not work well with gcc-11 and below on armhf. ref:

jan-wassenberg commented 2 years ago

Thanks, we can avoid choosing EMU128 as the default for those GCC versions; sending a patch soon.

jan-wassenberg commented 2 years ago

Looks like we'll be keeping HWY_SCALAR for some time yet because it is sometimes used to work around compiler bugs.

malaterre commented 2 years ago

@jan-wassenberg are you closing this because of:

jan-wassenberg commented 2 years ago

@malaterre This issue was mainly intended to answer the question: can we deprecate/remove HWY_SCALAR soon. It looks like the answer is (unfortunately) no, hence I closed it. Would you like to keep this open to track the compiler bug you reported?

malaterre commented 2 years ago

I think it would make sense to keep one "meta" bug for EMU128 transition and the current compiler issues:

I'll revert the Debian package to SCALAR to match upstream (the highway Debian package exists mostly for the libjxl dependency).