Performance Warning Support

seanptmaher commented 4 years ago

Hello,

First off, I've been looking over the code that's been written so far, and it's very solid. Thank you all for your work on the library.

I've had some concerns expressed to me over performance-related issues in SIMDe, where due to the way that SIMDe is built, it will fallback to scalar operations on operations not supported on the target platform, but this isn't the behavior that some people would be looking for.

How feasible would it be to implement warnings/errors when trying to compile instructions that would fallback to scalar operations on certain platforms? I assume this would be opt-in via #define.

Having code silently run a lot slower perhaps than expected is not ideal for some applications.

I'm happy to work on this myself, given a slight nudge in the right direction in terms of long-term vision for the project, it's possible this is out of scope of what you'd like to implement.

Somewhere to look for a user with a similar problem is here, where application performance is more important than code coverage. Because a lot of the work done here is duplicated in SIMDe, I feel that adding this support might be better than the alternative.

Thanks

(CC @ngzhian @tlively)

nemequ commented 4 years ago

The best thing I can think of would be to use HEDLEY_WARNING in simde-common.h to do something like:

#if defined(SIMDE_WARN_ON_PORTABLE_FALLBACK)
  #define SIMDE_PORTABLE_FALLBACK_REACHED
#else
  #define SIMDE_PORTABLE_FALLBACK_REACHED HEDLEY_WARNING("Portable fallback reached, performance may not be optimal")
#endif

(I would be open to a different name). Then, every time you hit a portable fallback in the code you would have to add SIMDE_PORTABLE_FALLBACK_REACHED.

While I'm sympathetic to the idea, the reason I haven't done this already is that a lot of the portable fallbacks can be optimized by the compiler every bit as well as by a human, as you can see by defining SIMDE_NO_NATIVE.

It's also worth noting that we use various standard and compiler-specific extensions to help the optimizer here; we use GCC-style vector extensions when possible, builtins such as __builtin_shuffle/__builtin_shufflevector and __builtin_convertvector. That's actually how some compilers implement the "native" functions anyways. I'm not really sure it would be fair to classify those as portable fallbacks.

Even if we have to fall back on a loop, almost always have a SIMDE_VECTORIZE on the loop which will use OpenMP SIMD, Cilk Plus, GCC loop-specific pragmas, and clang's pragma loop. Again, with those the compiler can often generate something just as good as a human.

There are also functions which really can't be implemented using other SIMD instruction sets and the best option really is to just fall back on serial code. In that case, there isn't always a way to work around the issue, so now you have a warning that always gets tripped but you can't do anything other than turn off that warning; you certainly can't rely on it to pick up any sub-optimal code in CI.

While I would agree that something like this could be useful in theory, I think there would be too many false positives for it to be very useful in practice.

I'm open to being persuaded on this, especially if someone has a better idea about how to handle the issues I mentioned above. Even if not, if @ngzhian and/or @tlively feel strongly about it I'd probably be willing accept a patch.

tlively commented 4 years ago

There are also functions which really can't be implemented using other SIMD instruction sets and the best option really is to just fall back on serial code. In that case, there isn't always a way to work around the issue, so now you have a warning that always gets tripped but you can't do anything other than turn off that warning; you certainly can't rely on it to pick up any sub-optimal code in CI.

This is the case where performance warnings would be most useful in my opinion. Our users want the easiest porting experience possible, but they are also willing to do some work for large performance wins. For example, they don't want to port their entire application to use WebAssembly intrinsics, but they would be willing to rewrite a few hot functions to use algorithms that can be more efficiently lowered to the available WebAssembly SIMD instructions. Performance warnings could help them identify those few functions that need restructuring.

For the other cases you mentioned I agree that if the compiler can generate efficient code, a warning would not be useful.

nemequ commented 4 years ago

That's fair, but wouldn't profiling be a much better solution for that? I haven't actually looked into the WASM situation in too much depth, are there good tools for profiling?

I can see a warning that says "you're using a function that's likely slow on WASM SIMD" being useful in that situation, as long as the criteria isn't just be any time we hit a portable fallback. I'm having trouble coming up with an objective criteria we could use, but even if it's completely subjective it could still be useful as long as people understand that it's not really authoritative and they should really be profiling. If someone wants to manually mark functions that are likely to be slow for a specific architecture I wouldn't mind carrying them in SIMDe

If there isn't an way to override the warning for a specific call I think it would lose a lot of its utility. On clang I could throw together a macro that disables the warnings generated by _Pragma("GCC warning ...") so you could just do something like SIMDE_DISABLE_PERFORMANCE_WARNINGS(expr), but AFAIK GCC doesn't have anything like -Wno-#pragma-messages, so it may have to be a clang-only feature. That's probably not a problem for the WASM target, though it would be nice to have a generic mechanism if possible.

A small warning: there are ~ 6,000 functions in Intel's SIMD APIs. I think about 2,000 have been implemented in SIMDe so far.

tlively commented 4 years ago

That's fair, but wouldn't profiling be a much better solution for that? I haven't actually looked into the WASM situation in too much depth, are there good tools for profiling?

Yes, profiling is definitely better, but it's not easier ;) This is a good chance as tool authors to be as helpful to our users as possible at compile time. There are some tools for profiling WebAssembly like Chrome's built-in profiler, but I'm not sure how fine-grained they get. Perhaps @ngzhian knows?

I can see a warning that says "you're using a function that's likely slow on WASM SIMD" being useful in that situation, as long as the criteria isn't just be any time we hit a portable fallback.

I agree, and I think the criteria should be based on actual (micro?)benchmarking of the instruction sequences emitted by the toolchain and engine. We can do some benchmarking and profiling to make the first 80% of the work easy for our users and they can do additional profiling to finish the job themselves.

If there isn't an way to override the warning for a specific call I think it would lose a lot of its utility. On clang I could throw together a macro that disables the warnings generated by _Pragma("GCC warning ...") so you could just do something like SIMDE_DISABLE_PERFORMANCE_WARNINGS(expr), but AFAIK GCC doesn't have anything like -Wno-#pragma-messages, so it may have to be a clang-only feature. That's probably not a problem for the WASM target, though it would be nice to have a generic mechanism if possible.

Agreed 👍

A small warning: there are ~ 6,000 functions in Intel's SIMD APIs. I think about 2,000 have been implemented in SIMDe so far.

😰

ngzhian commented 4 years ago

I don't think tooling is there right now, there are plans to make it much better, but it will take time.

I think we are all in alignment in that we want to be helpful, without being, uh, noisy :). Given our status, we should probably prioritize implementation of the headers. As we get more data points (issues from users?), we can prioritize what we are discussing here. Wdyt?

nemequ commented 4 years ago

That sounds reasonable, but I can at least add the infrastructure to SIMDe now so we can start annotating functions as we notice them even if finding slow functions isn't the priority.

I was thinking about it a bit more, and I think instead of a warning pragma it would be better to use a function attribute. GCC's warning would be perfect but clang doesn't support it. We could abuse the deprecated attribute, though. I'm thinking something like (untested):

#define SIMDE_SLOW_FUNCTION_IMPL \
    __attribute__((__deprecated__("possibly slow function called")))

#if \
    defined(SIMDE_WASM_SLOW_FUNCTION_WARNINGS) || \
    ( defined(SIMDE_SLOW_FUNCTION_WARNINGS) && \
      defined(SIMDE_WASM_SIMD128_NATIVE) )
  #define SIMDE_WASM_SLOW_FUNCTION SIMDE_SLOW_FUNCTION_IMPL
#else
  #define SIMDE_WASM_SLOW_FUNCTION
#endif

#define SIMDE_SLOW_FUNCTION(expr) (__extension__({ \
    _Pragma("clang diagnostic push") \
    _Pragma("clang diagnostic ignored \"-Wdeprecated-declarations\"") \
    __typeof__(expr) simde_slow_function_val_ = (expr); \
    _Pragma("clang diagnostic pop") \
    simde_slow_function_val_; \
  }))

Then, just add SIMDE_WASM_SLOW_FUNCTION to the function you want to warn about. Definining SIMDE_SLOW_FUNCTION_WARNINGS would cause warnings for whatever extension is currently being targeted (e.g., WASM) to be emitted. Alternatively, you could just define SIMDE_WASM_SLOW_FUNCTION_WARNINGS and you get the warnings for WASM regardless of SIMDe is currently targeting (useful if you're planning on targeting WASM eventually but doing the development targeting x86 so you can use all the x86 tooling). To call a slow function but ignore the error, SIMDE_SLOW_FUNCTION(simde_mm_mask_compress_epi32(src, k, a)).

I'm definitely open to other names.

Of course getting it to work on other compilers would require a bit more work, but nothing too difficult. Support for some sort of deprecated annotation is pretty good, and most compilers have a way to disable the warning in code, so if we abuse the deprecated attribute I think it should work pretty much everywhere except MSVC in C or C++ < C++11 mode; they have the deprecated attribute, but not statement exprs. I think lambdas could be usable in C++.

How does that sound to everyone?

simd-everywhere / simde

Performance Warning Support #310