kokkos / simd-math

Library for length-agnostic SIMD intrinsic support and the corresponding math operations

Identify features we want that are not in ISO #2

Open ibaned opened 5 years ago

ibaned commented 5 years ago
nmhamster commented 5 years ago

@ibaned / @alanhumphrey - in almost every compiler I've used with intrinsics, a multiply followed by an add intrinsic is converted to an FMA (if the compiler has FMA enabled). I would like us to avoid using lots of fancy (but unnecessarily complex) C++ to achieve what a minimal peephole optimizer can do.
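
For reference, a minimal sketch (not from this repository) of the pattern described here: a plain multiply followed by an add, which compilers will typically contract into a fused multiply-add when FMA is available, with no intrinsics involved.

// Minimal sketch, not from this repo: with FMA enabled (e.g. -mfma or an -march
// that implies it) and floating-point contraction allowed, mainstream compilers
// emit a fused multiply-add for this loop without any intrinsics.
void axpy(float a, const float* x, const float* y, float* out, int n) {
  for (int i = 0; i < n; ++i) {
    out[i] = a * x[i] + y[i];
  }
}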

nmhamster commented 5 years ago

Also, I wasn't sure why we still hadn't evaluated the use of the vector_size attribute in GNU and Clang (https://clang.llvm.org/docs/LanguageExtensions.html). It seems we can build generic vector interfaces that apply across platforms without needing intrinsic functions. I realize this is a C API and needs a C++ wrapper, but the compiler should produce well-optimized code for this kind of attribute, perhaps better than intrinsics in some cases if prefetches and other optimization phases fire (where intrinsics block them).
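
For readers unfamiliar with the extension, a minimal sketch of what the vector_size attribute looks like; the alias name and width are illustrative, not anything defined in this repository.

// Minimal sketch of the GNU/Clang vector_size extension; the alias and width
// are illustrative. Arithmetic is element-wise and the compiler picks the
// target instructions (SSE/AVX/NEON/...) on its own.
typedef float vfloat8 __attribute__((vector_size(32)));  // 8 floats in 32 bytes

vfloat8 fma_like(vfloat8 a, vfloat8 b, vfloat8 c) {
  return a * b + c;  // no intrinsics needed
}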

ibaned commented 5 years ago

@nmhamster I personally was unaware of that. The way the ISO interface works, we have template specializations called "ABI"s (a bit of a misnomer). Some of those "ABI"s call intrinsics directly, but I also implemented one where the data type was just float[4] and the loops had pragma omp simd in front of them. That actually gave decent (but not as good) speedup. I suppose we can do the same thing for this approach: create an "ABI" specialization that implements things this way, and compare its performance. I do worry a little bit about the gaps in support in the table on that page, especially the lack of support for boolean operators.
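
A minimal sketch of the kind of specialization described here, assuming array-backed storage and omp simd loops; the names are hypothetical, not the library's actual ABI tags.

// Minimal sketch with hypothetical names: a pack backed by a plain float[4],
// with the per-lane loop left to the compiler via omp simd (compile with
// -fopenmp-simd or equivalent for the pragma to take effect).
struct simd_float4 {
  float v[4];
};

inline simd_float4 operator+(simd_float4 a, simd_float4 b) {
  simd_float4 r;
  #pragma omp simd
  for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
  return r;
}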

nmhamster commented 5 years ago

@ibaned - right, I was thinking the same thing. I am interested to see what performance we get. The GCC vector attributes do support boolean operators as expr ? true-value : false-value (a little like your vector choose function). Producing masking behavior without true mask support in hardware is quite hard, because in the end you usually have to rely on AND operations to get something similar (which I'm sure you know, having written things for SSE in the library).
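
A hedged sketch of both shapes mentioned here: element-wise selection via the GCC/Clang vector extension's ?:, and the AND/ANDNOT blend one typically writes by hand for SSE. Type names are hypothetical and support for the vector ?: depends on the compiler version.

// Illustrative only; names are hypothetical.
#include <xmmintrin.h>

typedef int   mask4   __attribute__((vector_size(16)));
typedef float vfloat4 __attribute__((vector_size(16)));

vfloat4 choose_ext(mask4 cond, vfloat4 t, vfloat4 f) {
  return cond ? t : f;  // per-lane select, compiler-generated
}

__m128 choose_sse(__m128 mask, __m128 t, __m128 f) {
  // mask lanes must be all-ones (select t) or all-zeros (select f)
  return _mm_or_ps(_mm_and_ps(mask, t), _mm_andnot_ps(mask, f));
}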

DavidPoliakoff commented 5 years ago

I would really recommend having a test suite, looping in James Elliott, and tracking performance across compilers. At LLNL we were toying with these kinds of libraries as I left, and it felt like every month we'd find out that such-and-such a compiler suddenly wasn't optimizing such-and-such a mechanism well anymore. If we have this work and a guide saying which compilers do better with which ABIs, we're in a good place.

nmhamster commented 5 years ago

Rather than just relying on profiling, I think we first need to actually take a good look at the code being generated and some of the compiler output. A human in the loop during development is essential to understanding why the compiler behaves as it does. Once we have that settled down a little more, I think the transition to profile-based ongoing assessment will be useful. I am particularly interested in whether we actually execute the vectorized code even when we generate it, since Intel in particular makes some interesting runtime choices which sometimes make this not the case. In short, we should do some homework here as a preliminary step.

alanphumphrey commented 5 years ago

@DavidPoliakoff - Agreed on the test suite, etc. We talked at some length about this Friday. Also agree with @nmhamster on having a human in the loop initially, doing our homework, e.g., seeing what code is generated and whether we actually execute that vectorized code.

I will transition fully to this effort early next week (0.60 FTE is my SNL contract), and can stay on it for the necessary duration.

Thanks @ibaned for getting this conversation started.

ibaned commented 5 years ago

Since this issue was originally about missing pieces of the ISO interface, I'm going to answer the question that @alanphumphrey asked in the other issue because it fits better here.

My thinking is that we should try to propose changes to the ISO interface, especially where we see that, without them, it cannot be as fast as hand-coding or is very inconvenient to use. So far I think there are three changes we can consider individually:

  1. make many <cmath> functions like sin and cos work with simd types (a rough sketch follows this list)
  2. provide a conditional operator of some kind
  3. provide a scatter/gather interface, assuming there are special intrinsics for this and it is not just loading scalars (the stk_simd strided load is just loading scalars)
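
As a rough illustration of item 1 (the pack type and width are hypothetical, not the proposed ISO interface), an element-wise sin overload might look like this:

#include <cmath>

// Hypothetical 4-wide pack using the GNU/Clang vector extension.
typedef float vfloat4 __attribute__((vector_size(16)));

inline vfloat4 sin(vfloat4 x) {
  vfloat4 r;
  for (int i = 0; i < 4; ++i) r[i] = std::sin(x[i]);  // element-wise fallback
  return r;
}
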
ibaned commented 4 years ago

@nmhamster I have some early data on different high-level approaches using a full and very non-trivial Sandia application built using Clang on Mac:

  1. using float[8] with pragma clang loop vectorize(enable): 20 seconds
  2. using float __attribute__((vector_size(32))): 15 seconds
  3. using direct AVX intrinsics: 10 seconds

It seems like calling vendor-specific intrinsics can still be way better in many cases.
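
For concreteness, a hedged sketch (not taken from the measured application) of the same multiply-add written with the vector_size extension versus direct AVX intrinsics:

// Illustrative only; requires an AVX-capable target (e.g. -mavx) to compile
// the intrinsic version.
#include <immintrin.h>

typedef float vfloat8 __attribute__((vector_size(32)));

// Vector-extension version: the compiler chooses the instructions.
vfloat8 madd_ext(vfloat8 a, vfloat8 x, vfloat8 y) {
  return a * x + y;
}

// Direct AVX version: instruction selection is explicit.
__m256 madd_avx(__m256 a, __m256 x, __m256 y) {
  return _mm256_add_ps(_mm256_mul_ps(a, x), y);
}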

mhoemmen commented 4 years ago

@ibaned wrote:

2. provide a conditional operator of some kind

Matthias Kretz had a proposal to permit overloading the ternary operator.

ibaned commented 4 years ago

@mhoemmen awesome! We just chose a random name for it and use it as a function (choose(cond, tv, fv)). It should be as easy as moving its implementation once the ternary operator can be overloaded.

mhoemmen commented 4 years ago

@ibaned It's still just a proposal :-) Not sure how long it will take to get through.

ibaned commented 4 years ago

Yep won't hold my breath :)

ibaned commented 4 years ago

In the spirit of recording things that we might want to create ISO C++ papers about, @alanw0 and the STK team identified that an equivalent of std::copysign is useful, and also this multiplysign:

// scales a by +1 or -1 according to the sign of b
template <class T>
T multiplysign(T a, T b) {
  return a * copysign(1.0, b);
}
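
A hedged sketch of what an element-wise multiplysign could look like for a small fixed-width pack; pack4 here is an illustrative stand-in, not an stk_simd or simd-math type.

#include <cmath>

struct pack4 { float v[4]; };  // hypothetical 4-wide pack

inline pack4 multiplysign(pack4 a, pack4 b) {
  pack4 r;
  for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * std::copysign(1.0f, b.v[i]);
  return r;
}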