ibaned opened 5 years ago
@ibaned / @alanhumphrey - in almost every compiler I've used with intrinsics, a multiply followed by an add intrinsic is converted to an FMA (if the compiler has FMA enabled). I would like us to avoid using lots of fancy (but unnecessarily complex) C++ to achieve what a minimal peephole optimizer can do.
Also, I wasn't sure why we still hadn't evaluated the use of the vector_size attribute in GNU and Clang (https://clang.llvm.org/docs/LanguageExtensions.html). It seems we can build generic vector interfaces, without the need for intrinsic functions, that would apply across platforms. I realize this is a C API and needs a C++ wrapper, but the compiler should produce pretty well-optimized code for this kind of attribute, perhaps better than intrinsics in some cases if prefetches and other optimization phases fire (where intrinsics block them).
@nmhamster I personally was unaware of that. The way the ISO interface works, we have template specializations called "ABI"s (a bit of a misnomer). Some of those "ABI"s call intrinsics directly, but I also implemented one where the data type was just `float[4]` and the loops have `#pragma omp simd` in front of them. That actually gave decent (but not as good) speedup. I suppose we can do the same thing for this approach: create an "ABI" specialization that implements things this way, and compare its performance. I do worry a little bit about the gaps in support in the table on that page, especially the lack of support for boolean operators.
@ibaned - right, I was thinking the same thing. I am interested to see what performance we get. The GCC vector attributes do support boolean operators as `expr ? true-value : false-value` (a little like your vector `choose` function). Producing masking behavior without true mask support in hardware is quite hard, because in the end you usually have to rely on AND operations to get something similar (which I'm sure you know, having written things for SSE in the library).
I would really recommend having a test suite, looping in James Elliott, and tracking performance across compilers. At LLNL we were toying with these kinds of libraries as I left, and it felt like every month we'd find out that such-and-such a compiler suddenly wasn't optimizing such-and-such a mechanism well anymore. If we have this work and a guide saying which compilers do better with which ABI's, we're in a good place.
Rather than just relying on profiling, I think we first need to actually take a good look, first-hand, at the code being generated and some of the compiler output. A human in the loop during development is essential to understanding why the compiler behaves as it does. Once we have that settled down a little more, I think the transition to profile-based ongoing assessment will be useful. I am particularly interested in whether we actually execute the vectorized code even when we generate it, since Intel in particular makes some interesting runtime choices that sometimes cause this not to be the case. In short, we should do some homework here as a preliminary step.
@DavidPoliakoff - Agreed on the test suite, etc. We talked at some length about this Friday. Also agree with @nmhamster on having a human in the loop initially, doing our homework, e.g., seeing what code is generated and whether we actually execute that vectorized code.
I will transition fully to this effort early next week (0.60 FTE is my SNL contract), and can stay on it for the necessary duration.
Thanks @ibaned for getting this conversation started.
Since this issue was originally about missing pieces to the ISO interface, I'm going to answer the question that @alanphumphrey asked in the other issue because it fits better here.
My thinking is that we should try to propose changes to the ISO interface, especially where we see that it cannot be as fast as hand-coding without those changes or it is super inconvenient without them. So far I think there are three changes we can think about individually:
1. make `<cmath>` functions like `sin` and `cos` work with `simd` types
2. provide a conditional operator of some kind
3. strided loads (`stk_simd` strided load is just loading scalars)

@nmhamster I have some early data on different high-level approaches using a full and very non-trivial Sandia application built using Clang on Mac:

- `float[8]` with `#pragma clang loop vectorize(enable)`: 20 seconds
- `float __attribute__((vector_size(32)))`: 15 seconds

It seems like calling vendor-specific intrinsics can still be way better in many cases.
@ibaned wrote:
> 2. provide a conditional operator of some kind
Matthias Kretz had a proposal to permit overloading the ternary operator.
@mhoemmen awesome!
We just chose a random name for it and use it as a function (`choose(cond, tv, fv)`). It should be as easy as moving its implementation once the ternary operator can be overloaded.
@ibaned It's still just a proposal :-) Not sure how long it will take to get through.
Yep won't hold my breath :)
In the spirit of recording things that we might want to create ISO C++ papers about, @alanw0 and the STK team identified that an equivalent of `std::copysign` is useful, and also this `multiplysign`:

```cpp
T multiplysign(T a, T b) {
  return a * copysign(1.0, b);
}
```
- a conditional operation (`if_then_else` in `stk_simd`, `choose` in prototype)
- `cmath` functions (the ISO paper doesn't mention these, I think...)