google / highway

Performance-portable, length-agnostic SIMD with runtime dispatch
Apache License 2.0
4.25k stars 322 forks

coding a replacement for Vc::SimdArray with highway #641

Closed kfjahnke closed 2 years ago

kfjahnke commented 2 years ago

I have coded an arithmetic template class mimicking Vc::SimdArray to allow a tentative port of my application lux from Vc to highway. The stand-in type is not complete, but it provides a substantial subset of Vc::SimdArray - sufficient to cover the functionality which lux uses: all operators, many mathematical functions, load, store, gather and scatter, plus a few bits and bobs. It also provides masked operation, including the extremely helpful Vc idiom V(mask) op= ...

This is the strategy I used: lux can fall back to its 'own' simd_type, which also mimics Vc::SimdArray and provides the functionality using 'goading'. So this type provides the SIMD functionality I need for lux, and I used it to implement the highway equivalent, replacing the small loops over the internal POD C array with vectorized loops using highway, falling back to the scalar loops where I could not find a suitable implementation or where vectorization felt unnecessary. The resulting type slots in, and I could build my application using it for all SIMD arithmetic. I haven't yet coded de/interleaving to match Vc's InterleavedMemoryWrapper, but even without that, the code performs just about as well as the code using Vc. This is definitely much better than the 'goading' implementation, and makes me hope that a port to highway can be done without sacrificing performance. I felt the port was becoming necessary because Vc support for newer ISAs seems to be lessening.

A downside of my strategy is using a memory-backed type, which has to rely on the compiler to 'see' opportunities to avoid moving data from registers to memory and back. The positive aspect of a memory-backed type is that it is a 'normal' object even though it may use sizeless vectors internally, so it can be put into containers etc. I haven't yet managed to test the code on a 'sizeless' platform, but this is my main reason for using highway: so that I can port lux to use SVE.
I'd appreciate it if someone with access to, e.g., an M1 would try the code on such a machine! The template class definition is here - this is still an early version, but since it has compiled and run correctly inside my application, I'm quite confident it's not too buggy. It's not tweaked very much for speed yet, though.
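To illustrate the approach described above, here is a minimal sketch of a memory-backed, SimdArray-like wrapper using 'goading' only - plain loops over a POD array which the compiler may autovectorize. The names and the member set are made up for the example; the real class (in the linked repo) additionally routes operations through highway:

```cpp
#include <cstddef>

// Minimal sketch of a memory-backed, Vc::SimdArray-like type using
// 'goading': small loops over a POD C array. Because it is backed by
// ordinary memory, it is a 'normal' object and container-friendly,
// even if a real implementation uses sizeless vectors internally.
template <typename T, std::size_t N>
struct simd_type {
  alignas(64) T data[N];

  simd_type& operator+=(const simd_type& rhs) {
    for (std::size_t i = 0; i < N; ++i) data[i] += rhs.data[i];
    return *this;
  }
  // Vc-style masked assignment: only lanes where mask[i] is set change,
  // mimicking the V(mask) op= ... idiom in scalar form.
  void add_where(const bool (&mask)[N], const simd_type& rhs) {
    for (std::size_t i = 0; i < N; ++i)
      if (mask[i]) data[i] += rhs.data[i];
  }
};
```

In the real class, the loop bodies are replaced by highway vector ops where available, with the scalar loop kept as the fallback.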

jan-wassenberg commented 2 years ago

This sounds like good progress! It's nice to hear that the memory-backed approach is working well so far :)

An M1 still only supports Neon, but Amazon's c7g instances are Graviton3, aka ARM V1, with 256-bit SVE. You could request access to those to allow benchmarking there.

kfjahnke commented 2 years ago

This sounds like good progress

Took a week, but now it's up and running.

It's nice to hear that the memory-backed approach is working well so far

Initially I compared it to the approach using references to stack-allocated vectors, and to plain aggregates of sized vectors, and it came out on top, so I went ahead with it. It's also very convenient to have the fallback to goading close by where vector code would be awkward or need lots of effort. Using this new type is also a good way to roll out stuff in a generic way, and I think that an approach where you can effortlessly interoperate a single vector of small T with several vectors of large T is better than using half- or quarter-sized vectors for the smaller T to get interoperability. And, again, I noticed that running my application with a lane count of several times the hardware vector width did perform better.

An M1 still only supports Neon

Oops - I thought that was SVE already... A colleague of mine has managed a port to M1 with the goading variant, which is already quite fast, so I'll now pester him to try out the new variant. I don't have Apple silicon myself... As you can see, my code is liberally licensed (should be compatible with yours), and it's already written so that it should fit into the highway cosmos and its roll-out-to-multiple-targets strategy. How about tentatively pulling it into highway and dragging it through your CI/CD? That should show up problems coming from sizeless vectors which I can't figure out here. But thanks for the link anyway, I may consider it. You may even like using my code; it's very convenient, especially to get something going quickly, or to port from Vc. Since it's an arithmetic type, oftentimes you can even just write down a formula and have it 'magically' vectorized. A bit like FORTRAN ;)

jan-wassenberg commented 2 years ago

I noticed that running my application with a lane count of several times the hardware vector width did perform better.

Good to have another data point. I also find that unrolling is usually still helpful.

Sure, would be happy to run this through the CI/CD; we might require several iterations in case the various platforms/compilers have some quirks. After that I think we can add this to the contrib/ folder.

Some initial quick comments:

1. I think you can unconditionally use `hn::CappedTag<value_type, vsize>`, that should work for both scalable and fixed vectors.
2. Line 159: can use `Mask` instead of `decltype(Eq)`.
3. I think we also need `alignas` on the `simd_type`, not just its members. And also make sure that no one tries `new simd_type`, that won't be aligned.
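The third point can be illustrated in plain C++ (the 64-byte alignment and the member layout are made up for the example): struct-level `alignas` covers stack and static instances, but heap allocation via plain `new` only guarantees `alignof(std::max_align_t)` before C++17, so `new simd_type` may be under-aligned there:

```cpp
#include <cstddef>

// Over-aligned type: alignas on the struct itself, not just its members.
struct alignas(64) simd_type {
  float data[16];
};

static_assert(alignof(simd_type) == 64, "struct-level alignas applies");

// Helper to check alignment of a pointer at runtime.
bool is_aligned(const void* p, std::size_t a) {
  return reinterpret_cast<std::size_t>(p) % a == 0;
}
// Stack/static instances are aligned by the compiler. The pitfall is
// `new simd_type`: before C++17 the extended alignment is ignored;
// since C++17, the aligned operator new overload is selected.
```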

my code is liberally licensed (should be compatible with yours)

That's helpful. To avoid confusion, would you mind re-licensing or dual-licensing as Apache? That would avoid any license-related concern or confusion.

kfjahnke commented 2 years ago

Sure, would be happy to run this through the CI/CD, we might require several iterations just in case the various platforms/compilers have some quirks. After that I think we can add this to the contrib/ folder

Sounds good to me. And may help other people trying to port from Vc. If you find any quirks, let me know, I'm happy to modify the code to become as portable as possible. Thanks for your hints:

I think you can unconditionally use hn::CappedTag < value_type , vsize >

That does work - I deliberately made the distinction to make sure that using ScalableTag would work all right with cases where the hardware lane count is less than the _vsize passed as template arg, but maybe I'm just being paranoid and it comes down to the same thing.

line 159: can use Mask instead of decltype(Eq)

Okay, figured it out: `typedef hn::Mask<D> vmask_type;` - I did not find that in the docs.

I think we also need alignas on the simd_type, not just its members.

I added HWY_ALIGN to the struct definitions, like struct HWY_ALIGN simd_type. That should do the trick, right? I pushed the modified code to my bitbucket repo just now.

jan-wassenberg commented 2 years ago

Sounds good!

maybe I'm just being paranoid and it comes down to the same thing.

Yes indeed, it is intended to be the same and nice to minimize #if.

Thanks, right, Mask wasn't in the quick_reference, adding that shortly.

kfjahnke commented 2 years ago

Yes indeed, it is intended to be the same and nice to minimize #if

The Lanes(D()) you get from CappedTag isn't a constexpr, though - couldn't that be made possible with CappedTag? I just pushed a commit to enforce a power-of-two vsize for hwy_simd_type, but that might be relaxed to a multiple of the hardware lane count - which I can't seem to get as a constexpr.

jan-wassenberg commented 2 years ago

There is a MaxLanes which is constexpr, but I'm not sure we need it. You can skip the power of two check because the Simd<> template already ensures that, and CappedTag<T, N> is just Simd<T, min(N, unspecified), 0>.

kfjahnke commented 2 years ago

Okay, I've thrown the power-of-two check out. simd_type can actually run with a multiple of the hardware lane count which doesn't have to be a power of two; if the underlying hwy vector type enforces a power-of-two lane count, that's all right. I've also thrown out the allocator-related code which was misplaced in hwy_simd_type.h, so now the header can be used without any references to vspline or vigra.

I started testing the code, and for my AVX2 processor, compiling with -mavx2 -march=haswell -maes I do get proper AVX2 binary, which is only a bit slower than the Vc equivalent - likely because it still needs some tweaking, like a better atan2 implementation (I'll try IsInf and IsNaN), and lacks an equivalent of Vc::InterleavedMemoryWrapper. But I can't figure out suitable compiler flags for other ISAs. I realize that highway does not support AVX, but can you recommend the flags I should use with clang++ to get SSE4.2 and AVX512f code? I tried to pick the flags out of set_macros_inl.h, but I could not get it right.

I've further commodified the use of vectorization in lux, and now I can even build 'mixed' binaries, where different ISAs can be implemented with different SIMD libraries - a benefit of using my own dispatching mechanism, which works at a high level of the code. So far, my highway wrapper is never faster than the Vc variants, but I anticipate that on platforms which Vc does not support, highway will be the better option. With the choice of SIMD library being merely a matter of editing CMake options, comparisons become easier - I have four choices now: Vc, highway, goading and std::simd. Of course the performance measurements I get also reflect my - possibly inept - use of the libraries, so they have to be taken with a pinch of salt.

BTW - just rebuilding from a fresh pull, I get bombarded with warnings à la highway/hwy/ops/emu128-inl.h:134:11: warning: conversion from ‘int’ to ‘signed char’ may change value [-Wconversion] - the build runs through, though.

jan-wassenberg commented 2 years ago

Nice :)

IsInf and IsNaN

FYI the code is done but still under review. You could already patch in #645 if you like.

can you recommend the flags I should use with clang++

Sure, they are listed in comments here: https://gcc.godbolt.org/z/rGnjMevKG

now I can even build 'mixed' binaries, where different ISAs can be implemented with different SIMD libraries

Ooh, nice, this is interesting for benchmarking.

Thanks, I've fixed the warning.

kfjahnke commented 2 years ago

Sure, they are listed in comments here: https://gcc.godbolt.org/z/rGnjMevKG

Thanks! So I did get it right after all, only the outcome was not what I expected. The binary I get using the flags for SSE4.2 takes about 28 msec per test frame vs. 18 with the Vc version. So the missing bits in my hwy port seem to have a much larger impact with that ISA than with AVX2, where it's 8.4 vs. 7. I can't test AVX512 here, but I looked at the assembler output with -march=skx and it has the 512-bit wide ops all right.

I mentioned Vc::InterleavedMemoryWrapper, which is a nice bit of code, if a bit hard to use - hwy has StoreInterleaved3 and StoreInterleaved4, but the corresponding Load routines would also be helpful, and having both for Interleaved2 would make the set complete (useful for 2D coordinates in imaging, e.g. for remap routines; I realize that you have InterleaveLower and InterleaveUpper). Now I'll stop nagging you ;)

jan-wassenberg commented 2 years ago

Interesting, let's see if there is still a difference after the IsNaN.

Sure, LoadInterleaved and StoreInterleaved2 are on the todo list. Would you be using them soon-ish? If so, I can increase the priority on them :)

kfjahnke commented 2 years ago

Interesting, let's see if there is still a difference after the IsNaN.

I intend to set up member functions in class simd_type using IsNaN, IsFinite and CopySign to provide the same interface as Vc::SimdArray. Then I want to simply take over the Vc implementation of atan2, which makes ample use of the V(M) op= ... idiom. I had a go at 'translating' the Vc code to highway, but it's hard work and error-prone because the logic is quite different. Take, for example, this bit of code from the Vc implementation, where xInf and yInf are masks:

    // if both inputs are inf the output is +/- (3)pi/4
    a(xInf && yInf) += copysign(C::_pi_4(), x ^ ~y);

You can see how the syntax facilitates coding complex operations in a human-readable form. The equivalent in highway is quite a mouthful.
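For reference, here is what that Vc line means per lane, written out as a scalar sketch (pi/4, the lane count and the bitwise sign extraction are spelled out explicitly; in highway one would express the same thing on whole vectors with IfThenElse, CopySign and Xor):

```cpp
#include <cstdint>
#include <cstring>

// Scalar semantics of:  a(xInf && yInf) += copysign(C::_pi_4(), x ^ ~y);
// Per lane: where both masks are set, add pi/4 with the sign taken from
// the bitwise XOR of x with the complement of y (a bitwise op on the
// float representations, as Vc does it).
void masked_add_pi4(float* a, const float* x, const float* y,
                    const bool* xInf, const bool* yInf, int lanes) {
  const float pi_4 = 0.78539816339f;
  for (int i = 0; i < lanes; ++i) {
    if (!(xInf[i] && yInf[i])) continue;
    uint32_t bx, by;
    std::memcpy(&bx, &x[i], 4);
    std::memcpy(&by, &y[i], 4);
    // sign bit of (x ^ ~y): set iff the sign bits of x and ~y differ
    const uint32_t sign = (bx ^ ~by) & 0x80000000u;
    a[i] += sign ? -pi_4 : pi_4;
  }
}
```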

Sure, LoadInterleaved and StoreInterleaved2 are on the todo list. Would you be using them soon-ish? If so, I can increase the priority on them :)

I would try to use them as soon as they are available, but what I'd really like to see is a generalized version like Vc::InterleavedMemoryWrapper, where the channel count can be set with a template argument. This would facilitate generic programming. As it stands, I use goading for now, except for the AVX2 target, where I have managed to get gather/scatter to run correctly with highway, and I can special-case channel counts 3 and 4 to the corresponding specialized hwy routines - and would have to write yet more special-case code if I wanted to use 2-channel data. If I had a uniform interface both ways, this would make things much easier on my side, and I'm sure other users would agree (anybody out there?). The difficult bit is to generalize the code to handle arbitrary channel counts (or, at least, up to some sensible limit) and to orchestrate the shuffling operations for maximum efficiency.
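A generic interface of the kind described might look like the following sketch, with scalar loops as the catch-all (the names 'bunch' and 'fluff' are taken from the later comments; a tuned version would specialize N == 2, 3, 4 to the library's LoadInterleavedN / StoreInterleavedN and keep these loops as the fallback):

```cpp
#include <cstddef>

// 'bunch': deinterleave N-channel 'xel' data into per-channel arrays.
// The channel count N is a template argument, so callers can stay generic.
template <typename T, std::size_t N>
void bunch(const T* interleaved, T* (&channels)[N], std::size_t lanes) {
  for (std::size_t i = 0; i < lanes; ++i)
    for (std::size_t ch = 0; ch < N; ++ch)
      channels[ch][i] = interleaved[i * N + ch];
}

// 'fluff': re-interleave per-channel arrays back into xel memory.
template <typename T, std::size_t N>
void fluff(const T* (&channels)[N], T* interleaved, std::size_t lanes) {
  for (std::size_t i = 0; i < lanes; ++i)
    for (std::size_t ch = 0; ch < N; ++ch)
      interleaved[i * N + ch] = channels[ch][i];
}
```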

Please understand where I'm coming from: using hwy in lux is just my test-bed. The heavy lifting is done by my library vspline in the back, which uses generic transform-like routines to apply SIMD code to arbitrarily strided nD arrays of multichannel data. If the lux test-bed tells me that it's worth my while changing things in vspline, I'll eventually move the code over to vspline. I sure don't want a large set of special-case functions to cover 2-, 3- and 4-channel data in my library code, but I'd love to have a viable option concerning the SIMD library, because Vc doesn't seem to be going anywhere and std::simd is lacking half the functionality I need. What I call the 'bunch' and 'fluff' operations - fetching a set of vectors from possibly strided interleaved xel memory and putting it back - is right at the core of vspline's 'wielding' code, and it makes a big difference whether load/shuffle and shuffle/store code is available.

jan-wassenberg commented 2 years ago

Yeah, the operators do help readability and I'd love to enable them, but that requires more work in clang+GCC+MSVC (for SVE).

OK, I'll bump up the priority of LoadInterleaved/StoreInterleaved2 :) I understand about the usefulness of a generic case but perhaps I'm not the right person to write such code because I haven't been thinking about that yet.

kfjahnke commented 2 years ago

I think it's enough for me to have the operators in my simd_type - all the code in my application uses SIMD via this interface which mimics Vc::SimdArray, and inside the class I use whichever functions or operators the SIMD library provides. I would not leak details of SIMD usage to the application proper. The Vc implementation of InterleavedMemoryWrapper is a tough nut; so far I haven't quite understood the code. Vc code uses every trick in the C++11 book - it's full of enable_ifs and the like, which makes for hard going, and comments are few and far between. But, hey, you're the new SIMD guru, I thought you might be able to understand it and provide an equivalent in highway! For the time being I use 'regular' gather/scatter operations ( indexes = k * iota() ), which, at least on AVX2, perform quite well. Even the goading variant, which my code falls back to if it receives something as indexes which isn't quite index_type itself, is quite good - it's one of the loops which autovectorize well.
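The `indexes = k * iota()` scheme can be written out in scalar form (a gather-style deinterleave: each channel of k-channel interleaved data is fetched with stride k; a SIMD gather would fetch a whole vector of these indexes at once):

```cpp
#include <cstddef>

// Gather-style deinterleave: for channel ch of k-channel data, the
// gather indexes are ch + k * iota(), i.e. ch, ch+k, ch+2k, ...
void gather_channel(const float* interleaved, std::size_t k,
                    std::size_t ch, float* out, std::size_t lanes) {
  for (std::size_t i = 0; i < lanes; ++i)  // i plays the role of iota()
    out[i] = interleaved[ch + k * i];
}
```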

kfjahnke commented 2 years ago

I understand about the usefulness of a generic case but perhaps I'm not the right person to write such code because I haven't been thinking about that yet.

I'm now back with the gather/scatter vs. load/shuffle/store issue. While a general solution would be nice, I see that this is probably an unreasonable demand. I set out to understand how Vc does it, and I could not really find anything general. So I did not succeed there. I decided to be content with what hwy offers for now, but the next thing I noticed was that the hwy interleaving code is only good for uint8_t. I had hoped that I could use it for float values - all my rendering code uses single precision float, so that's what I need to de/interleave most of the time. Could you maybe widen the scope?

jan-wassenberg commented 2 years ago

:)

Ah indeed it's u8 only. Sure, can expand to float32 for the moment if that's enough? LoadInterleaved is also coming soon.

kfjahnke commented 2 years ago

You're a star! Yes, float32 would be pretty much the ticket, and if you add LoadInterleaved, and offer both for 2, 3 and 4 channels, that should cover just about all of my rendering code.

kfjahnke commented 2 years ago

Just to cheer you up, some of my rendering benchmarks now run faster with highway as the SIMD backend than they do with my Vc-based implementation. Adding the de/interleaving code may make the final difference!

I hope you don't find this presumptuous, but I'd like to invite you to try my application, lux, as a benchmarking tool. Lux does very nice continuous pans over a fixed number of frames, and when it terminates, it echoes the average number of milliseconds it has used to render a frame. A 1000-frame pan yields an average rendering time which reproduces very well - with minor variations depending on core temperature and fluctuating system load from background tasks. You can set up various different lux builds by configuring with cmake - giving you binaries using goading, Vc, std::simd and highway, readily rolled out to several ISAs (like, on i86: SSSE3, SSE4.2, AVX, AVX2, AVX512f) in a single binary, so you can pick the ISA with a command line parameter. Your input images don't have to be panoramas; you can simply pretend they are - lux will do the geometric transformations regardless. Here's an example using the highway backend on a haswell core i5, first with SSSE3 (-i3), then with AVX2 (-i2). The second one is repeated a few times to show you how well the benchmark reproduces.

> lux -i3 -P -z1000 -A.05 -ps -h360 some_image.jpg
  ...
displayed 1000 frames, which took 20.42200ms on average to create

> lux -i2 -P -z1000 -A.05 -ps -h360 some_image.jpg
  ...
displayed 1000 frames, which took 11.54809ms on average to create

repeats:

displayed 1000 frames, which took 11.47905ms on average to create
displayed 1000 frames, which took 11.47224ms on average to create

The average rendering times you get from lux are a higher-level benchmark than 'normal' benchmarking code - like 'application benchmarks' which test common applications on different machines, rather than synthetic benchmarks tailor-made to look at specific features. So using lux for benchmarking SIMD code gives you a different view, and because it also relies on my wrapper class simd_type, you can experiment with stuff like larger lane counts (which the wrapper rolls out to multi-Vec operations) - lux uses that by default, because it's faster than using single hwy::Vecs. It's all configurable, typically with a single change to the source, because it's all quite orthogonal template metacode and lux uses the functional paradigm.

While I'm singing the praises of my software, let me add another presumptuous suggestion. I have mentioned before that my library vspline has what I call 'wielding code', which rolls out SIMD code over nD arrays of 'xel' data using multiple threads. I found that this code is absolutely essential, so I suggest you cast an eye on it. It's extremely easy to use: once you have a class derived from vspline::unary_functor containing the xel pipeline code, and an nD array of xels (passed as a vigra::MultiArrayView, a thin wrapper around a pointer, a shape and a set of strides), you can use vspline's transform family, which has variants taking input as discrete target coordinates or as arrays of input data, and offers 'apply' as well. The whole nitty-gritty of distributing workload tickets to the worker threads, creating sets of vectors from the data and feeding them to the SIMD code is coded in vspline; all you need is a simple functional construct and the call to 'transform'. The functional constructs can oftentimes be 'magically' derived from a template which doesn't even use any explicit SIMD code, so you can have a rich application up and running very quickly. In a way it's an ideal tool set to combine with a SIMD library: the SIMD library handles the variety of CPUs and their instruction sets, simd_type wraps that in a common interface, and vspline uses the common interface to roll the code out over arbitrarily strided nD raster data. The name 'vspline' suggests that the library is mainly a b-spline library, but the 'wielding' functionality which evolved in it is universal - it just happens to fit very well into a b-spline library, where you want to code interpolation with b-splines. Lux, in turn, is the testbed for vspline, and once new code has proven useful in lux, it moves upstream.
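The 'wielding' pattern described above - a per-element functor rolled out over strided data by worker threads - can be sketched roughly as follows. This is a simplified stand-in, not vspline's actual API (which works on vectors of nD data, not single scalars):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Simplified 'transform' in the spirit of the description: apply functor
// f to every element of a strided 1D view, split across worker threads.
// The real wielding code hands out workload tickets and feeds whole
// vectors to the functor; here each thread takes an interleaved subset.
template <typename F>
void transform(F f, float* data, std::size_t n, std::ptrdiff_t stride,
               unsigned nthreads = 2) {
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < nthreads; ++t)
    workers.emplace_back([=] {
      for (std::size_t i = t; i < n; i += nthreads)  // disjoint indexes
        data[i * stride] = f(data[i * stride]);
    });
  for (auto& w : workers) w.join();
}
```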

jan-wassenberg commented 2 years ago

Thanks, that is indeed nice to hear. I like your idea of benchmarking via lux, I haven't seen many that are able to compare SIMD wrappers directly.

FYI the StoreInterleaved* update is in progress. If we're adding 32-bit, it's not much harder to support all types, so will do that. 1,000 line patch and counting :)

jan-wassenberg commented 2 years ago

StoreInterleaved for all types has now landed :)

kfjahnke commented 2 years ago

I already pulled the new code and was just about to route to it... I'll let you know how it works for me! Thanks for taking that on, I see it's a big coding effort, but I think it will be a valuable addition.

kfjahnke commented 2 years ago

It works here in lux! And it does speed things up. I have a good test for the use of interleaved memory, using lux: stitching a panorama. This code uses a lot of intermediate arrays of data, which all need to be saved to memory (now using the new code you provided) and read again later (which will use the LoadInterleaved variants once they become available). On AVX2, I get measurable but moderate speedups (2 measurements each):

reference, using Vc                         21.27300 sec, 21.28300 sec
highway variant without StoreInterleaved    21.64900 sec, 21.69700 sec
new highway variant with StoreInterleaved   20.86700 sec, 20.81200 sec

With SSE4.2, I get this:

reference, using Vc                         29.42900 sec, 29.42100 sec
highway variant without StoreInterleaved    28.71300 sec, 28.79600 sec
new highway variant with StoreInterleaved   28.66500 sec, 28.42200 sec

Surprise! Even the variant without StoreInterleaved is faster than my reference implementation, and the difference to the new code is not very significant - I'd have to repeat the measurements many times to get a good average, this is just a quick shot. So these are very nice results, considering that the memory access is only a small part of what's going on.

jan-wassenberg commented 2 years ago

Nice! Great to have a repeatable benchmark.

FYI this doesn't affect your float32 code, but there is a bug in the i16x4 path, shuf_B2 = shuf_B0 should be shuf_B2 = shuf_A0. It's unfortunate the existing test didn't catch this, but new ones for the upcoming LoadInterleaved did.

kfjahnke commented 2 years ago

No problem - for the time being I'm avoiding mixing in short data, so it did not trip up my code. Looking forward to the new LoadInterleaved! - my highway variants now tend to surpass the Vc code, so I may eventually 'decamp'. I've factored out my de/interleaving code to https://bitbucket.org/kfj/pv/src/master/vspline/interleave.h, 'bunch' and 'fluff' are now free functions, and 'bunch' is only waiting for LoadInterleaved :D

kfjahnke commented 2 years ago

ping to krita. @amyspark, do you follow this? It might help with your porting effort!

jan-wassenberg commented 2 years ago

FYI LoadInterleaved2+4 are done; I've found a way to reuse much more code, so it's only 200 lines. LoadInterleaved3 is yet to be done, that's probably going to take another few days.

kfjahnke commented 2 years ago

Okay, great, I have code ready to try all variants in lux - as soon as your code comes online!

kfjahnke commented 2 years ago

Your docu got jumbled, the text for LoadInterleaved[34] in the quick reference shows a wrong signature.

jan-wassenberg commented 2 years ago

Thanks for catching that, fixed :)

kfjahnke commented 2 years ago

I've changed my code to use the new functions, and all seems well. Thanks for your effort! Speed-wise, the results are inconclusive (using gather/scatter for de/interleaving is also quite fast), but the new code performs well, and lux with highway is now roughly on par with my Vc-based implementation. The next interesting thing I'm now looking forward to is seeing highway-based lux builds on non-i86 platforms. Until now, the only code I could provide for those platforms was my 'goading' implementation, so I anticipate improvements with the highway version. I've further 'commodified' the use of my SIMD implementations: there are now separate versions of simd_type.h for Vc, highway and std::simd, and their structure is very similar, so they can be used in the same way.

jan-wassenberg commented 2 years ago

Glad to hear it works. Interesting that scatter/gather is competitive, that's surprising.

kfjahnke commented 2 years ago

Warning: this is another one of my lengthy posts and it tries to lure you to strange territory...

To establish the difference between g/s and Load/StoreInterleaved, I wanted code to roll out the operation to a large set of different 'xel' data - different fundamentals, different channel counts, stored in arrays of different dimensions, processed with a variety of vspline functions. An ideal candidate for this roll-out is my python module for vspline, which comes with a test program 'warmup.py' doing just that, timing all operations precisely. This program was initially made to judge how much overhead the parsing and JITing of the C++ code would consume, and how the JITed code would perform compared to a version which links in precompiled binary from a shared library. Note the repo is still lacking the highway variant; the link is just for reference.

First the good news: when I 'slot in' my hwy implementation of simd_type instead of the Vc-based implementation, cppyy accepts the code just as happily for JITing and the test runs through the entire suite without complaints. This is good news because cppyy's compiler variant is, let's say, mildly exotic. What surprised me, though, was the relatively bad performance.

So now for the bad news, which I figured out when trying to link in precompiled binary. I had compiled the shared library with flags to produce AVX2 code, and when the python side tried to invoke functions it expected in the .so, it failed: they weren't there.

It turned out that the python side expected to call SSSE3 functions, whereas the .so had AVX2 code. And if I compiled the .so with flags to produce SSSE3 code, it linked just fine and was even quite fast - but I reckon AVX2 would be significantly faster.

cppyy compiles the C++ code to run on the current machine - everything else would not make sense, because the result of 'pulling in' the C++ code does not persist. Hence the part of the code which is JITed by cppyy is - AFAICT - done with -march=native (or something similar to this effect). But whatever information cppyy passes 'into' the C++ code it consumes (in terms of 'what CPU are we running on') seems to be insufficient to make highway 'decide' to use the right, 'native', ISA. This is a bit like the problems I had initially in getting hwy to produce AVX2 binary - I had to pass additional flags. So much for my musings, here's hard evidence:

.../vspline$ python
Python 3.8.10 (default, Mar 15 2022, 12:22:08) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cppyy
>>> cppyy.include ( "hwy_simd_type.h" )
True
>>> v = cppyy.gbl.vspline.hwy_simd_type(float,16)
>>> v
<class cppyy.gbl.N_SSSE3.simd_type<float,16> at 0x58053d0>
>>> v.vec_t
<class cppyy.gbl.hwy.N_SSSE3.Vec128<float,4> at 0x584c6a0>
>>> cppyy.include ( "Vc/Vc" )
True
>>> cppyy.gbl.Vc.float_v
<class cppyy.gbl.Vc_1.Vector<float,Vc_1::VectorAbi::Avx> at 0x6648520>
>>> cppyy.cppdef ( "int z = sizeof(Vc::float_v) ;" )
True
>>> cppyy.gbl.z
32

First, I 'pull in' hwy_simd_type.h, which has everything bundled to give instant access to the SIMD implementation. I instantiate the hwy_simd_type template for 16 floats, and the type info already shows the hwy namespace; further inspection shows the underlying hwy vector to have four floats. Next I pull in Vc and look at a Vc::float_v. You can see that Vc has produced an Avx vector with size 32 == eight floats. So Vc 'gets it', but hwy doesn't.

After this lengthy excursion, a simple question: for the sake of my test, is there a quick way to tell highway inside the C++ code to use a specific ISA? Then I could provide a broad comparison of goading vs. g/s vs. Load/StoreInterleaved, which would tell us just how much your recently added de/interleaving code helps performance on some target ISA.

The harder question is this: how can hwy figure out better which ISA to pick? I suspect that cppyy might be made to emit more information about the current CPU - or hwy might be made to look at information cppyy provides which it currently ignores.

Aside from these 'teething problems', I hope I have whetted your appetite: cppyy is an excellent tool to grab a chunk of C++, make its every detail available to a python process and 'run' it from there. It's excellent for testing, because you can write the test code in python, with short turn-around times and all the benefits of an interpreted language.

jan-wassenberg commented 2 years ago

Interesting, a Python wrapper sounds potentially useful.

The target detection logic is in detect_targets.h. You can indeed #define HWY_BASELINE_TARGETS HWY_AVX2 - this will override the detection and likely enable AVX2. The caveat is: Highway includes some safety checks to disable AVX2 on older compilers where we've encountered compiler bugs. That may be what you're seeing? That can also be overridden by #define HWY_BROKEN_TARGETS 0.

If that's still not it, the difference with VC might be that we're pickier about which CPU flags constitute AVX2: PCLMUL, AES, BMI2, FMA, F16C. It's possible cppyy is not defining one of those.
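The overrides suggested above would go before the first highway include (a config sketch; whether AVX2 then actually gets used still depends on the compiler flags in effect):

```cpp
// Force the baseline to AVX2 and disable the compiler-bug blocklist.
// Both defines must precede the first highway include.
#define HWY_BASELINE_TARGETS HWY_AVX2
#define HWY_BROKEN_TARGETS 0
#include <hwy/highway.h>
```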

kfjahnke commented 2 years ago

Thanks, the override you proposed works here - it was enough to define HWY_BASELINE_TARGETS.

Using the override to fix the code to SSE4.2 and AVX2, I could not detect much difference between using the new de/interleaving functions and a set of gather/scatter operations to interface with memory. This is surprising! I would have assumed that, especially on SSE4.2, which I think has no proper gather/scatter support, the load/shuffle and shuffle/store approach would make a measurable difference, whereas on AVX2 I would have expected more similar performance. Instead, I find that the AVX2 code benefits from the new de/interleaving functions, but only by about 4%, whereas the SSE4.2 code seems to be a tad slower with them.

This result may of course be due to several factors - the memory access in itself may be very much the limiting factor, and the precise way of how the cache lines interface with the SIMD registers may be irrelevant by comparison. My tests 'plough through' nD arrays with 1e6 elements, doing a bit of arithmetic on the data, so there is a lot of memory traffic.

The tests also shed more light on my simd_type implementation with highway vs. the Vc variant. Altogether, my hwy-based implementation comes out slightly slower in this test than the Vc-based one (ca. 10%), which sounds bad but is actually good news, because the code is new and hasn't seen very much tweaking - and it was written to emulate the Vc code, so some things were done to mimic Vc closely which might be better implemented by writing more 'hwy style' code.

Interesting that scatter/gather is competitive, that's surprising.

Did you do any benchmarking comparing the two modes of operation? Maybe I've made a mistake. But the application-level tests with lux did also point in this direction, I just did not find them so significant because there is little deinterleaving in lux.

jan-wassenberg commented 2 years ago

Good to know. So looks like one of the required feature flags was not set by the compiler.

LoadInterleaved3 is 3 shuffles and 4 blends; on AVX2, add 2 blends and one cross-block shuffle. 32-bit gather has a reciprocal throughput of 10 cycles, which is still a bit slower, especially if you can't hide its 20-cycle latency. For 32-bit and SSE4, it makes sense that scalar emulation can compete. Do you think we should add a special case for that?

kfjahnke commented 2 years ago

So looks like one of the required feature flags was not set by the compiler

@wlav has clarified this, and with appropriate flags passed in via EXTRA_CLING_ARGS I have managed to solve the problem - the flags are correctly passed through and hwy reacts correctly.

While working with the hwy variant of my python interface, I had problems accessing AllocateAlignedBytes and FreeAlignedBytes, the functions were not found. I had built hwy with the default settings, which produce static libraries (.a). When I built shared libraries instead, the functions were found. Do you have an idea what might cause this?

Concerning special-casing Load/StoreInterleaved, I'm not competent to make suggestions. I'm happy to have the functions to call, and to remain blissfully ignorant of how they are routed 'further down'. I think it's good to have the functions present, because they clearly state the caller's intention to de/interleave. Then they can evolve to provide the most efficient implementation, with the details remaining opaque to users.

I think we have a slight mismatch of design philosophy: I'm generous on the interface side and allow a wide range of signatures, routing stuff I don't have specialized code for through some generic construct. hwy seems to follow a philosophy of 'if there is no specialized fast code for it, we don't offer it to the user', which does prevent the user from making bad choices performance-wise, but it's sometimes a real nuisance when something you thought should work turns out to be missing because there is no efficient implementation for it. Of course the 'provide-something-generic' philosophy can end up producing 'Potemkin' code (remember atan2) where users think they are calling specialized code when in fact they are not. It's a thin line to walk. Maybe one could introduce further flags to dis/allow code which does not perform well - for rapid prototyping or porting, anything will do to get the thing up and running initially. Then, when tweaking for performance, it's nice to be told when one is using slow fallback code.

... especially if you can't hide their 20 cycle latency

My hwy_simd_type is most efficient on AVX2 when using a number of hwy vectors equivalent to 4*HWY_MAX_BYTES. This may well be due to the cost of setting up the processing units, which can be 'amortized' by subsequently running the intended operation several times - all of this is a bit of 'black magic' where the CPU has capabilities we rarely think of, like internal pipelining, branch prediction, speculative execution etc. I reckon you're much closer to the bare metal than I am, with my rather high-level generic constructs, so you should be better at making judgements about which optimizations to try.

kfjahnke commented 2 years ago

FYI - I've uploaded the extended version of my bspline python module, now with optional use of highway for SIMD - via my hwy_simd_type wrapper class.

jan-wassenberg commented 2 years ago

Glad to hear we understand the cause for the missing feature flag.

What's the problem with AllocateAligned? If I understand correctly, it works with a shared library build (for which you defined BUILD_SHARED_LIBS in CMake?). But you got a linker error with a static library? That's strange, highway_export.h should be defaulting HWY_DLLEXPORT to nothing (unless HWY_SHARED_DEFINE is defined) so those would just be normal functions, and aligned_allocator is part of HWY_SOURCES which define the hwy library.

hwy seems to rather follow a philosophy of 'if there is no specialized fast code for it, we don't offer it to the user', which does prevent the user from making bad choices performance-wise, but it's sometimes a real nuisance when something you thought should work turns out to be missing because there is no efficient implementation for it. Of course the 'provide-something-generic' philosophy can end up producing 'Potiemkin' code (remember atan2) where users think they are calling specialized code when in fact they are not. It's a thin line to walk. Maybe one could introduce further flags to dis/allow code which does not perform well

Yes, that's a good summary, and indeed it's a fine line. A flag might be OK, but is fairly thin protection - once we opt in, there is no further differentiation or guardrail. Do you have an example of what functions are sorely missed? Perhaps we could also use deprecation warnings in the compiler? Those can point to individual usages.

kfjahnke commented 2 years ago

What's the problem with AllocateAligned? If I understand correctly, it works with a shared library build (for which you defined BUILD_SHARED_LIBS in CMake?). But you got a linker error with a static library? That's strange, ...

It's a bit more complicated. What I try to do is pull the static library into a shared library, to bundle all symbols I want to access in one handy .so package, as I do with libVc.a. In my makefile I have:

get_hwy.so: get_hwy.cc
    clang++ -march=native -std=c++11 -O2 -fPIC -shared -o get_hwy.so get_hwy.cc -lhwy

and the file get_hwy.cc only includes some header files:

#include <hwy/highway.h>
#include <hwy/contrib/math/math-inl.h>
#include <hwy/aligned_allocator.h>
#include "vspline/hwy_atan2.h"

But the resulting shared library get_hwy.so contains practically no symbols:

$ nm -D get_hwy.so
                 w __cxa_finalize
                 w __gmon_start__
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable

It must be some silly mistake I'm making, because the symbols I want are clearly there, both in the static and the shared lib, and if I pull in the shared library directly, I have no problem.

$ nm /usr/local/lib/libhwy.a | grep AlignedBytes
00000000000000c0 T _ZN3hwy16FreeAlignedBytesEPKvPFvPvS2_ES2_
0000000000000000 T _ZN3hwy20AllocateAlignedBytesEmPFPvS0_mES0_

$ nm -D /usr/local/lib/libhwy.so | grep AlignedBytes
0000000000002460 T _ZN3hwy16FreeAlignedBytesEPKvPFvPvS2_ES2_
00000000000023a0 T _ZN3hwy20AllocateAlignedBytesEmPFPvS0_mES0_

I just thought you might have seen similar problems and have a solution handy, like #define XYZ.

Do you have an example of what functions are sorely missed?

When you're asking for functions, tan, pow and atan2 spring to mind, which I have to provide in my wrapper, where atan2 already has a good solution. But I rather miss stuff which is present only for certain types/ISAs and not for others. One example is division - I think we've discussed this. As a user I'd expect both float and integer division, accessible both via an operator function and via 'Div'. And then I'd like stuff like rdiv, and possibly even % and fmod, even if there are no dedicated SIMD instructions for these operations: the library could provide a good implementation, rather than burdening me with piecing one together, or falling back to goading.

Another thing I struggle with is indexing for g/s operations. I think the library might well provide indexing with arbitrary integral types, to e.g. allow g/s of small values which are far apart, or in situations where g/s of some type is not directly supported by a given ISA. Falling back to goading is one way of dealing with that - always with the hope that the compiler will sort it out with autovectorization - but I think a layer of explicit code might perform better.

In both these cases, the library might provide code even where a direct SIMD implementation is not available, but this is where the library might have a way of warning the user that the operation may not be the most efficient choice, or forbidding such a choice if some compile-time flag says so: an opt-out of the helpful, but less efficient 'bridge' code. Just like you might #define assertions away when your code goes to production.

jan-wassenberg commented 2 years ago

clang++ march=native -std=c++11 -O2 -fPIC -shared -o get_hwy.so

Ah, I see. highway_export.h requires HWY_SHARED_DEFINE to be defined, otherwise it doesn't mark functions as exported/visible.

Thanks for sharing the examples!

tan, pow, atan2

Patches for those are very welcome :D It seems like those can and should be implemented natively without requiring a "this is slow" flag.

Can you help me understand the benefit of rdiv? Seems that we could just swap the arg order (or provide a wrapper that does that).

So far it seems like int division/modulo and fmod would be a good example of something that might benefit from a SLOW macro/flag. A patch for those would also be welcome :)

kfjahnke commented 2 years ago

Ah, I see. highway_export.h requires HWY_SHARED_DEFINE to be defined, otherwise it doesn't mark functions as exported/visible.

Do I need to #include highway_export.h? And should I #define HWY_SHARED_DEFINE in get_hwy.cc?

jan-wassenberg commented 2 years ago

The Highway headers and source files already include highway_export. I believe you'd need to define HWY_SHARED_DEFINE in the entire build to make sure that the Highway sources (which will produce the libhwy.so) see that.

kfjahnke commented 2 years ago

On top of setting HWY_SHARED_DEFINE I needed an additional linker parameter to pull all symbols from the static highway libraries into the shared library get_hwy.so I was building. The makefile now says:

get_hwy.so: get_hwy.cc
    clang++ $(ccflags) -o get_hwy.so -Wl,--whole-archive /usr/local/lib/libhwy.a /usr/local/lib/libhwy_contrib.a -Wl,--no-whole-archive get_hwy.cc -lhwy -lhwy_contrib

note the linker argument --whole-archive, which pulls in all symbols from the static lib. I had wrongly assumed that all symbols from the static libraries would be pulled into the shared library by linking to them with -lhwy and -lhwy_contrib.

With these modifications I now have the symbols in the shared lib:

$ nm -D get_hwy.so | grep AlignedBytes
0000000000003520 T _ZN3hwy16FreeAlignedBytesEPKvPFvPvS2_ES2_
0000000000003460 T _ZN3hwy20AllocateAlignedBytesEmPFPvS0_mES0_

and I can link that to my application to 'pull in' highway.

Can you help me understand the benefit of rdiv

I meant division with remainder, but I got the token wrong - in the standard library it's std::div. Another handy function which returns two values is sincos, which returns both the sine and the cosine. This might also be a candidate to include in hwy.

kfjahnke commented 2 years ago

With the python module now functioning as intended, I can give a figure for the effect of the Load/StoreInterleaved functions vs. using g/s. With my test running over a wide range of vspline functions (warmup.py) I get a speedup of 3.2%. That's a fair amount! Scale that up to a server farm and you save millions :D

jan-wassenberg commented 2 years ago

Glad to hear it works now :)

Ah, sincos indeed sounds useful and would be happy to have that in the math library as well.

3% sounds worthwhile indeed for just a few instructions added, thanks for sharing the result!

kfjahnke commented 2 years ago

Just a quick update: my stand-in class mimicking Vc::SimdArray has now been in use (and tweaked) for a good while, and it seems to be running smoothly. I have interesting feedback from FreeBSD, who offer a port of lux for their OS: the maintainer told me that he switched to the highway-based SIMD backend after discovering that the Vc-based one did not produce the SIMD instructions he was expecting to see. I also had reports of a lux build for an Apple M1 producing a significant performance improvement over using 'goading' code, which was the only option I had for Apple silicon before the highway-based backend became available.

In my own use of the highway-based backend, I have noticed steady performance improvements over time, and whenever I pulled new code from your repo, it slotted in nicely without any hiccoughs. For the systems I have access to (up to AVX2), highway-based binaries are now roughly on par with Vc-based binaries, and I suspect that on better i86 ISAs they will outperform the Vc-based code. Due to highway's wider range of supported hardware, I think I may make it the default backend for my SIMD-using code. highway has given me the option to opt out of using Vc, which has now switched to maintenance mode and will not support newer ISAs. And it's enabled me to widen my 'feeding spectrum' to non-i86 CPUs. Thanks from a happy user for keeping up the good work!

jan-wassenberg commented 2 years ago

That's great to hear. Thank you kindly for sharing your observations and results, and congrats on successfully integrating with your array wrapper :D