google / pik

A new lossy/lossless image format for photos and the internet
MIT License

Allow SIMD usage outside of pik namespace #65

Closed cwfitzgerald closed 4 years ago

cwfitzgerald commented 5 years ago

The SIMD library in here is very nice, and I have been trying to use it in a project. A single minor change is needed to allow these macros to be used outside of the pik namespace.

The only difference is here: https://github.com/google/pik/blob/master/pik/simd/shared.h#L86-L87

#define SIMD_FULL(T) Full<T, SIMD_TARGET>
#define SIMD_PART(T, N) Part<T, N, SIMD_TARGET>

becomes

#define SIMD_FULL(T) Full<T, ::pik::SIMD_TARGET>
#define SIMD_PART(T, N) Part<T, N, ::pik::SIMD_TARGET>

and everything works well.
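
To illustrate, here is a minimal sketch of the kind of usage this enables from another namespace (the include path, the using-declarations, and the alias names are just for illustration; only the macro expansion itself comes from shared.h):

#include "pik/simd/simd.h"  // assumed umbrella header

namespace mygame {
// Only the descriptor templates are pulled in; the target struct that
// SIMD_TARGET expands to is not visible here, which is exactly why the
// macro must spell it as ::pik::SIMD_TARGET.
using ::pik::Full;
using ::pik::Part;

using DF  = SIMD_FULL(float);       // Full<float, ::pik::SIMD_TARGET>
using DI2 = SIMD_PART(int64_t, 2);  // Part<int64_t, 2, ::pik::SIMD_TARGET>
}  // namespace mygame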

Aside: Have you guys considered separating the simd project out? It is one of the best out there that I've found.

jan-wassenberg commented 5 years ago

Thank you for your kind words and interest! This looks like a good change.

FYI we have simplified the macros: SIMD_FULL now expands to a number of lanes rather than a target struct, and a new SIMD_CAPPED is the same but with SIMD_MIN(num_lanes, requested_lanes).

I would definitely like to take the time to make this library more easily available. FYI we have subsequently added AVX-512, but are currently focused on the JPEG-XL specification. Hopefully I can take some time for the SIMD library in August.

cwfitzgerald commented 5 years ago

Awesome!

In my currently-not-on-github fork of it, I actually added a SIMD_TARGET_WIDTH macro as I wanted to conditionally compile based on the width of the vector, so that works perfectly.

One question/suggestion: is there a way to promote a SIMD_PART(int64_t, 2) vector to a SIMD_PART(int64_t, 4) vector on a 256-bit target? Or to combine two 128s into a 256? There doesn't seem to be a function that maps directly to _mm256_zextps128_ps256 or _mm256_set_m128 and friends. It appears that get_half could be used to go the other way. There should probably also be a get_quarter and conversions from 128 -> 512 and 256 -> 512. I don't know if you have implemented this internally already.
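
For reference, those two operations map to the following raw AVX intrinsics on x86. This is plain intrinsics code for illustration, not part of the pik/simd API, and _mm256_zextps128_ps256 availability varies by compiler version:

#include <immintrin.h>

// Zero-extend a 128-bit vector into a 256-bit vector (upper lanes zeroed).
static inline __m256 ZeroExtend128To256(__m128 lo) {
  return _mm256_zextps128_ps256(lo);
}

// Combine two 128-bit vectors: lo fills the lower lanes, hi the upper lanes.
static inline __m256 Combine128s(__m128 hi, __m128 lo) {
  return _mm256_set_m128(hi, lo);
}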

I personally don't have much need for AVX-512 because I'm making a game engine targeting consumer hardware, so even if the hardware has an AVX-512 unit I can't eat the clock speed changes, but the more complete this library is the better!

I don't know if you know, but you got featured on the microscopic subreddit /r/simd, which is how I found this 😄

Until y'all get some time, I'll keep working on minor changes on my personal fork. Thanks again!

jan-wassenberg commented 5 years ago

Ah, makes sense. FYI we also added a SIMD_BITS for that purpose.

Interesting question; I'm not very happy with this aspect of the API. We haven't had a solid use case for that; can you share some detail on where you'd use that 128->256? We had a broadcast_part which is similar but should perhaps be replaced with what you're suggesting. FYI the idea was to avoid a penalty on VSX, which prefers to have parts in the upper lanes, but I'm not sure that was worthwhile.

AVX512: makes sense, but have you seen this discussion (https://lemire.me/blog/2018/08/13/the-dangers-of-avx-512-throttling-myth-or-reality/) on the actual impact? The situation is more nuanced than I initially believed.

jan-wassenberg commented 5 years ago

(PS: thanks for mentioning r/simd, I replied there.)

cwfitzgerald commented 5 years ago

The use case I found is with swizzles. I was trying to swizzle an AoS to an SoA with three 64-bit elements per structure. This works out to exactly three 128-bit instructions to swizzle two structures at a time. 256-bit, on the other hand, is harder and would most likely end up with more instructions (as well as being harder to think about) because of the limited ability to swizzle across blocks. I would like to write code that looks something like this:

SIMD_FULL(int64_t) a, b, c;
#if SIMD_TARGET_WIDTH == 128 || SIMD_TARGET_WIDTH == 256
vl1, vl2, vl3 = swizzle first 2 structures
#if SIMD_TARGET_WIDTH == 128
a = vl1; b = vl2; c = vl3;
#else  // 256-bit
vh1, vh2, vh3 = swizzle next 2 structures
a = combine(vh1, vl1); b = combine(vh2, vl2); c = combine(vh3, vl3);
#endif
#endif
// Do work at full vector length
// Do work at full vector length

I hope that makes sense. I want to swizzle at 128 bits, then do math at N bits. This is actually one of the big things that I like about this library over something like ISPC: ISPC is married to a single bit width for every "pass" over your code, whereas here I can use partials to do 128-bit operations when compiling for 256- or 512-bit targets.
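
To make the 128-bit step concrete, here is a rough sketch in raw SSE/AVX intrinsics rather than the pik/simd API (the Vec3 layout and helper names are just for illustration): three shuffles swizzle two {x, y, z} structures, and on a 256-bit target the two halves are then combined, which is exactly where a combine / zero-extend op would slot in.

#include <immintrin.h>
#include <cstdint>

struct Vec3 { int64_t x, y, z; };  // illustrative AoS element

// Swizzle two consecutive Vec3s into SoA form with exactly three 128-bit
// shuffles: a = [x0 x1], b = [y0 y1], c = [z0 z1].
static inline void Swizzle2(const Vec3* p, __m128i& a, __m128i& b, __m128i& c) {
  const int64_t* q = reinterpret_cast<const int64_t*>(p);  // [x0 y0 z0 x1 y1 z1]
  const __m128d v0 = _mm_castsi128_pd(_mm_loadu_si128((const __m128i*)(q + 0)));  // [x0 y0]
  const __m128d v1 = _mm_castsi128_pd(_mm_loadu_si128((const __m128i*)(q + 2)));  // [z0 x1]
  const __m128d v2 = _mm_castsi128_pd(_mm_loadu_si128((const __m128i*)(q + 4)));  // [y1 z1]
  a = _mm_castpd_si128(_mm_shuffle_pd(v0, v1, 0b10));  // [x0 x1]
  b = _mm_castpd_si128(_mm_shuffle_pd(v0, v2, 0b01));  // [y0 y1]
  c = _mm_castpd_si128(_mm_shuffle_pd(v1, v2, 0b10));  // [z0 z1]
}

// On a 256-bit target: swizzle four structures and combine the halves.
static inline void Swizzle4(const Vec3* p, __m256i& a, __m256i& b, __m256i& c) {
  __m128i al, bl, cl, ah, bh, ch;
  Swizzle2(p, al, bl, cl);
  Swizzle2(p + 2, ah, bh, ch);
  a = _mm256_set_m128i(ah, al);  // [x0 x1 x2 x3]
  b = _mm256_set_m128i(bh, bl);  // [y0 y1 y2 y3]
  c = _mm256_set_m128i(ch, cl);  // [z0 z1 z2 z3]
}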

re AVX-512: that particular article seems to be about programs that hit AVX-512 really hard. Since this is a game, a good portion of my code is serial, so the fraction of time spent deep within SIMD would be pretty low. It would be interesting to benchmark my game running in each mode and see how SIMD affects framerate and microstutters.

cwfitzgerald commented 5 years ago

For a more specific use case, here's an example from my actual code.

These are the changes I made on my fork: https://github.com/BVE-Reborn/pik-simd/commit/49bf06d15c836ee8d932c12eb24c975f6f22db89

This is how I use it (I am assuming AVX-512 doesn't exist because my version doesn't have it): https://gitlab.bvereborn.com/cwfitzgerald/pik-simd-tests/blob/master/fwd.inl

jan-wassenberg commented 5 years ago

Thanks for the example. I'm convinced that combine is useful. Promote could be confused with type conversion; perhaps zero_extend instead?

cwfitzgerald commented 5 years ago

I thought about that; I was just worried about it being confused with zero-extending the underlying values to a larger T. Maybe zero_extend_vector? I would be okay with any of the three, as I was just getting something down as a proof of concept.

jan-wassenberg commented 5 years ago

Fair point; zero_extend_vector sounds good to me. Added to the post-August TODO list :)

jan-wassenberg commented 4 years ago

@cwfitzgerald Good news: https://github.com/google/highway/ now has the standalone library, with some major enhancements. Would be happy to continue the discussion there.
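
For example, the combine and zero-extend operations discussed above are available there. A rough sketch of static-dispatch usage (see Highway's quick_reference for the exact signatures):

#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Combine two half-width vectors (or zero-extend one) into a full vector.
void CombineDemo(const float* lo_ptr, const float* hi_ptr, float* out) {
  using D  = hn::ScalableTag<float>;  // full-width descriptor type
  using DH = hn::Half<D>;             // half-width descriptor type
  const D d;
  const DH dh;
  const auto lo = hn::LoadU(dh, lo_ptr);
  const auto hi = hn::LoadU(dh, hi_ptr);
  hn::StoreU(hn::Combine(d, hi, lo), d, out);   // [lo | hi]
  // hn::ZeroExtendVector(d, lo) would give [lo | zeros].
}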

cwfitzgerald commented 4 years ago

That's amazing! Closing as out of scope.