google / highway

Performance-portable, length-agnostic SIMD with runtime dispatch
Apache License 2.0
4.05k stars 308 forks source link

Indices <-> Vec #416

Closed yzazik closed 2 years ago

yzazik commented 2 years ago

Hello, first, hat's off, it's all great, I mean, took me a few turns to find the ramp down to this highway and I could argue about some (probably half) of the things in design philosophy (too bad there is no Discussions tab), but now I have a couple of immediate questions:

  1. VI TableLookupBytes(V bytes, VI from) doesn't actually return VI but V, how do I convert one to/from another?
  2. About V TableLookupLanes(V a, VI). What's the rationale of having float/double indices? Thank you.
jan-wassenberg commented 2 years ago

Hi @yzazik , feel free to open another issue for discussing the philosophy :)

Thanks for pointing this out. A bit of background: TableLookupBytes initially required indices/table of the same type (integer) and count. We had a use-case for a half-vector of indices but full-length table, and also mixing signed/unsigned indices/tables. Unfortunately looks like I introduced several discrepancies when allowing this:

Scalar: HWY_API Vec1<T> TableLookupBytes(const Vec1<T> in, const Vec1<T> from) {
RVV: HWY_API V TableLookupBytesOr0(const VI v, const V idx) {
x86_512: HWY_API Vec512<T> TableLookupBytes(Vec512<T> bytes, Vec512<T> from) {

Is that what you are asking about? Are there any others? The intention here is to have from/idx have lane type TI, and also return that; we can fix that.

For TableLookupLanes, there's no intention of allowing float/double indices, VI denotes integers. It may compile with float indices on some platforms with lax conversions between vectors, but not all. We could add some static_asserts?

yzazik commented 2 years ago

Thank you so much for the reply, having someone around to help made me somehow more appreciative and this project more valuable. I'm doing some 8/16-bit image processing and, as you can imagine, I lack quite a few of tools here, this could get philosophical as to why 8/16-bit tooling got... left behind, I also wonder how much of that the jpegxl guys had to add to deal with those. Regardless, I started bringing 8/16-bit tools in (logical shift, rotate, etc), eventually, I would need to make a portable 8-bit Compress(), remains to be seen whether I would get better performance running that or promoting/demoting to/from 16-bit on NEON, still have some troubles compiling SVE. Somehow I was under impression that TableLookupBytes() and TableLookupLanes() should behave the same for 8-bit lanes, but underneath it was a shuffle against permute, and I was still fluorescent green deciphering what's I in VI (integer or indices) and looking at those two:

template <typename T, typename TI>
HWY_API Vec256<TI> TableLookupBytes(const Vec256<T> bytes, const Vec256<TI> from)
{
  return Vec256<TI>{_mm256_shuffle_epi8(bytes.raw, from.raw)};
}

HWY_API Vec256<float> TableLookupLanes(const Vec256<float> v, const Indices256<float> idx)
{
  return Vec256<float>{_mm256_permutevar8x32_ps(v.raw, idx.raw)};
}

I though I stands for indices and then I saw floating indices which rose a question too. static_asserts with human description for templates really help, those template errors are often longer than my brain waves.

jan-wassenberg commented 2 years ago

@yzazik

I would need to make a portable 8-bit Compress(), remains to be seen whether I would get better performance running that or promoting/demoting to/from 16-bit on NEON, still have some troubles compiling SVE.

What's the issue with compiling SVE?

Some ops are not supported for 8-bit or even 16 because the hardware support is lacking. For example, 8-bit compress is quite difficult on AVX2 - we'd probably need to split that into 4 separate writes (otherwise the lookup table is too large), and it's not supported in AVX3 until VBMI2. I'd recommend promoting to u16.

I was under impression that TableLookupBytes() and TableLookupLanes() should behave the same for 8-bit lanes, but underneath it was a shuffle against permute

Ah, the docs do mention that TableLookupBytes splits the input into independent blocks, but that's in a section header above the documentation for that particular function.

I though I stands for indices and then I saw floating indices which rose a question too.

That's right. Ah, I see, Indices256 is actually also integer, it's just type-tagged so that we do not accidentally mix up indices for float and double etc.

Thanks for pointing to these, I will add some explanation of VI, that indices are always integers, and fix the TableLookupBytes type inconsistency.

yzazik commented 2 years ago

Still puzzled about conversion of Vec to/from Indices, I'm constructing indices myself in the Vec, then I need to use it as Indices in TableLookupLanes(), how would I do that? I mean, it looks like it's either through memory Store()/SetTableIndicies() or go down to .raw.

jan-wassenberg commented 2 years ago

@yzazik good point, we didn't yet need that. Currently the best you can do is Store+ SetTableIndices as you say - .raw will not work with SVE. We can add template HWY_API unspecified_indices TableIndicesFromVec(V vec), does that sound like the right interface?

yzazik commented 2 years ago

I think just IndicesFromVec() is long and elaborate enough. I mean, there are no actual (in-memory LUT) tables in here, even TableLookupLanes() name seems a bit off since it's is all in-vector (permute) thing. Thank you.

yzazik commented 2 years ago

Works, great job, love it, many thanks.

jan-wassenberg commented 2 years ago

You're welcome, thanks for letting us know :)