
[AArch64] Using tbl to optimize indirect memory access #100689

Open vfdff opened 1 month ago

vfdff commented 1 month ago

```c
#include <arm_sve.h>

#define N 100
// Make sure src is small enough to be held by the table; here assume -msve-vector-bits=512.
#define Length 16

void foo (float dest[N], unsigned int index[N], float src[Length], int n) {
#ifdef TABLE
  svbool_t predict = svptrue_b32();
  svfloat32_t srcv = svld1_f32(predict, src);
  for (int j = 0; j < n; j += Length)
  {
    predict = svwhilelt_b32_s32(j, n);
    svuint32_t indexv = svld1_u32(predict, index + j); // 32-bit indices in [0, 15]; a 512-bit Z register holds 16 floats
    svfloat32_t tmp = svtbl_f32(srcv, indexv);
    svst1_f32(predict, dest + j, tmp);
  }
#else
  #pragma clang loop vectorize(assume_safety)
  for (int j = 0; j < n; j++) {
    dest[j] = src[index[j]];
  }
#endif
}
```



* The two versions above are equivalent if we assume **-msve-vector-bits=512**, which guarantees that src is small enough to fit in a single Z register, so using tbl saves one load in the loop body.
llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-aarch64

Author: Allen (vfdff)

* tbl.cpp: https://gcc.godbolt.org/z/3W5e17o9f
dzaima commented 1 month ago

This isn't really SVE-specific, nor even AArch64-specific; NEON could support vectorizing similarly with its tbl (although for non-8-bit data/index types it would have to do some arithmetic to expand the indices, and of course the cap below which it is profitable would be lower); x86 SSSE3, AVX2, and AVX-512 also have suitable shuffles (AVX2 and above have 32-bit ones best suited to the example, and AVX-512 has complete shuffles for all of 8/16/32/64-bit elements); RVV also has vrgather.vv.
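For the simplest case, a single-register NEON byte-table lookup is one instruction; a minimal sketch (my own illustration, not code from this thread):

```c
#include <arm_neon.h>

// One-instruction, 16-entry byte table lookup: each result byte is
// table[indices[i]]; index bytes >= 16 produce 0 by TBL's semantics.
uint8x16_t lookup16_u8(uint8x16_t table, uint8x16_t indices) {
    return vqtbl1q_u8(table, indices);
}
```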

IIRC a float src[16] C++ function argument doesn't guarantee that there are Length dereferenceable elements in src, neither as a minimum nor a maximum, so that information would have to be provided some other way (or checked at runtime with both a fault-only-first load of the table and a comparison of the indices against the presumed limit); C has float src[static 16] to provide a minimum dereferenceable length, I believe, but it's still legal to index outside of that.
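For reference, the C spelling mentioned above looks like this (a sketch; the signature mirrors foo from the issue):

```c
// C99 and later: "static 16" promises the caller passes at least 16
// dereferenceable floats in src (a minimum only; indexing past 16 may
// still be legal), so the whole 16-element table can be loaded eagerly.
void foo(float dest[], unsigned int index[], float src[static 16], int n);
```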

One place where an optimization like this would be more trivial to apply is a global array, whose size is exactly known; especially useful for lookup tables, e.g. as sketched below.
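A hypothetical example of that shape, where the table's size and contents are fully visible to the compiler:

```c
// The table is a global with a known size, so the compiler can prove
// that loading all 16 elements up front is safe.
static const float lut[16] = {
    0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f,
    8.f, 9.f, 10.f, 11.f, 12.f, 13.f, 14.f, 15.f
};

void remap(float *dest, const unsigned *index, int n) {
    for (int j = 0; j < n; j++)
        dest[j] = lut[index[j]];
}
```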

LLVM IR doesn't have an architecture-independent dynamic shuffle instruction currently, which is presumably a hurdle to overcome for optimizations like this to be possible to do (especially with scalable vectors).

I happen to have very recently written implementations of dynamic shuffles of small tables like this for AVX2 and NEON via intrinsics. E.g. for 32-bit elements and a 16-element source table, on Haswell with AVX2 it's faster than a gather or separate scalar loads to use vpermd, even though it needs two of those plus a vpblendvb (producing 8 elements). Even 32 elements is roughly on par with the baseline, with four vpermds and three vpblendvbs. Other element types (8/16/32/64-bit) all have approximately the same maximum table size below which they're beneficial (128 bytes). I don't have non-ancient AArch64 hardware to test NEON perf on, but via the technique of unzipping the table data into a separate LUT for each byte of the element type, and merging the tbl results back together with zips in the loop, it ends up pretty nice: on my phone's A53 it's faster for all of 8/16/32-bit elements right up until 64 elements, where tbl's 64-byte cap is hit (though the bound would probably be significantly lower on more modern hardware). Edit: got someone to run this on an Apple M1 - results are the same, i.e. beneficial for 8/16/32-bit elements; 64-bit is mildly better at ≤32 elements.
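A minimal sketch of the two-vpermd-plus-vpblendvb pattern described above, for the 16-entry 32-bit case (function and variable names are my own):

```c
#include <immintrin.h>

// Look up 8 32-bit elements in a 16-entry table held in two YMM
// registers: vpermd indexes each half by idx & 7, and the blend picks
// the high half wherever the index is >= 8.
__m256i lookup16x32(__m256i lo, __m256i hi, __m256i idx) {
    __m256i from_lo = _mm256_permutevar8x32_epi32(lo, idx);
    __m256i from_hi = _mm256_permutevar8x32_epi32(hi, idx);
    __m256i use_hi  = _mm256_cmpgt_epi32(idx, _mm256_set1_epi32(7));
    return _mm256_blendv_epi8(from_lo, from_hi, use_hi);
}
```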

vfdff commented 1 month ago

Thanks for sharing. Can we add a __builtin_assume(index[j] < Length) to make sure this transformation is valid? https://gcc.godbolt.org/z/hPEPWbW34
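Presumably something along these lines (my paraphrase; the exact code is in the godbolt link):

```c
#pragma clang loop vectorize(assume_safety)
for (int j = 0; j < n; j++) {
    // Promise the compiler that every index fits the 16-element table.
    __builtin_assume(index[j] < Length);
    dest[j] = src[index[j]];
}
```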

dzaima commented 1 month ago

Ah yep, an assume does work for setting a maximum for the indices. The minimum is still moderately problematic, though. Technically I think a statement like (void) src[Length-1]; is enough to guarantee that Length elements of src can be read, but that's rather obscure and would probably be optimized out before anything useful could be gotten from it.

Another approach is to copy the table to a temporary buffer, giving the same guarantees as a global: https://gcc.godbolt.org/z/eGM5WE9xM. That shouldn't even need #pragma clang loop vectorize(assume_safety) anymore, as dest & index are known separate via TBAA and buf is a new allocation; gcc vectorizes it (though still with a gather load, not tbl, of course), but clang still needs the #pragma.
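Roughly like this (my paraphrase of the linked version; the memcpy-based copy is an assumption about its details):

```c
#include <string.h>

#define Length 16

void foo(float dest[], unsigned int index[], float src[], int n) {
    float buf[Length];
    memcpy(buf, src, sizeof buf);   // buf is a fresh local allocation,
    for (int j = 0; j < n; j++)     // so it cannot alias dest or index
        dest[j] = buf[index[j]];
}
```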

vfdff commented 1 month ago

Thanks. The 2nd approach is really an aliasing issue, which needs a separate solution; I recorded that in PR100968. So on this issue we can keep focusing on making use of the tbl instruction.