yanyh15 opened 4 years ago
Some info so far:
Intel intrinsics (probably the best guide; the Intel manual is also very good): https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2
Vectorization examples (mainly AVX/few SVE): https://github.com/patrickf2000/vectorization-examples
The algorithm (From 11/25 meeting, despite later posting):
For translating vector expression to AST
Intel intrinsics:
AST elements:
ROSE parses expressions in postfix form; however, it comes up reversed, so we will need to parse backwards
The for-loop increment will need to be changed; on AVX2 with single-precision floats, it becomes 8
Algorithm:
1) Break up the variable expression into the lvalue and rvalue (the lvalue is the store; everything else is loads and operations) [THIS PART DONE]
2) Maintain a stack of strings
3) Loop through the expression
Example:
for (int i = 0; i<512; i++) {
result[i] = arr1[i] + arr2[i] * x;
}
Using the algorithm:
__m256 _const1 = _mm256_broadcast_ss(&x);
for (int i = 0; i < 512; i += 8) {
__m256 _vec1 = _mm256_loadu_ps(&arr2[i]);
__m256 _result1 = _mm256_mul_ps(_vec1, _const1);
__m256 _vec2 = _mm256_loadu_ps(&arr1[i]);
__m256 _result2 = _mm256_add_ps(_result1, _vec2);
_mm256_storeu_ps(&result[i], _result2);
}
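One thing the example hides: 512 is a multiple of 8, so no remainder handling is needed. A minimal sketch of the general strip-mined form (the function name and signature are made up, not tool output) when the trip count n may not be a multiple of the vector length:

```c
#include <immintrin.h>

void kernel(float *result, const float *arr1, const float *arr2, float x, int n) {
    __m256 _const1 = _mm256_broadcast_ss(&x);
    int i = 0;
    for (; i + 8 <= n; i += 8) {                      /* full 8-wide vector iterations */
        __m256 _vec1 = _mm256_loadu_ps(&arr2[i]);
        __m256 _result1 = _mm256_mul_ps(_vec1, _const1);
        __m256 _vec2 = _mm256_loadu_ps(&arr1[i]);
        __m256 _result2 = _mm256_add_ps(_result1, _vec2);
        _mm256_storeu_ps(&result[i], _result2);
    }
    for (; i < n; i++)                                /* scalar remainder (or a masked tail) */
        result[i] = arr1[i] + arr2[i] * x;
}
```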
After this is working:
expression,
direct array references (including multi-dimensional arrays) using standard vector load/store
for (i ...)
for (j ....)
A[i][j] = B[j][i];
direct array reference with stride using strided load/store
Indirect array reference A[B[i]], use case (sparse matrix), with gather and scatter
for (i = 0; i < n; i=i+1)
A[K[i]] = A[K[i]] + C[M[i]];
A[K[i]] ==> 3Address
tmp1 = K[i];
tmp2 = A[tmp1]; (option 1)
Need to store some attribute info for these two AST nodes to indicate that these two 3-address statements are for A[K[i]], so that gather/scatter can be used for the transformation.
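For reference, a minimal sketch of how A[K[i]] = A[K[i]] + C[M[i]] could map to gather intrinsics (the function name and signature are made up, not tool output). AVX2 provides gathers but no scatters, so the store back through K[i] is done element-wise here; AVX-512 adds _mm512_i32scatter_ps for that part. Repeated indices within one vector of K would also need care in a real transformation.

```c
#include <immintrin.h>

void indirect_update(float *A, const float *C, const int *K, const int *M, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i vk = _mm256_loadu_si256((const __m256i *)&K[i]);
        __m256i vm = _mm256_loadu_si256((const __m256i *)&M[i]);
        __m256 va = _mm256_i32gather_ps(A, vk, 4);    /* gather A[K[i]] (scale = sizeof(float)) */
        __m256 vc = _mm256_i32gather_ps(C, vm, 4);    /* gather C[M[i]] */
        __m256 vr = _mm256_add_ps(va, vc);
        float tmp[8];                                 /* no AVX2 scatter: write back per element */
        _mm256_storeu_ps(tmp, vr);
        for (int l = 0; l < 8; l++)
            A[K[i + l]] = tmp[l];
    }
    for (; i < n; i++)                                /* scalar remainder */
        A[K[i]] = A[K[i]] + C[M[i]];
}
```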
if/else: the if condition expression needs to be transformed to 3-address AST, so the if ends up testing a single variable (a masked-select sketch follows this list).
while loop within a simd loop (not supported yet unless there is a use case for it)
Other cases such as using struct fields,
function calls with regard to declare simd. Inlining is likely the approach for handling function calls with vectorization; check https://community.intel.com/t5/Software-Archive/Vectorization-of-a-function-call/td-p/1007791 and some other resources.
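As a minimal sketch of the if/else case above (assumed loop and variable names, not tool output): once the condition is a single 3-address value, the if maps to a compare that produces a mask, and the conditional assignment maps to a blend/select. On AVX-512 the compare would instead produce a __mmask16 consumed directly by masked operations.

```c
#include <immintrin.h>

/* scalar:  if (a[i] > 0.0f) c[i] = a[i]; else c[i] = b[i]; */
void select_positive(const float *a, const float *b, float *c, int n) {
    __m256 zero = _mm256_setzero_ps();
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 mask = _mm256_cmp_ps(va, zero, _CMP_GT_OQ);  /* the single-variable test */
        __m256 vc = _mm256_blendv_ps(vb, va, mask);         /* pick va where the mask is set */
        _mm256_storeu_ps(&c[i], vc);
    }
}
```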
@patrickf2000 I want to note one thing before I forget. Intrinsics are, or might be, compiler-specific. If the Intel intrinsics are implemented as just a header file/macros, then the SIMD-ized code can be compiled by other compilers such as gcc or clang/llvm. If the intrinsics are implemented specifically for the Intel compiler, the SIMD-ized code can only be compiled by the Intel compiler.
Please check both the Intel intrinsics and the GCC builtins/intrinsics (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). I am not sure which term GCC uses today (builtin or intrinsics).
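For comparison, a minimal sketch of the GCC vector-extension style from the link above next to the Intel intrinsic spelling used earlier in this issue; both forms compile with gcc and clang:

```c
#include <immintrin.h>

/* GCC/Clang vector extension: the type carries the vector width */
typedef float v8sf __attribute__((vector_size(32)));   /* 8 floats = 256 bits */

v8sf add_ext(v8sf a, v8sf b) {
    return a + b;                     /* element-wise add, no intrinsic needed */
}

__m256 add_intrin(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b);       /* Intel intrinsic spelling of the same operation */
}
```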
@yanyh15 I remember researching that, I have all three compilers on my machine so I checked the headers again.
The SIMD types are typedefs to things that may be compiler-specific. They actually look the same across all three, so this attribute may really be the same across compilers (it's basically an attribute like what GCC implements in your link). Internally, they seem to be compiler-specific, but from a source-code perspective, they are the same across compilers. The C/C++ AVX intrinsics are actually specified by Intel; if you look through the Intel manual, they are all there. So basically, we can safely use them in our code; it will be portable across compilers, and they are not dependent on any runtimes.
For our purposes, we can just include the proper headers (I got this done) and we can just use the AVX types. This was the part I got stuck on, but this is more of a Rose thing (and probably more of me not knowing the situation).
Hopefully this makes sense
Thank you Patrick. Let us use the Intel intrinsics as a starting point, and at the same time note the following things:
Thank you, Yonghong
The differences between classic Cray vector, Intel AVX512, and ARM SVE. SVE sets the mask register (predicate) in each iteration of the vectorized loop (https://developer.arm.com/-/media/developer/products/software-tools/hpc/White%20papers/arm-scalable-vector-extensions-and-application-to-machine-learning.pdf). Depending on how the hardware implements the predicate mask, that checking may introduce extra cycles for executing each vector instruction. It is not necessary to do that in the conventional strip-mining approach, since the mask register is only used for the remainder of the vectorized loop iterations that do not fill a full vector. Compare this approach with Cray vector, Intel AVX512, and the RISC-V V extension. @patrickf2000 Intrinsics may give you a different view of which actual instructions are used, so please check the assembly and the instruction manual.
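A minimal sketch of the SVE style described in the Arm white paper (ACLE intrinsics from arm_sve.h; the function name is made up): the predicate is recomputed with whilelt on every iteration, so no separate remainder loop is needed, in contrast to strip mining, where the mask only matters for the final partial vector.

```c
#include <arm_sve.h>

void add_sve(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += svcntw()) {      /* svcntw() = number of 32-bit lanes */
        svbool_t pg = svwhilelt_b32(i, n);       /* per-iteration predicate */
        svfloat32_t va = svld1_f32(pg, &a[i]);
        svfloat32_t vb = svld1_f32(pg, &b[i]);
        svst1_f32(pg, &c[i], svadd_f32_x(pg, va, vb));
    }
}
```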
For RISC-V related work: check http://hwacha.org/, google and spec.
@yanyh15 Sorry for the delay in doing this, but here's my updates so far.
The SIMD IR is defined here: https://github.com/passlab/rexompiler/blob/patrick/simd/src/midend/programTransformation/ompLowering/omp_simd.h
Here are the ideas behind it:
For the IR, I've been debating a change; I'd be curious to know your thoughts on this:
For transforming the loop itself, I'm currently thinking it would be best to do that just before the final generation, because that will be platform dependent. For example, on x86, we would change the loop increment to either 8 or 16, depending on AVX-2 or AVX-512. On Arm, we have to use that special variable/function. I'm not sure about RISC-V yet, but I think it works similarly to Arm. So basically, the overall order would be like this:
Here's a screenshot of how it looks so far; I created a few functions to print out our IR so I can see what I'm doing...
@patrickf2000 I spent some time understanding the design you have. My original thought was that we still use the existing SageIII IR for the vectorizable expressions that are three-addressed, and attach attributes to those SageIII nodes. Attributes are additional info for an AST node that one can add without changing the AST node structure (check http://rosecompiler.org/uploads/ROSE-Tutorial.pdf). The IR you designed can be part of the attribute.
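A minimal sketch of the attribute idea (the SimdOpAttribute class and the "simd" key are hypothetical; addNewAttribute/getAttribute/attributeExists are the AstAttribute mechanism described in the ROSE tutorial):

```cpp
#include "rose.h"

// Hypothetical attribute payload: extra SIMD info hung off an existing SageIII node,
// without changing the node's structure.
class SimdOpAttribute : public AstAttribute {
public:
    enum Kind { LOAD, STORE, MATH };
    Kind kind;
    SgType *elementType;                 // single element type for the whole 3-address expr
    SimdOpAttribute(Kind k, SgType *t) : kind(k), elementType(t) {}
};

void markAsSimdLoad(SgExpression *expr, SgType *elemType) {
    expr->addNewAttribute("simd", new SimdOpAttribute(SimdOpAttribute::LOAD, elemType));
}

bool isSimdMarked(SgExpression *expr) {
    return expr->attributeExists("simd");
}
```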
Another option is to introduce new SageIII nodes for vectorization by defining the IR nodes you created with SageIII (ROSETTA). We only need three nodes: load, store, and math (others such as shuffle and FMA can be added later). The math nodes are binary operations that can extend SgBinaryOp or the top-level SgExpression. Then we just need load/store nodes. Since load/store are architecture terms, we can use names such as arrayRead/arrayWrite, which are more high-level language names. This is just naming.
In terms of where we store the type for SIMD, given that types are the same for the source and destination operands in a vector operation, we only need to store a single type for each 3-address expression or load/store node subtree (no need for type info on each node of the expression/load/store subtree), e.g. store the type in the root node of the 3-address expression/load/store, so that we can still keep the IR type-neutral. It seems to me that you are doing it this way.
We need to sort out IR first for both the 3-address lowering and the generation of intrinsics call. I will be thinking of it along the day and update this thread.
1). SgSIMDLoad (off SgPntrArrRefExp), SgSIMDStore (off SgPntrArrRefExp) and SgSIMDBinaryOp (off SgBinaryOp), which are the root nodes of each specific SIMD operation (load/store/math). The type will be one of the class fields for the whole subtree (uint32, int32, float32, float64, ushort16, short16; check whether ROSE/REX has those types already; if not, we will need to introduce new type nodes for them).
2). Lowering of multiple-dimensional array reference to 1-dimensional array reference, SIMD-independent, standard scalar code
3). 3-address lowering: SIMD-independent, i.e. standard scalar code
4). SIMD transformation: transform the 3-address program to use SgSIMDLoad/Store/BinaryOp, including strip mining, strided memory load/store, mask handling, and indirect memory access situations. SIMD-dependent, but not architecture-dependent (AVX512, ARM, RISC-V V)
5). SIMD-architecture-dependent transformation: transform SIMDized AST to intrinsics of choice
The plan is to support AVX transformation of straight SIMD code (strip mining, load/store of consecutive/strided memory access) by the end of January. Handling masks and indirect memory load/store will follow.
Update of work (1/12)
Note: I put my new work on the patrick/simd2 branch (the original has my first IR).
1) The SgSIMDLoad/Store/BinaryOp base nodes are in. I really haven't tested them yet (I will when I start the SIMD transformation), but they seem to be in and everything builds. Rose/Rex has all the types (uint32, int32, float32, float64, ushort16, short16). With SgSIMDBinaryOp, I added sub-nodes for things like SgSIMDAddOp and so forth. SgSIMDLoad/Store doesn't inherit from SgPntrArrRefExp just yet; I'm still having some issues there.
2) Lowering a multi-dimensional array to 1-D is working. It only lowers 1- and 2-D arrays; do we need support for more, are there use cases for this? Also, is there a case where we would need to store back to a multi-dimensional address?
3) 3-address lowering is working. I redid my approach from the first few times so it works a lot better. Originally, I used a stack for everything... this time I use recursion to traverse the tree, and a stack to track variable names and produce the 3-address form. I tested with int, float, and double types (it is easy to expand for more).
For example, for this code:
void foo (int n, float *a, float** b)
{
float x = 2.23;
for (int i=0; i<n; i++) {
#pragma omp simd
for (int j=0; j<n; j++) {
a[i] = b[i][j] * a[i] + x;
}
}
}
This is produced:
void foo(int n,float *a,float **b)
{
int j;
float x = 2.23;
for (int i = 0; i < n; i++) {
#pragma omp simd
for (j = 0; j <= n - 1; j += 1) {
float __vec0;
__vec0 = b[i * n + j];
float __vec1;
__vec1 = a[i];
float __vec2;
__vec2 = __vec1 * __vec0;
float __vec3;
__vec3 = x;
float __vec4;
__vec4 = __vec3 + __vec2;
a[i] = __vec4;
}
}
}
For int A[M][N][W]
A[i][j][k] ==> A[i*N*W + j*W + k] = A[(i*N + j)*W + k]
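As a small illustration of that formula (a hand-written sketch, not tool output):

```c
/* A[i][j][k] on the flattened 1-D view of int A[M][N][W] */
int load_element(const int *A, int N, int W, int i, int j, int k) {
    return A[(i * N + j) * W + k];   /* same as i*N*W + j*W + k */
}
```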
@yanyh15 I was thinking about the different solutions we discussed; I'm thinking that, for right now at least, the one outlined in the document will be best.
As far as implementation goes, we can use the SgExprListExp class directly as the right value of our binary op. As I understand it, it's basically a wrapper around the standard C++ vector. When we do transformations, it's easy to check the size, and we don't have to perform any multi-level tree traversals. All the infrastructure is already there, so this should be easy to implement.
I don't think we should pursue our idea of creating a new class with multiple sub-values (basically a new version of SgBinaryOp, only with three operands). I think this would be a little complex, and it creates a special case: if we have more than two source operands, such as with FMA operations, we would have to create new classes for each.
Any other methods would probably get us into the territory of creating a new custom IR for this. Perhaps long term this may be an interesting idea (it may allow for more optimizations), but I can understand why this would not be a preferred solution at the moment. I don't think a separate IR would restrict other users; the question might be more of how to expose it.
SgSIMDAddOp/SgSIMDSubOp/SgSIMDMulOp/SgSIMDDivOp/SgSIMDFMAOp inherit SgBinaryOp, with the right node being an SgExprListExp. The number of operands depends on how many valid expressions are in the SgExprListExp. The left node is an SgExpr.
SgSIMDLoad/SgSIMDStore inherit SgBinaryOp (or SgAssignOp), with the source for load and the destination for store being an SgPntrArrRefExp.
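A minimal sketch of how a transformation pass could consume that layout (the SgSIMD* classes are the proposed new nodes, not existing ROSE classes; the function here is hypothetical):

```cpp
#include "rose.h"

void handleSimdMathOp(SgBinaryOp *op) {
    // The right-hand side is the SgExprListExp holding the source operands.
    SgExprListExp *rhs = isSgExprListExp(op->get_rhs_operand());
    if (rhs == NULL)
        return;
    const SgExpressionPtrList &operands = rhs->get_expressions();
    if (operands.size() == 2) {
        // ordinary two-operand math (add/sub/mul/div)
    } else if (operands.size() == 3) {
        // three source operands, e.g. an FMA
    }
    // The left node (op->get_lhs_operand()) is the destination expression.
}
```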
How do intrinsics support masks, strided memory references, and indirect memory references? Do we need dedicated nodes to support those kinds of operations? Probably we will need nodes for them.
| | mask/predicate | strided memory reference | indirect memory reference |
| --- | --- | --- | --- |
| AVX512 | | | |
| ARM | | | |
| RISC-V | | | |
Transformation for the straightforward cases with strip mining (with mask). Use C++ templates to handle the element type when transforming the same kind of node.
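A minimal sketch of the template idea (the traits struct and names are made up): one transformation routine per node kind, instantiated per element type, with a small traits class supplying the type-specific vector width and intrinsic spelling.

```cpp
#include <string>

template <typename T> struct AvxTraits;          // specialized per element type

template <> struct AvxTraits<float> {
    static const int lanes = 8;                  // __m256 holds 8 floats
    static std::string addIntrinsic() { return "_mm256_add_ps"; }
};

template <> struct AvxTraits<double> {
    static const int lanes = 4;                  // __m256d holds 4 doubles
    static std::string addIntrinsic() { return "_mm256_add_pd"; }
};

template <typename T>
void transformAddNode(/* SgSIMDAddOp *node */) {
    // strip-mine by AvxTraits<T>::lanes and emit a call to AvxTraits<T>::addIntrinsic()
}
```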
The OpenMP SIMD version of dense MM, which is used for testing the strided load/store, and the SIMD version of sparse MV, which is used for indirect load/store, are in https://github.com/passlab/Benchmarks/tree/master/vectorization. They are very similar to (25 and 26) https://passlab.github.io/CSCE513/notes/lecture20_DLP_Vector.pdf#page=25. Please use jacobi from https://github.com/passlab/Benchmarks/tree/master/jacobi for the evaluation of collapse. Find the intrinsics in AVX512 and ARM SVE for strided load/store and indirect load/store.
SIMD transformation at the AST level is not the correct approach for a compiler. Vectorization needs to work on a mid/low-level IR, namely a 3-address IR. For example, we need the 3-address representation of
a[i] = b[i] + j + k;
as AST --> 3-address SSA:
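Roughly, that 3-address form would look like the following (same __vec naming as the earlier example; assuming float operands):

```c
float __vec0 = b[i];
float __vec1 = __vec0 + j;
float __vec2 = __vec1 + k;
a[i] = __vec2;
```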