OpenMP: SIMD: simd Construct

passlab / rexompiler

REX OpenMP Compiler

https://passlab.github.io/rexompiler/

Other

1 stars 1 forks source link

OpenMP: SIMD: simd Construct #54

Open yanyh15 opened 4 years ago

yanyh15 commented 3 years ago

SIMD transformation in the AST level is not the correct approach to do for compiler. vectorizations needs to work on mid/low-level IR, namely 3-address IR for the vectorization. For example, we need 3-address representation of

a[i]=b[i]+j+k; as

tmp0 = B[i];  //load
tmp1 = tmp0 + j; //add
tmp2 = tmp1 + k;//add
A[i] = tmp[2];//store

Check compiler textbook or general info about how compiler perform vectorization. E.g. https://en.wikipedia.org/wiki/Automatic_vectorization
check LLVM SIMD implementation to see whether they implement it in Clang or LLVM. E.g. https://llvm.org/docs/Vectorizers.html, https://llvm.org/docs/Proposals/VectorizationPlan.html. There may be other resources, please google. Google Xinmin Tian's paper/presentation about OpenMP SIMD implementation (https://llvm.org/devmtg/2016-11/Slides/Saito-NextLevelLLVMLoopVectorizer.pdf, https://www.youtube.com/watch?v=XXAvdUwO7kQ, https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/Vectorize/LoopVectorize.cpp).
A class project shows the basic steps of vectorization, https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/llvm-autovec/. For OpenMP, not legality check is needed.
Since we are using intrinsics, which could be easier than instructions. Check both x86 AVX and ARM SVE intrinsics to see what kind of operands they are expecting.
Thinking of different solutions of combining compiler transformation with macro definition of multiple intrinsic. e.g. D = a+b+c; ==> define a macro that use two intrinsic

AST --> 3-address SSA:

google "AST to 3-address SSA", some info http://www.cs.columbia.edu/~aho/cs4115/Lectures/15-03-25.html

pflynn157 commented 3 years ago

Some info so far:

Intel intrinsics (probably the best guide; also, Intel manual really good): https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2

Vectorization examples (mainly AVX/few SVE): https://github.com/patrickf2000/vectorization-examples

The algorithm (From 11/25 meeting, despite later posting):

For translating vector expression to AST

Intel intrinsics:

Think of them like a RISC-machine; although they often translate to regular CISC instructions, the intrinsics are only load-store
Intrinsic variables often translate directly to vector registers

AST elements:

Arrays- Loaded based on array name and index -> These go inside the loop body
Variables and constants- These can be saved to a vector register using the broadcast instructions -> This should be done just before the loop body

ROSE parses expressions in postfix form; however, it comes up reversed, so we will need to parse backwards

The for-loop increment will need to be changed; on AVX-2, this will be changed to 8

Algorithm: 1) Break up variable expression into the lvalue and rvalue (the lvalue is the store, all else is load and operations) [THIS PART DONE] 2) Maintain a stack of strings 3) Loop through the expression

If node is a constant or non-pointer variable: -> Create a broadcast instruction above the for loop -> The result of the broadcast is saved to a vector variable -> Push this variable to the stack
If node is a pointer: -> Create load in loop -> The result of the load is saved to a vector variable -> Push variable name to the stack
If node is an operator: -> Pop two operands (variable names) from the stack -> Use these as arguments to the corresponding vector function (add, sub, mul, div, etc) -> The result is is saved to a vector variable -> Push variable to the stack 4) Store result
At the end we should only have one variable in the stack
Use a store instruction to save it to the LVALUE

Example:

for (int i = 0; i<512; i++) {
    result[i] = arr1[i] + arr2[i] * x;
}

Using the algorithm:

__m256 _const1 = _mm256_broadcast_ss(&x);

for (int i = 0; i < 512; i += 8) {
    __m256 _vec1 = _mm256_loadu_ps(&arr2[i]);
    __m256 _result1 = _mm256_mul_ps(_vec1, _const1);
    __m256 _vec2 = _mm256_loadu_ps(&arr1[i]);
    __m256 _result2 = _mm256_add_ps(_result1, _vec2);

    __m256_storeu_ps(&result[i], _result2);
}

After this is working:

Type detection- is it a float, double, or int?
Maybe the partial sums method: https://github.com/patrickf2000/vectorization-examples/blob/master/intel-avx2/partial-sum/sum_partial2.asm -> This would be more advanced as it would require changing the for loop to a while loop and adding if-statements, but reduction even with non-sum opertions is fast, provided that the assembly comes up kind of like this
Nested loops

yanyh15 commented 3 years ago

Implement as two passes: one for AST->3Address IR transformation, and the other for vectorization. We will not do SSA form.

1. AST->3Address AST:

expression,
direct array references (including multiple dimensional array) using standard vector load/store
```
   for (i ...)
       for (j ....)
              A[i][j] = B[j][i]; 
```
direct array reference with stride using strided load/store

Indirect array reference A[B[i]], use case (spare matrix), with gather and scatter

   for (i = 0; i < n; i=i+1)
         A[K[i]] = A[K[i]] + C[M[i]];

A[K[i]] ==> 3Address

        tmp1 = K[i];
        tmp2 = A[tmp1]; (option 1)

Need to store some attribute info for this two AST to indicate that these two 3-address statements are for A[K[i]] such that gather/scatter can be used for the transformation.

if/else, if expression needs to be transformed to 3Address AST and then if will just test a single variable.
while loop within a simd loop (not supported yet unless there is use case for it)
Other cases such as using struct field,
function call with regards to declare simd. Inlining is likely the approach for handling function call with vectorization. check https://community.intel.com/t5/Software-Archive/Vectorization-of-a-function-call/td-p/1007791 and some other resources.

1. Vectorization

Loop must be Canonical Loop Nest Form.
I am not sure whether canonical loop form checking is done or not, it is applicable for both simd and for.

yanyh15 commented 3 years ago

@patrickf2000 I want to note one thing before I forget. intrinsics are or might be compiler specific. If the Intel intrinsics are implemented as just header file/macro, then the SIMD-ized code can be compiled by other compiler such as gcc or clang/llvm. If the intrinsics are implemented specifically for the intel compiler, the SIMD-ized code can only be compiled by the intel compiler.

Please check both intel intrinsics and GCC builtin/intrinsics (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). I am not sure the term GCC uses today (builtin or intrinsics).

pflynn157 commented 3 years ago

@yanyh15 I remember researching that, I have all three compilers on my machine so I checked the headers again.

The SIMD types are typedefs to stuff that may be compiler-specific. It actually looks the same across them so maybe this attribute is actually the same across compilers (its basically an attribute like what GCC implements in your link). Internally, they seem to be compiler specific, but from a source code perspective, they are the same across compilers. The C/C++ AVX intrinsics are actually specified by Intel; if you look through the Intel manual, they are all there. But basically, we can safely use them in our code, and it will be portable across compilers and they are not dependent on any runtimes.

For our purposes, we can just include the proper headers (I got this done) and we can just use the AVX types. This was the part I got stuck on, but this is more of a Rose thing (and probably more of me not knowing the situation).

Hopefully this makes sense

yanyh15 commented 3 years ago

Thank you Patrick. Let us use intel intrinsic s as starting point, at the same time note the following things:

There is vector-FMA such as https://en.wikichip.org/wiki/x86/avx512_vnni, and https://en.wikichip.org/wiki/x86/avx512_bf16. Intel markets them for neural network instruction, they are essentially vector version of FMA in my opinion yet handling reduced-precision data type. Those instructions/intrinsics have 4 registers for their operands (maybe even 5 or more operands), and 4-address statement IR is needed. So in the lowering of AST to x-address IR, look for those FMA pattern and mark them in the AST attribute. They can then be vectorized using vector-FMA instructions.
Compare AVX-512 with ARM SVE, and make your transformation as neural as possible between AVX512 and SVE intrinsics, such that you can reuse most of the code for lowering and simdization.
Intel introduced Advanced Matrix Extension (AMX) recently, which seems like systolic array. IBM power something called MMA (which seems vector extension of FMA?). I am still reading their details to see how this can be used for transformation target. Will see.
This year LCPC keynote talk presented the use of compiler analysis to find code pattern that can be accelerated via vendor's high performance library, e.g for mm, it can be accelerated using intel MKL or cuda cublas. That is something for us to be aware of for choosing a good target for a code region.

Thank you Yonghong

On Mon, Nov 30, 2020, 7:32 PM Patrick Flynn notifications@github.com wrote:

@yanyh15 https://github.com/yanyh15 I remember researching that, I have all three compilers on my machine so I checked the headers again.

The SIMD types are typedefs to stuff that may be compiler-specific. It actually looks the same across them so maybe this attribute is actually the same across compilers (its basically an attribute like what GCC implements in your link). Internally, they seem to be compiler specific, but from a source code perspective, they are the same across compilers. The C/C++ AVX intrinsics are actually specified by Intel; if you look through the Intel manual, they are all there. But basically, we can safely use them in our code, and it will be portable across compilers and they are not dependent on any runtimes.

For our purposes, we can just include the proper headers (I got this done) and we can just use the AVX types. This was the part I got stuck on, but this is more of a Rose thing (and probably more of me not knowing the situation).

Hopefully this makes sense

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/passlab/rexompiler/issues/54#issuecomment-736139044, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAUTULWROTPRR6OK57AM36TSSQ2SRANCNFSM4TLPGAMQ .

yanyh15 commented 3 years ago

The differences between classic Cray vector and Intel AVX512, and ARM SVE. SVE set the mask register (predicate) in each iteration of the vectorized loop (https://developer.arm.com/-/media/developer/products/software-tools/hpc/White%20papers/arm-scalable-vector-extensions-and-application-to-machine-learning.pdf). Depending on the hardware implementation of how the predicate mask is used, that checking may introduce extra cycles for executing each vector instruction. It is not necessnary to do that in the conventional strip-mining approach since the mask register is only used for the remindar of the vectorized loop iterations that are not fit in a full vector lane. Compare this approach with Cray Vector, Intel AVX 512 and RISC-V V extension. @patrickf2000 Intrinsics may give you different view of what the actual instruction is used, so please check the assembly and instruction manual.

For RISC-V related work: check http://hwacha.org/, google and spec.

pflynn157 commented 3 years ago

@yanyh15 Sorry for the delay in doing this, but here's my updates so far.

The SIMD IR is defined here: https://github.com/passlab/rexompiler/blob/patrick/simd/src/midend/programTransformation/ompLowering/omp_simd.h
Here are the ideas behind it:

The first two enums define data types and IR types, respectively. The IR can represent a C array, a vector type, and then all the vector operations such as vector load, scalar load, vector store, and math
The IR transformation happens in the omp_simd; this part is broken down into functions to handle loads, stores, and math. The algorithm still works like above
The big thing I'm really happy with, even though the IR is SSA, it doesn't translate into read-only variables like LLVM IR; during transformation, there is a string stack, which allows us to do the transformation and reuse variables as much as possible
Currently, the scalars are included within the loop body; however, when we do the final transformation, they will move out above the for-loop

For the IR, I've been debating on a change, I'd be curious to know your thoughts on this:

Currently, the IR is strongly typed -> there's a separate IR node for operation of each type (so there's a separate node for integer math, float math, etc...). The benefit of this is it reduces the amount of comparisons needed during final transformation, but the trade-off is that the IR will get pretty big
Would do you think about dropping the typed IR, and making the nodes only based on operation? So instead of integer add and float add, we only have "add", with an internal variable to hold the data type. The benefit of this is a simpler IR, but the trade-off is more checking based on variable type

For transforming the loop itself, I'm currently thinking that would be best to do just before the final generation because that will be platform depedent. For example, on x86, we would either change the loop increment to 8 or 16, depending on AVX-2 or AVX-512. On Arm, we have to use that special variable/function. I'm not sure about RISC-V yet, but I think they work similar to Arm. So basically, the overall order would be like this:

Generate IR
Any IR transforms we want to do (ie, convert a sequential add-multiply to fused-multiply-add)
Transform the for loop based on platform
Generate the final code (intrinsics)

Here's a screenshot of how it looks so far; I created a few functions to print out our IR so I can see what I'm doing... new_IR

yanyh15 commented 3 years ago

@patrickf2000 I spent some time to understand the design you have. My original thought is that we still use the existing SageIII IR for the vectorizable expressions that are three-addressed, and attach attribute in the those SageIII node. Attribute are additional info for an AST node that one can add without changing the AST node structure (check http://rosecompiler.org/uploads/ROSE-Tutorial.pdf). The IR you designed can be part of the attribute.

Another option is to introduce new SageIII node for vectorization by defining the IR node you created with SageIII (ROSETTA). We only need three nodes: load, store and math. (other can be added later such as shuffle, FMA). For math, which are binary operation that can be extended from SgBinaryOp or top-level SgExpression. Then we just need load/store node. Since load/store is architecture-terms, we can use names such as arrayRead/arrayWrite, which are more high-level language names. This is just naming.

In terms of where we store type for SIMD, giving that types are the same for source and destination operands in vector, we only need to store a single type for each 3-address expression or load/store node subtree (no need type info for each node of the expression/load/store subtree), e.g. store the type in the root node of 3-address expression/load/store such that we can still have the IR type-neutral. It seems to me that you are doing this way.

We need to sort out IR first for both the 3-address lowering and the generation of intrinsics call. I will be thinking of it along the day and update this thread.

yanyh15 commented 3 years ago

1). SgSIMDLoadn(off SgPntrArrRefExp), SgSIMDStore (off SgPntrArrRefExp) and SgSIMDBinaryOp(off SgBinaryOp), which are the root node of each specific SIMD operations (load/store/math). The type will be one of the class field for the whole subtree (uint32, int32, float32, float64, ushort16, short16, check whether ROSE/REX has those types already, if not we will need to introduce new type node for them).

2). Lowering of multiple-dimensional array reference to 1-dimensional array reference, SIMD-independent, standard scalar code

3). 3-address lowering: SIMD-independent, i.e. standard scalar code

4). SIMD transformation: transform the 3-address program to use SgSIMDLoad/Store/BinaryOp, including strip miningn and strided memory load/store, handing mask, indirect memory access situation, SIMD-dependent, but not architecture-dependent (AVX512, ARM, RISC-V V)

5). SIMD-architecture-dependent transformation: transform SIMDized AST to intrinsics of choice

The plan is to support AVX transformation of straight SIMD code (strip mining, load/store of consecutive/strided memory access) by the end of January. Handing mask and indirect memory load/store will be following

pflynn157 commented 3 years ago

Update of work (1/12)

Note: I put my new work on the patrick/simd2 branch (the original has my first IR).

1) The SgSIMDLoad/Store/BinaryOp base nodes are in. I really haven't tested them yet- I will when I start SIMD transformation- but they seem to be in and everything builds. Rose/Rex has all the types (uint32, int32, float32, float64, ushort16, short16). With the SgSIMDBinaryOp, I added sub-nodes for things like SgSIMDAddOp and so forth. The SIMDLoad/Store doesn't inherit from SgPntrArrRefExp just yet; I'm still having some issues here.

2) Lowering a multi-dimensional array to 1-D is working. It only lowers 1- and 2-D arrays; do we need support for more, are there use cases for this? Also, is there a case where we would need to store back to a multi-dimensional address?

3) 3-address lowering is working. I redid my approach from the first few times so it works a lot better. Originally, I used a stack for everything... this time I use recursion to traverse the tree, and a stack to track variable names and produce the 3-address form. I tested with int, float, and double types (it is easy to expand for more).

For example, for this code:

void foo (int n, float *a, float** b)
{
float x = 2.23;
  for (int i=0; i<n; i++) {
  #pragma omp simd
    for (int j=0; j<n; j++) {
        a[i] = b[i][j] * a[i] + x;
    }
  }
}

This is produced:

void foo(int n,float *a,float **b)
{
  int j;
  float x = 2.23;
  for (int i = 0; i < n; i++) {
#pragma omp simd 
    for (j = 0; j <= n - 1; j += 1) {
      float __vec0;
      __vec0 = b[i * n + j];
      float __vec1;
      __vec1 = a[i];
      float __vec2;
      __vec2 = __vec1 * __vec0;
      float __vec3;
      __vec3 = x;
      float __vec4;
      __vec4 = __vec3 + __vec2;
      a[i] = __vec4;
    }
  }
}

yanyh15 commented 3 years ago

For int A[M][N][W]

A[i][j][k] ==> A[iMN + j N + k] = A[(iM+j)* N + k]

pflynn157 commented 3 years ago

@yanyh15 I was thinking about the different solutions we discussed; I'm thinking for right now at least the one outlined in the document will be best.

As far as implementation goes, we can use the SgExprListExp class directly as the right value of our binaryOp. As I understand, its basically a wrapper around the standard C++ vector. When we do transformations, its easy to check the size, and we don't have to perform any multi-level tree traversals. All the infrastructure is already there, so this should be easy to implement.

I don't think we should pursue our idea of creating a new class with multiple sub-values (basically a new version of SgBinaryOp, only with three operands). I think this would be a little complex, and its creates a special case- if we have more than two sources operands, such as with FMA operations, we would have to create new classes for each.

Any other methods would probably get us into the territory of creating a new custom IR for this. Perhaps long term this may be an interesting idea- it may allow for more optimizations- but I can understand why this would not be a preferred solution at the moment. I don't think an separate IR would restrict other users, the question might be more of how to expose it.

yanyh15 commented 3 years ago

SgSIMDAddOp/SgSIMDSubOp/SgSIMDMulOp/SgSIMDDivOp/SgSIMDFMAOp which inherits SgBinaryOp with the right node to be SgExprListExp. The number of operands depends on the how many valid expressions are in the SgExprListExp. The left node is an SgExpr

SgSIMDLoad/SgSIMDStore inherits SgBinaryOp (or SgAssignOp), with source for load and dest for store to be SgPntrArrRefExp.

yanyh15 commented 3 years ago

How intrinsics support mask, strided memory reference, indirect memory reference? Do we need dedicated node to support those two kinds of operations? Probably we will need nodes for them.

| | mask/predicate | strided memory refernce | indirect memory reference | |AVX512 | ||| | ARM | ||| |RISC-V | |||

Transformation for straightforward cases with strip mining (with mask). Use C++ template for handling type of transforming the same node.

yanyh15 commented 3 years ago

: OpenMP SIMD version for dense mm which is used for testing the strided load/store, and SIMD version of sparse MV, which is used for indirect load/store are in https://github.com/passlab/Benchmarks/tree/master/vectorization. They are very similar to (25 and 26) https://passlab.github.io/CSCE513/notes/lecture20_DLP_Vector.pdf#page=25 Please use jacobi from https://github.com/passlab/Benchmarks/tree/master/jacobi for the evaluation of collapse. Find the instrincs in AVX512 and ARM SVX for strided load/store and indirect load/store