tancheng / CGRA-Mapper

An LLVM pass that can generate CDFG and map the target loops onto a parameterizable CGRA.
BSD 3-Clause "New" or "Revised" License

Heuristic mapper produces worse results when the CGRA size increases #16

Closed KelvinChung2000 closed 7 months ago

KelvinChung2000 commented 7 months ago

I am trying to map the following kernel.

#include <string.h>
#include <stdio.h>

#define SIZE  20
int A[SIZE], B[SIZE], C[SIZE];

void external_fun() {C[0] = 7;}

__attribute__((noinline))
void array_cond() {
    int x;
    for (int i = 0; i < SIZE; i++) {
        if (A[i] > B[i]) {
            C[i] = A[i] - B[i];
            if (C[i] < 0)
                C[i] = 0;
            else
                C[i] = A[i] * B[i];
        } else {
            C[i] = A[i] + B[i];
            if (C[i] > 10)
                C[i] = 10;
            else
                C[i] = A[i] - B[i];
        }
    }
}

int main() {
    int i, j;

    for (i = 0; i < SIZE; i++) {
        A[i] = i * 2 + 5;
        B[i] = i * 3 + 8;
        C[i] = 0;
    }

    array_cond();

    for (i = 0; i < SIZE; i++) printf("%d\n", C[i]);

    return 0;
}

When the CGRA size is 4x4, the mapper produces a mapping with II=13. When the size is 8x8, the mapped II=19.

This is the configuration used:

{
  "kernel"                : "array_cond",
  "targetFunction"        : false,
  "targetNested"          : false,
  "targetLoopsID"         : [0],
  "doCGRAMapping"         : true,
  "row"                   : 8,
  "column"                : 8,
  "precisionAware"        : false,
  "heterogeneity"         : false,
  "isTrimmedDemo"         : true,
  "heuristicMapping"      : true,
  "parameterizableCGRA"   : false,
  "diagonalVectorization" : false,
  "bypassConstraint"      : 4,
  "isStaticElasticCGRA"   : false,
  "ctrlMemConstraint"     : 10,
  "regConstraint"         : 8,
  "optLatency"            : {
                              "load" : 2,
                              "store": 2
                            },
  "optPipelined"          : ["load", "store"],
  "additionalFunc"        : {
                              "load" : [],
                              "store": []
                            }
}
tancheng commented 7 months ago

Hi Kelvin, you can remove optLatency, optPipelined, and additionalFunc from the param.json, they are just used for evaluating specialized FUs.

Unfortunately, we still get II=12 on 4x4, and we may get an even worse mapping on 8x8 (due to the larger exploration space for mapping). The reason is that we support control flow but haven't optimized it well. See the generated DFG below; the DFG itself looks fine/reasonable to me. Due to the if/else the kernel contains, the recurrence dependence cycle length is 12, which means the mapping cannot achieve an II lower than 12. If you look into the generated kernel.ll (by modifying dot.sh), you can see why there are so many blue arrows connecting the br nodes with other nodes (let me know if you need further explanation).
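For context (this is standard modulo-scheduling theory, not specific to CGRA-Mapper): the recurrence-constrained minimum II is

```latex
\mathrm{RecMII} = \max_{c \,\in\, \text{cycles}} \left\lceil \frac{\mathrm{delay}(c)}{\mathrm{distance}(c)} \right\rceil
```

so a dependence cycle with total delay 12 and (assuming) a loop-carried distance of 1 forces II >= 12 regardless of how many PEs the CGRA provides.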

So in conclusion, the control flow is not well optimized, which leads to "bad" performance. This could potentially be improved by manipulating the DFG with appropriate hardware support.

[attached DFG image: kernel]
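One source-level workaround (a sketch, not part of the repo) is to rewrite the kernel branch-free, replacing each if/else with a ternary expression so the compiler can emit select instructions instead of br, which removes the control-dependence edges from the DFG. The function name `array_cond_select` is hypothetical:

```c
#define SIZE 20
static int A[SIZE], B[SIZE], C[SIZE];

/* Hypothetical branch-free rewrite of array_cond(): every if/else is
 * folded into a ternary, so the compiler can lower the body to select
 * instructions rather than br, shortening the dependence cycles. */
static void array_cond_select(void) {
    for (int i = 0; i < SIZE; i++) {
        int diff = A[i] - B[i];
        int sum  = A[i] + B[i];
        int prod = A[i] * B[i];
        /* A[i] > B[i] path: C = A-B, clamped to 0, else A*B */
        int then_val = (diff < 0) ? 0 : prod;
        /* A[i] <= B[i] path: C = A+B, capped at 10, else A-B */
        int else_val = (sum > 10) ? 10 : diff;
        C[i] = (A[i] > B[i]) ? then_val : else_val;
    }
}
```

Whether this actually shortens the recurrence cycle depends on how the front end lowers the ternaries and on the FU support for select on the target CGRA.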