Single level should be a better choice for parameterization, but we need to take care of the data filling when constructing higher-precision Fixed-Point computation (and also of the sub-FU's data bandwidth).
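A minimal sketch of the data-filling concern (not the actual sub-FU RTL): composing a wide fixed-point multiply from narrow sub-multipliers means the partial products must be shifted ("filled") into the right bit slots. The function name and the 16-bit split are hypothetical, chosen only for illustration.

```python
MASK16 = (1 << 16) - 1

def mul32_from_16bit_subfus(a: int, b: int) -> int:
    """32x32 -> 64-bit product built from four 16x16 sub-products."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    # Each product below fits in one 16x16 sub-multiplier.
    p0 = a_lo * b_lo
    p1 = a_lo * b_hi
    p2 = a_hi * b_lo
    p3 = a_hi * b_hi
    # "Data filling": each partial product is shifted into its bit slot
    # before the final accumulation.
    return p0 + ((p1 + p2) << 16) + (p3 << 32)
```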
More than 4 can be recognized by the LLVM pass, but the HW implementation is only prototyped for at most 4 for now.
[x] Implement AllReduce unit.
[x] Check Bitcast functionality.
Bitcast is used to cast the type without changing any bits.
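As a quick illustration of bitcast semantics (reinterpret the same bits under a different type, no conversion), here is a minimal Python sketch using `struct`; the function name is made up for illustration:

```python
import struct

def bitcast_f32_to_u32(x: float) -> int:
    """Reinterpret the 32 bits of an FP32 value as an unsigned int,
    matching LLVM bitcast semantics: no bits change, only the type."""
    return struct.unpack('<I', struct.pack('<f', x))[0]
```

For example, `bitcast_f32_to_u32(1.0)` yields `0x3F800000`, the IEEE 754 encoding of 1.0.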
We don't need a precision converter if everything is in a Fixed-Point precision type.
[x] Check how the decimals are handled in Fixed- and Floating-Point FUs.
Fixed-Point is very straightforward: no matter where the point is set, addition and multiplication can be done in the binary format (the result's decimal place can be handled in software or afterward).
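A minimal sketch of this point: the hardware only ever sees plain integers, and the point position (Q8 here, chosen arbitrarily) only matters for the software-side rescaling. All names below are illustrative.

```python
FRAC = 8  # Q8 format: value = raw / 2**FRAC; any point position works the same way

def to_fix(x: float) -> int:
    """Encode a real value into the raw integer the FU operates on."""
    return round(x * (1 << FRAC))

def fix_add(a: int, b: int) -> int:
    return a + b              # plain integer add; point position is irrelevant

def fix_mul(a: int, b: int) -> int:
    return (a * b) >> FRAC    # plain integer multiply; rescale afterward
```

E.g., `fix_mul(to_fix(1.5), to_fix(2.25))` returns `to_fix(3.375)`: the multiplier itself never needs to know where the point is.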
Floating-Point computation is more complex, involving normalization, shifting, addition, reformatting, etc. Moreover, its format is not straightforward to handle for vectorized computation (i.e., 8/23 and 11/52 exponent/mantissa bits for FP32 and FP64, respectively).
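To make the field layout concrete, here is a small sketch that splits an FP32 value into its (1, 8, 23) sign/exponent/mantissa bit fields; the irregular field widths are what make lane-splitting awkward for SIMD. The function name is illustrative.

```python
import struct

def fp32_fields(x: float):
    """Decode an FP32 value into its (sign, exponent, mantissa) bit fields,
    with widths 1/8/23 per IEEE 754 single precision."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF          # 8-bit biased exponent
    mantissa = bits & ((1 << 23) - 1)       # 23-bit fraction
    return sign, exponent, mantissa
```

For example, `fp32_fields(1.0)` returns `(0, 127, 0)` (bias-127 exponent, implicit leading 1).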
In conclusion, there is no need to support vectorized FP computation with various precision types. We can have a scalar FP FU in the PE to guarantee accuracy if necessary. Instead, the Fixed-Point computation can be vectorized with support for different precisions.
[x] Generate SVerilog for VectorMulComboRTL.
[x] Check control-flow inside the SIMD context. Check out how the mask is used.
Masking might not be necessary for now, but the other instructions require vector support (e.g., select, phi, icmp, and, or, xor, etc.).
[ ] Can be done later.
[x] Generate SVerilog for FlexFu.
[x] CGRA fabric with heterogeneous bandwidth across different PEs.
[ ] This can actually be avoided, and it is not possible in PyMTL anyway (where every submodule must be uniform). Instead, we need to extend the tile's routing ports to 8 to enable the diagonal connections: the horizontal/vertical ports stay at scalar bandwidth while the diagonal ones carry vectorized bandwidth.
[x] Need to implement 8-direction channels.
[ ] Two papers.
[ ] Accelerator (ASPLOS'22).
[x] SIMD/scalar operation distribution across different kernels (needs to cover the entire function rather than just the loop body to include the reduce operation).
[x] A graph to demonstrate how vectorization outperforms unrolling.
[x] Get the accuracy of the vectorized version.
[x] Support heterogeneous vectorization in CGRA tiles.
[x] Enable mapping based on the types (i.e., scalar/simd) of tiles.
[x] Get the area overhead of the 4 types of CGRAs and the corresponding area/power efficiency. If open-source tools are not an option, we can estimate it based on tile stats. (Power requires other tools, which can be skipped for this paper.)
[x] Initial evaluation on kernels with 4x4 and 8x8 VRSA (16- or 32-bit fixed-point?).