charliechou1001 closed this issue 2 years ago.
Hi there! I think that you are seeing a penalty from having a high degree of parallelism relative to the tile size. This happens because there are explicit initialization and draining phases, which become larger relative to the computation phase as you increase parallelism (Amdahl's law). The result in the paper is obtained from running on a significantly larger tile size of 1904x1920.
If you have the option to compute larger matrices, and thereby allow larger tile sizes, then this should improve your results. If not, I'm afraid there's no easy way to circumvent this bottleneck without significantly changing the architecture, as it is an artifact of using the 1D systolic array.
Thanks, I got it. Since it's a 1D systolic array, fewer PEs (a shorter systolic array) will decrease the initialization time, as well as the latency of one inner tile computation. From my exploration, I found that halving the number of PEs and doubling the parallelism inside each PE can significantly decrease the overall latency. So it seems that fewer PEs with more parallelism inside each PE can improve performance. But in Tab. 2 you use a relatively small parallelism number (yc), varying from 8 to 32; is that due to the bandwidth limitation?
In addition, you said "MM_PARALLELISM_M should be set to a maximum of 64 bytes / sizeof(…)".
> Since it's a 1D systolic array, fewer PEs (a shorter systolic array) will decrease the initialization time, as well as the latency of one inner tile computation.
Yes, the initialization phase (populating the buffer of A) and the draining phase (writing out the tile of C) will take more cycles for a longer systolic array.
> Is the number 64 bytes also related to the DDR bandwidth?
Yes, exactly. MM_PARALLELISM_M is the "horizontal" parallelism, also known as vectorization/SIMD-style parallelism. Increasing it makes the data path through the design wider, and thus increases the width of all buses in the design. It is limited to 512 bits, since this is the maximum we can read from an AXI master interface in a single cycle. This corresponds to 8x 64-bit numbers (e.g., double), 16x 32-bit numbers (e.g., float), 32x 16-bit numbers (e.g., half), etc.
Hi, I followed your advice and did some experiments with different memory tile sizes and matrix size configurations, but the results are confusing:
First, I changed each dimension of the matrix size (size_m, size_k, size_n). When size_k is bigger, the performance is worse than the baseline result (size_m = size_k = size_n = 512), which contradicts Fig. 8 in the paper (though I think the paper's result is more reasonable). Also, when I change size_n and size_m, the results are not equal, even though the two parameters are dual. Second, I did as you said and increased the matrix size, and I also increased the memory tile size, but both synthesis results are much worse than the baseline result.
These results really confuse me. Here are the setting details and experiment results; is there anything wrong with my settings? Thank you!
I'm also a bit confused by this. Did you actually run these in hardware, or is this just based on reports from HLS? I suspect that the tool does not accurately predict the runtime. I also suggest running with MM_DYNAMIC_SIZES=ON so you can run different matrix sizes using the same bitstream.
Maybe this is one of the reasons; these are HLS report results. I also discussed this problem with my workmates, and they provided me with several suggestions:
Also, MM_DYNAMIC_SIZES is used to set the dynamic sizes of the three outermost for-loops; what is its effect on synthesis when set to ON?
If the tile sizes are reasonably large, you really shouldn't see any bandwidth problems. And more importantly, the HLS tool will not detect this. So I don't think this is your problem!
> And MM_DYNAMIC_SIZES is used to set the dynamic sizes of the three outermost for-loops; what is its effect on synthesis when set to ON?
MM_DYNAMIC_SIZES=ON means that you can run it on any size of matrix. The reason it only touches the outer loop bounds in the code is that the kernel will always compute a full tile (even if not all results are needed). It will ignore all contributions that are not in bounds.
Let me know what results you see in hardware!
Hi, I made the kernel run on U250. Here I have two questions:
And does the "chiplet" you mentioned in Fig. 7 of the paper correspond to an SLR?
> And does the "chiplet" you mentioned in Fig. 7 of the paper correspond to an SLR?
Yes, chiplet refers to SLRs.
- When I synthesize half-precision GEMM, there is no problem during "make hw", but when I run ./RunHardware, it shows "./RunHardware.exe: error while loading shared libraries: libgmp.so.7: cannot open shared object file: No such file or directory". Have you encountered similar problems before?
Ah, that's annoying. It must mean that the rpath of the executable is not set properly. You should be able to work around this by adding the directory of the Xilinx floating-point libraries to your LD_LIBRARY_PATH when you run the executable, e.g.:
export LD_LIBRARY_PATH=/opt/Xilinx/Vitis_HLS/2021.2/lnx64/tools/fpo_v7_0
- I synthesized successfully using float precision with the default settings in the GitHub CMakeLists.txt. I found that the utilization of one single SLR is full, but the total utilization is below 20%. The performance for a 512x512x512 GEMM (dynamic sizes ON) is 128.048 GOp/s, and as the matrix size increases, the performance also improves. But when I increase MM_PARALLELISM_M or MM_PARALLELISM_N, it reports routing congestion during the link stage. In that case, how can I adjust these parameters to increase the utilization across multiple SLRs?
Unfortunately there is no easy answer to this question, since the routing is quite chaotic. For the paper, I determined the highest achievable parallelism empirically. You can try the parameters that I used in the paper. Potentially even higher parallelism should be possible on the U250 :-) I recommend not making the vector parallelism (MM_PARALLELISM_M) too high; keep it at 128- or 256-bit width (e.g., 8 for float, 16 for half), since wider vectors make routing harder.
Ideally the code should explicitly handle the mapping across multiple SLRs, but I have not implemented this.
Thanks. But when I try to compile the uint8 GEMM with the parameters from the paper, it reports an error like this:
And my CMakeLists file is here: CMakeLists.txt
This is a very large example to run in simulation. Simulation is quite slow compared to running in hardware. Could it be that it is simply taking an extremely long time? Try to see if it finishes for a smaller example in simulation, or run it in hardware.
Hi, I synthesized the GEMM with various data types, as well as my own data type, and the performance is really good. When I run cmake, it sometimes reports the warning at https://github.com/spcl/gemm_hls/blob/7c790eb660aca2754ea8026fb2e911c47565ef8c/CMakeLists.txt#L57. Can this warning be ignored? I noticed that the uint data type configurations in Table 2 ignored this warning, while the FP data type configurations obeyed it.
The warning means that you might not get full throughput, because the component that needs to feed values of A cannot operate as fast as the systolic array :-) You can make it go away by increasing the tile size further, if that's possible for you.
Hi, I have a question again. I'm trying to find the hardware resource consumption report. Which report shows the hardware utilization for the GEMM? I found many reports in the report/link/imp/ directory. Are impl_Kernel_utl and impl_full_util the GEMM kernel and full-system hardware utilization reports, respectively? Thanks!
I would look at impl_1_full_util_routed.rpt :-)
And where can I get the frequency value?
> And where can I get the frequency value?
If the path to your compiled xclbin file is foo.xclbin, then:
xclbinutil --info --input foo.xclbin
Thanks!
In addition, is there any way to get the power of the Alveo card while running the code?
I don't think there's currently a programmatic way. You would have to use their CLI: https://xilinx.github.io/Alveo-Cards/master/debugging/build/html/docs/common-steps.html#monitor-card-power-and-temperature
Hi Johannes, thanks a lot for your help answering the gemm_hls and hlslib project questions in recent days. I have a question on the parameter configuration for the GEMM. I explored the GEMM parameters, but my results are far from those in Tab. 2 of the paper, so my questions are:
1) What are your detailed parameter configurations for Tab. 2?
2) From my understanding, xb and yb in the paper are kInnerTilesN and kInnerTilesM in the code, xm and ym are OuterTilesN (size_n) and OuterTilesM (size_m) in the code, xp is the number of PEs, and yc is the number of parallel MAC units in each PE. But what is the meaning of xbxm and ybym in Tab. 2?
3) How is the performance in Tab. 2 calculated? Does the latency used for calculating performance cover the whole processing period, including data read and write-back?
These are my questions, and I look forward to your reply.
P.S. Thanks very much for the project. It saved me a lot of time implementing my own project, and the code is really nice. I like it!