charliechou1001 closed this issue 2 years ago.
Hi there! I think that you are seeing a penalty from having a high degree of parallelism relative to the tile size. This happens because there are explicit initialization and draining phases, which become larger relative to the computation phase as you increase parallelism (Amdahl's law). The result in the paper is obtained from running on a significantly larger tile size of 1904x1920.
If you have the option to compute larger matrices, and thereby allow larger tile sizes, then this should improve your results. If not, I'm afraid there's no easy way to circumvent this bottleneck without significantly changing the architecture, as it is an artifact of using the 1D systolic array.
Thanks, I got it. Since it's a 1D systolic array, fewer PEs (a shorter systolic array) will decrease the initialization time, as well as the latency of one inner tile computation. From my exploration, I found that halving the number of PEs and doubling the parallelism inside each PE can significantly decrease the overall latency. So it seems that fewer PEs with more parallelism inside each PE can improve performance. But in Tab. 2 you use a relatively small parallelism number (yc), varying from 8 to 32; is that due to the bandwidth limitation?
In addition, you said "MM_PARALLELISM_M should be set to a maximum of 64 bytes / sizeof(…)".
> Since it's a 1D systolic array, fewer PEs (a shorter systolic array) will decrease the initialization time, as well as the latency of one inner tile computation.
Yes, the initialization phase (populating the buffer of A) and the draining phase (writing out the tile of C) will take more cycles for a longer systolic array.
> Is the number 64 bytes also related to the DDR bandwidth?
Yes, exactly. MM_PARALLELISM_M is the "horizontal" parallelism, also known as vectorization/SIMD-style parallelism. Increasing it makes the data path through the design wider, and thus increases the width of all buses in the design. It is limited to 512 bits, since this is the maximum we can read from an AXI master interface in a single cycle. This corresponds to 8x 64-bit numbers (e.g., double), 16x 32-bit numbers (e.g., float), 32x 16-bit numbers (e.g., half), etc.
Hi, I followed your advice and did some experiments with different memory tile sizes and matrix size configurations, but the results are confusing:
First, I changed each dimension of the matrix size (size_m, size_k, size_n). When size_k is bigger, the performance is worse than the baseline result (size_m = size_k = size_n = 512), which contradicts Fig. 8 in the paper (though I think the paper's result is more reasonable). Also, when I change size_n and size_m, the results are not equal, even though the two parameters are dual. Second, I did as you said and increased the matrix size, and I also increased the memory tile size, but both synthesis results are much worse than the baseline result.
These results really confuse me. Here are the setting details and experiment results; is there anything wrong with my settings? Thank you!
I'm also a bit confused by this. Did you actually run these in hardware, or is this just based on reports from HLS? I suspect that the tool does not accurately predict the runtime. I also suggest running with MM_DYNAMIC_SIZES=ON so you can run different matrix sizes using the same bitstream.
Maybe this is one of the reasons; these are HLS report results. I also discussed this problem with my workmates, and they provided me with several suggestions:
Also, MM_DYNAMIC_SIZES is used to set the dynamic sizes of the three outermost for-loops; what is its effect on synthesis when set to ON?
If the tile sizes are reasonably large, you really shouldn't see any bandwidth problems. And more importantly, the HLS tool will not detect this. So I don't think this is your problem!
> And MM_DYNAMIC_SIZES is used to set the dynamic sizes of the three outermost for-loops; what is its effect on synthesis when set to ON?
MM_DYNAMIC_SIZES=ON means that you can run it on any size of matrix. The reason it only touches the outer loop bounds in the code is that the kernel will always compute a full tile (even if not all results are needed). It will ignore all contributions that are not in bounds.
Let me know what results you see in hardware!
Hi, I made the kernel run on U250. Here I have two questions:
And does the "chiplet" you mentioned in Fig. 7 of the paper correspond to an SLR?
> And does the "chiplet" you mentioned in Fig. 7 of the paper correspond to an SLR?
Yes, chiplet refers to SLRs.
- When I synthesize half-precision GEMM, there is no problem during "make hw", but when I run ./RunHardware, it shows "./RunHardware.exe: error while loading shared libraries: libgmp.so.7: cannot open shared object file: No such file or directory". Have you encountered similar problems before?
Ah, that's annoying. It must mean that the rpath of the executable is not set properly. You should be able to work around this by adding the directory of the Xilinx floating-point libraries to your LD_LIBRARY_PATH when you run the executable, e.g.:
export LD_LIBRARY_PATH=/opt/Xilinx/Vitis_HLS/2021.2/lnx64/tools/fpo_v7_0
- I synthesized successfully using float precision with the default settings in the GitHub CMakeLists.txt. I found that the utilization of one single SLR is full, but the total utilization is below 20%. The performance for a 512x512x512 GEMM (dynamic sizes ON) is 128.048 GOp/s, and as the matrix size increases, the performance also improves. But when I increase MM_PARALLELISM_M or MM_PARALLELISM_N, it reports routing congestion during the link stage. In that case, how can I adjust these parameters to increase the utilization across multiple SLRs?
Unfortunately there is no easy answer to this question, since the routing is quite chaotic. For the paper, I determined the highest achievable parallelism empirically. You can try the parameters that I used in the paper. Potentially even higher parallelism should be possible on the U250 :-) I recommend not making the vector parallelism (MM_PARALLELISM_M) too high; keep it at 128- or 256-bit width (e.g., 8 for float, 16 for half), since wider vectors make routing harder.
Ideally the code should explicitly handle the mapping across multiple SLRs, but I have not implemented this.
Thanks. But when I try to compile the uint8 GEMM with the parameters from the paper, it reports an error like this:
And my CMakeLists file is here: CMakeLists.txt
This is a very large example to run in simulation. Simulation is quite slow compared to running in hardware. Could it be that it is simply taking an extremely long time? Try to see if it finishes for a smaller example in simulation, or run it in hardware.
Hi, I synthesized the GEMM with various data types, as well as my own data type, and the performance is really good. When I run cmake, it sometimes reports the warning at https://github.com/spcl/gemm_hls/blob/7c790eb660aca2754ea8026fb2e911c47565ef8c/CMakeLists.txt#L57. Can this warning be ignored? I noticed that the uint data type configurations in Table 2 ignored this warning, while the FP data type configurations obeyed it.
The warning means that you might not get full throughput, because the component that needs to feed values of A cannot operate as fast as the systolic array :-) You can make it go away by increasing the tile size further, if that's possible for you.
Hi, I have a question again. I'm trying to find the hardware resource consumption report. Which report shows the hardware utilization for the GEMM? I found many reports in the report/link/imp/ directory. Are impl_Kernel_utl and impl_full_util the GEMM kernel and full-system hardware utilization reports, respectively? Thanks!
I would look at impl_1_full_util_routed.rpt :-)
And where can I get the frequency value?
> And where can I get the frequency value?
If the path to your compiled xclbin file is foo.xclbin, then:
xclbinutil --info --input foo.xclbin
Thanks!
In addition, is there any way to get the power of the Alveo card while running the code?
I don't think there's currently a programmatic way. You would have to use their CLI: https://xilinx.github.io/Alveo-Cards/master/debugging/build/html/docs/common-steps.html#monitor-card-power-and-temperature
Hi Johannes, thanks a lot for your help answering the gemm_hls and hlslib project questions in recent days. I have a question on the parameter configuration for the GEMM. I explored the GEMM parameters, but my results are far from those in Tab. 2 of the paper, so my questions are:
1) What are your detailed parameter configurations for Tab. 2?
2) From my understanding, xb and yb in the paper are kInnerTilesN and kInnerTilesM in the code, xm and ym are OuterTilesN (size_n) and OuterTilesM (size_m) in the code, xp is the number of PEs, and yc is the number of parallel MAC units in each PE. But what is the meaning of xbxm and ybym in Tab. 2?
3) How is the performance in Tab. 2 calculated? Does the latency used for calculating performance cover the whole processing period, including data read and write-back?
These are my questions, and I look forward to your reply.
P.S. Thanks very much for the project. It saved me a lot of time implementing my own project, and the code is really nice. I like it!