spcl / gemm_hls

Scalable systolic array-based matrix-matrix multiplication implemented in Vivado HLS for Xilinx FPGAs.
BSD 3-Clause "New" or "Revised" License

Mismatch result? #27

Closed. xooxit closed this issue 10 months ago.

xooxit commented 2 years ago

Hi, I reproduced the project with the following commands:

mkdir build
cd build
cmake ../ -DMM_DATA_TYPE=float -DMM_PARALLELISM_N=32 -DMM_PARALLELISM_M=8 -DMM_MEMORY_TILE_SIZE_N=512 -DMM_MEMORY_TILE_SIZE_M=512 -DMM_ADD_RESOURCE=FAddSub_nodsp -DMM_MULT_RESOURCE=FMul_nodsp
make
make hw

Then I ran it like this:

./RunHardware.exe 1024 1024 1024 hw

and got a mismatch like this:

Verifying result...
Mismatch at (0, 0): 30790.4 vs. 30340

I also tried increasing the mismatch threshold (from 1e-03 to 1e-02) and printed all remaining mismatches:

Mismatch at (258, 158): 32371.9 vs. 29222.8
Mismatch at (258, 348): 33150 vs. 30126.4
Mismatch at (258, 410): 33577.4 vs. 30521.6
Mismatch at (258, 690): 32677.6 vs. 29691.8
Mismatch at (571, 31): 32113.1 vs. 29157.5
Mismatch at (571, 72): 32030.5 vs. 29066.1
Mismatch at (571, 167): 30717.5 vs. 27857.2
Mismatch at (571, 386): 32130.5 vs. 29166.1
Mismatch at (571, 414): 32495.7 vs. 29537.6
Mismatch at (571, 419): 32113.1 vs. 29166.3
Mismatch at (571, 603): 32465.7 vs. 29466.6
Mismatch at (571, 675): 32653.2 vs. 29656.1
Mismatch at (571, 962): 32408.8 vs. 29393.5
Mismatch at (643, 457): 28775.7 vs. 32123.4
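
For reference, the comparison I adjusted is conceptually like the sketch below. This is a simplified illustration, not the exact code in RunHardware.cpp; the function and variable names are placeholders, and the tolerance argument corresponds to the 1e-03 value I changed.

#include <cmath>
#include <cstdio>
#include <vector>

// Simplified sketch of a relative-tolerance check between the FPGA result
// and the host reference (not the exact RunHardware.cpp code).
bool Verify(const std::vector<float> &test, const std::vector<float> &ref,
            int rows, int cols, float tolerance) {
  bool ok = true;
  for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
      const float testVal = test[i * cols + j];
      const float refVal = ref[i * cols + j];
      // Relative difference; assumes refVal is nonzero, which holds for
      // the values printed above.
      if (std::abs(testVal - refVal) / std::abs(refVal) > tolerance) {
        std::printf("Mismatch at (%d, %d): %g vs. %g\n", i, j, testVal, refVal);
        ok = false;
      }
    }
  }
  return ok;
}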

My Vitis version is 2021.2, my XRT version is 2.12.427, and the platform is xilinx_u250_gen3x16_xdma_3_1_202020_1.

Btw, I learned a lot from it. Thanks for the nice work.

definelicht commented 2 years ago

Hey there! I have not seen this, but I also haven't tried running the kernel in Vitis 2021.2, because there's a serious performance issue in the memory reader code that suddenly popped up, which I haven't figured out how to fix.

Can you check whether your kernel has this elevated II because of aSplit, in case it's related?

Does it pass in simulation? You can run a really small matrix so it doesn't take too long.

Did you try running any other configurations? Did they succeed/fail?

Unfortunately I don't have much time to maintain this these days, since I'm no longer affiliated with the university, so I would appreciate as much help as you can give me to figure out what the issue could be :-)

xooxit commented 2 years ago

Hi - good luck wherever you go!

By the way, I checked aSplit in memory.cpp; every related II is set to 1, but the v++_MatrixMultiplicationKernel_hw.log file shows an issue similar to https://github.com/spcl/gemm_hls/issues/25:

===>The following messages were generated while  performing high-level synthesis for kernel: MatrixMultiplicationKernel Log file: /home/lab/yong/SoonToBeRemoved/gemm_hls/build-wDSP/_x/MatrixMultiplicationKernel_hw/MatrixMultiplicationKernel/vitis_hls.log :
INFO: [v++ 204-61] Pipelining loop 'ReadA_N0_ReadA_K0_ReadA_N1_ReadA_N2'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 16, Depth = 93, loop 'ReadA_N0_ReadA_K0_ReadA_N1_ReadA_N2'

The entire log files are below.
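
For context, the loop named in that message is a pipelined memory-read loop for A; conceptually it is something like the sketch below. This is illustrative only, not the actual code in memory.cpp; the function signature, label, and types are placeholders.

#include <ap_int.h>
#include <hls_stream.h>

// Illustrative sketch of a pipelined DDR-read loop feeding a stream.
// With II=1 it issues one wide read per cycle; the log above shows the
// tool only achieved II=16, i.e. one read every 16 cycles, which
// throttles the memory reader for A.
void ReadA(ap_uint<512> const *memory, hls::stream<ap_uint<512>> &toKernel,
           unsigned count) {
ReadA_Loop:
  for (unsigned i = 0; i < count; ++i) {
    #pragma HLS PIPELINE II=1
    toKernel.write(memory[i]);
  }
}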

I'm going to rebuild the kernel in hardware mode with the same configuration as on GitHub, on Vitis 2020.2 (cmake ../ -DMM_DATA_TYPE=float -DMM_PARALLELISM_N=32 -DMM_PARALLELISM_M=8 -DMM_MEMORY_TILE_SIZE_N=512 -DMM_MEMORY_TILE_SIZE_M=512).

And I built the kernel in hardware mode, not simulation mode.

I ran various n, m, k combinations from n=m=k=16 up to n=m=k=2048, and the number of mismatches kept growing. Something strange: when the input matrix configuration is n=m=k=16, repeated executions of the command (./RunHardware.exe 1024 1024 1024 hw) make the mismatch differences in RunHardware.cpp (std::abs(testVal - refVal)) grow larger.

charliechou1001 commented 2 years ago

Hi, I'm also using the gemm_hls project to build my own work. My simulation result based on the GEMM is correct.

The simulation and hardware modes do the same thing, so if the mismatch exists in hardware mode, it may also appear in simulation mode. What data type do you use? Is it floating point?

definelicht commented 2 years ago

@xooxit Did compiling it in 2020.2 make a difference? I'm curious if the II=16 issue is related to the verification error.

xooxit commented 2 years ago

@definelicht I compiled again in 2021.1 (the previous one was 2021.2); there is no verification error and also no II=16 issue.

For the verification error above, I built with -DMM_ADD_RESOURCE=FAddSub_nodsp -DMM_MULT_RESOURCE=FMul_nodsp. When I built without the nodsp options in the same 2021.2 version, there was no verification issue, but the II=16 issue remained.

(I edited the build commands in the original question accordingly.)

xooxit commented 2 years ago

@charliechou1001 Hi -

I built with the nodsp options and there were verification issues, but without the nodsp options there was no verification error at all. In simulation mode there was no verification error in either case, and the data type was floating point.

definelicht commented 2 years ago

Wow, ok. So 2021.2 is slow because of II=16, and nodsp breaks 2021.2 correctness. Does nodsp also break 2021.1 correctness?

I would not recommend using FMul_nodsp; it is very expensive. FMul_fulldsp and FAddSub_nodsp are usually a good combo, since addition doesn't benefit much from DSPs, but multiplication benefits a lot.
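
For example, adapting the flags from your command at the top of this issue, that would be something like:

cmake ../ -DMM_DATA_TYPE=float -DMM_PARALLELISM_N=32 -DMM_PARALLELISM_M=8 -DMM_MEMORY_TILE_SIZE_N=512 -DMM_MEMORY_TILE_SIZE_M=512 -DMM_ADD_RESOURCE=FAddSub_nodsp -DMM_MULT_RESOURCE=FMul_fulldsp

i.e., only the adder is forced off DSPs, while the multiplier keeps using them.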

xooxit commented 2 years ago

@definelicht I see. The nodsp options in 2021.1 do not break correctness, and there is no II=16 issue.

The verification error only appears on 2021.2 with nodsp for both ADD and MULT.

I'm now building with -DMM_ADD_RESOURCE=FAddSub_nodsp in both 2021.2 and 2021.1. By the way, could you let me know in what respect using FMul_nodsp is expensive?

definelicht commented 2 years ago

Ok, that's very strange. I suspect this is a bug on Xilinx' side, not in this repo. I think I will put a notice in the README that the accelerator is broken in 2021.2, and see if it improves in future versions, unless any new information comes up?

charliechou1001 commented 2 years ago

Hi @xooxit, I got the project working on an Alveo U250. The Vitis version I use is 2020.2, the parameters in CMakeLists are the defaults, and I also tried doubling the memory tile sizes m/n to 512; both work for me. Maybe the problem lies in the tool version.

Here is a screenshot of my result: [screenshot of the passing run attached]

charliechou1001 commented 2 years ago

Also, from my workmate's experience, different versions of HLS will produce different synthesis results from the same code, especially in hardware resource consumption; maybe the timing or other factors, and hence the mismatch problems, are related to that.

definelicht commented 2 years ago

> Also, from my workmate's experience, different versions of HLS will produce different synthesis results from the same code, especially in hardware resource consumption; maybe the timing or other factors, and hence the mismatch problems, are related to that.

There is always a difference between different versions of the tools, but it's unfortunate if they even break the code :-(

yunchenlo commented 2 years ago

Hi all, I think I am facing the same issue.

The board is a U50 and the Vitis version is 2021.2.

Here is the execution log for your reference. I hope it helps to pinpoint the bug.

[screenshot of the execution log attached]

yclo

definelicht commented 2 years ago

> Hi all, I think I am facing the same issue.
>
> The board is a U50 and the Vitis version is 2021.2.
>
> Here is the execution log for your reference. I hope it helps to pinpoint the bug.

Did you check whether it works when compiled with 2021.1 or older?

yunchenlo commented 2 years ago

Yes, I tried 2021.1 and it passes the test!

yclo