Thanks for the great repo. I just wanted to see if there is an effort to support edge platforms such as the ZCU102? I am currently working on it, but I am not sure where to start. I have already changed the cmake files to redirect GCC to aarch64 and XRT to the edge platform, but I am getting the following error for compiling RunHardware.cpp, and this cmake configuration output:

As far as I know, the ZCU102 also has four banks for its DDR. Any help would be greatly appreciated.
Hi there. I don't have an embedded board to test on myself, but from the error you're getting, my guess is that a system OpenCL header is being included instead of the Xilinx OpenCL header, so the Xilinx components are missing.
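A quick way to confirm this is a compile-time probe; here is a minimal sketch, assuming only that the Xilinx cl_ext.h defines vendor extensions such as CL_MEM_EXT_PTR_XILINX, which a stock system header does not:

#include <CL/cl_ext.h>

// If the Xilinx header is the one being picked up, this macro is defined;
// hitting the #error means a system OpenCL header shadows it.
#ifndef CL_MEM_EXT_PTR_XILINX
#error "CL/cl_ext.h is not the Xilinx header -- check the include order"
#endif

int main() { return 0; }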
This might have been fixed by a newer version of hlslib, so I will update the hlslib version on Monday. In the meantime, you can try to:
git pull origin master
and run again, to see if this works out of the box. Otherwise, you can debug the include paths by running:

make VERBOSE=1
and making sure the Xilinx OpenCL headers are present in one of the included folders.

Thanks for the help. So I fixed the problem with XRT and cl_ext.h. For the embedded edge platform, the XRT root directory is different, so I modified FindVitis.cmake to fix it. Now I have the following linking error, caused by the architecture mismatch:
Scanning dependencies of target PrintSpecifications
[ 10%] Building CXX object CMakeFiles/PrintSpecifications.dir/src/PrintSpecifications.cpp.o
[ 20%] Linking CXX executable PrintSpecifications
[ 20%] Built target PrintSpecifications
Scanning dependencies of target mmkernel
[ 30%] Building CXX object CMakeFiles/mmkernel.dir/kernel/Compute.cpp.o
[ 40%] Building CXX object CMakeFiles/mmkernel.dir/kernel/Memory.cpp.o
[ 50%] Building CXX object CMakeFiles/mmkernel.dir/kernel/Top.cpp.o
[ 60%] Linking CXX static library libmmkernel.a
[ 60%] Built target mmkernel
Scanning dependencies of target RunHardware.exe
[ 70%] Building CXX object CMakeFiles/RunHardware.exe.dir/host/RunHardware.cpp.o
In file included from /mnt/500GB/home/mbaharan/gemm_hls/hlslib/include/hlslib/xilinx/SDAccel.h:60,
from /mnt/500GB/home/mbaharan/gemm_hls/include/Utility.h:13,
from /mnt/500GB/home/mbaharan/gemm_hls/host/RunHardware.cpp:11:
/mnt/500GB/home/mbaharan/gemm_hls/hlslib/include/hlslib/xilinx/../common/OpenCL.h: In function ‘cl_mem_flags hlslib::ocl::{anonymous}::BankToFlag(hlslib::ocl::MemoryBank, bool)’:
/mnt/500GB/home/mbaharan/gemm_hls/hlslib/include/hlslib/xilinx/../common/OpenCL.h:243:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
[ 80%] Linking CXX executable RunHardware.exe
/mnt/2TB/WorkingDir/FPGA/Vitis_Embedded_Platform_Source/Xilinx_Official_Platforms/zcu102_base/platform_repo/sysroot/sysroots/x86_64-petalinux-linux/usr/libexec/aarch64-xilinx-linux/gcc/aarch64-xilinx-linux/8.2.0/real-ld: skipping incompatible /tools/Xilinx/Vitis/2019.2/lnx64/tools/fpo_v7_0/libIp_floating_point_v7_0_bitacc_cmodel.so when searching for -lIp_floating_point_v7_0_bitacc_cmodel
/mnt/2TB/WorkingDir/FPGA/Vitis_Embedded_Platform_Source/Xilinx_Official_Platforms/zcu102_base/platform_repo/sysroot/sysroots/x86_64-petalinux-linux/usr/libexec/aarch64-xilinx-linux/gcc/aarch64-xilinx-linux/8.2.0/real-ld: cannot find -lIp_floating_point_v7_0_bitacc_cmodel
collect2: error: ld returned 1 exit status
CMakeFiles/RunHardware.exe.dir/build.make:86: recipe for target 'RunHardware.exe' failed
make[2]: *** [RunHardware.exe] Error 1
CMakeFiles/Makefile2:242: recipe for target 'CMakeFiles/RunHardware.exe.dir/all' failed
make[1]: *** [CMakeFiles/RunHardware.exe.dir/all] Error 2
Makefile:94: recipe for target 'all' failed
As you can see, the ARM ld is trying to link RunHardware.exe against libIp_floating_point_v7_0_bitacc_cmodel.so, and I am afraid the .so file does not exist for the ARM architecture. I am wondering how to fix this. My plan was to get everything running on the ZCU102 first, then change all the computation types to integer rather than floating-point.
So I fixed the previous problem, and I am able to run the program on the FPGA, but I am getting the following error:
root@xilinx-zcu102-2019_2:/mnt# ./RunHardware.exe hw
Initializing host memory... Done.
Initializing OpenCL context...
Programming device...
Initializing device memory...
XRT build version: 2.3.0
Build hash: 1eb61547b241c1a5a7aaee4645d6d481fb0f25d6
Build date: 2019-11-05 18:58:42
Git branch: devtool
PID: 2602
UID: 0
[Sun Jul 12 21:31:53 2020]
HOST: xilinx-zcu102-2019_2
EXE: /mnt/RunHardware.exe
[XRT] ERROR: std::bad_alloc
Execution failed with error: "Failed to initialize device memory.".
Any thoughts or ideas? Thanks
Have you pushed your changes to a fork so I can see what you needed to change to make it work? It would be good to integrate this into the main repository.
Regarding libIp_floating_point_v7_0_bitacc_cmodel.so: this is only required for half precision, but currently I indiscriminately link against it. I will change CMake to only link against it if the data type is half.
Regarding the device memory issue: I think I know what the problem is, let me take a look.
I found the bug for your latest error ("Failed to initialize device memory.") and have pushed a fix. Please let me know if this solves it.
Thanks for the update. Still facing the same problem. I am going to develop a simple vector add based on hlslib for the ZCU102, to see if we will face the same problem or not. I will update you ASAP.
Strange, the issue I found was that the code in host/RunHardware.cpp was allocating memory to two banks, even when MM_TWO_DIMMS was not set. You did not set this variable to true, right? Can you double check that the host code you are running is not specifying any memory banks?
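For reference, here is a minimal sketch of the difference, assuming hlslib's Context::MakeBuffer and Buffer::CopyFromHost APIs (kSize and the float data type are placeholders):

#include <cstddef>
#include <vector>

#include "hlslib/xilinx/OpenCL.h"

int main() {
  constexpr std::size_t kSize = 1024;
  hlslib::ocl::Context context;
  // Bank-agnostic allocation lets the runtime place the buffer, which is
  // what a platform without per-bank DDR access needs:
  auto buffer =
      context.MakeBuffer<float, hlslib::ocl::Access::readWrite>(kSize);
  // A bank-specific allocation (what MM_TWO_DIMMS enables) would instead
  // pin the buffer to a DDR bank:
  // auto pinned = context.MakeBuffer<float, hlslib::ocl::Access::readWrite>(
  //     hlslib::ocl::MemoryBank::bank0, kSize);
  std::vector<float> host(kSize, 1.0f);
  buffer.CopyFromHost(host.cbegin());
  return 0;
}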
Based on the default cmake parameters mentioned in README.md, the size of matrices A and B is 16 GB; correct me if I am wrong. The ZCU102 has only 4 GB of shared memory, and it doesn't have dedicated device memory. Although there is a 512 MB dedicated memory for the PL, the platform I am using doesn't support it. So I re-ran make with the following configuration, under which each matrix is only 1024 x 1024 x 4 B = 4 MiB:
cmake ../ -DMM_DATA_TYPE=float -DMM_SIZE_N=1024 -DMM_SIZE_M=1024 -DMM_PARALLELISM_N=32 -DMM_PARALLELISM_M=8 -DMM_MEMORY_TILE_SIZE_N=512 -DMM_MEMORY_TILE_SIZE_M=512
and I got the following output:
root@xilinx-zcu102-2019_2:/mnt# ./RunHardware.exe hw
Initializing host memory... Done.
Initializing OpenCL context...
Programming device...
Initializing device memory...
Memory is created...
Doing the rest of the things...
Copying memory to device...
Creating kernel...
Executing kernel...
Kernel executed in 0.0181169 seconds, corresponding to a performance of 59.2674 GOp/s.
Copying back result...
Running reference implementation...
WARNING: BLAS not available, so I'm falling back on a naive implementation. This will take a long time for large matrix sizes.
Verifying result...
Mismatch at (485, 560): 0 vs. 16382.2
I am not sure why there is a mismatch, but this is the next step I will work on. I need to re-read the paper on the MMM implementation; however, I have a question: is MM_MEMORY_BUS_WIDTH_N the AXI stream packet size, or is it the actual bit width of the memory? I need to double-check it for the ZCU102. I am going to fork your repo and add ZCU102 support to it. Once I have finalized and debugged the whole procedure, you can add it to your master repo.
So you solved the issue with "Failed to initialize device memory"?
This mismatch is indeed surprising, since it seems to be at a really random index. Usually problems are at the edges. Can you check how many zeros/mismatches are present in the full matrix?
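Something along these lines would do, as a rough sketch (result and reference are placeholder names for the device output and the naive reference, laid out row-major):

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Count zeros and relative mismatches between the two host-side buffers.
void CountMismatches(const std::vector<float> &result,
                     const std::vector<float> &reference,
                     std::size_t n, std::size_t m) {
  std::size_t zeros = 0, mismatches = 0;
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < m; ++j) {
      const float got = result[i * m + j];
      const float want = reference[i * m + j];
      if (got == 0) {
        ++zeros;
      }
      if (std::fabs(got - want) > 1e-4f * std::fabs(want)) {
        ++mismatches;
      }
    }
  }
  std::printf("%zu zeros and %zu mismatches out of %zu elements\n",
              zeros, mismatches, n * m);
}

Knowing whether the zeros form a contiguous block or are scattered would already narrow it down.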
MM_MEMORY_BUS_WIDTH_N is the width (in bytes) of the data bus to the AXI master interface, which is converted to the appropriate data width during runtime. The internal streaming interfaces have widths defined by kComputeTileSizeN and kComputeTileSizeM, the former of which is currently always 1. Generally I would recommend leaving this at 64 bytes for all interfaces, unless you are using something other than DDR4.
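To illustrate the relationship with a small sketch (64 bytes as recommended above, with float assumed as the data type):

#include "hlslib/xilinx/DataPack.h"

// A 64-byte AXI data bus carrying 4-byte floats moves 16 elements per
// transfer, so the memory-side stream is a 16-wide DataPack:
constexpr int kBusWidthBytes = 64;
constexpr int kElementsPerBus = kBusWidthBytes / sizeof(float);  // 16
using MemoryPack_t = hlslib::DataPack<float, kElementsPerBus>;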
Regarding forking: depending on how many changes are necessary, I would much prefer if you submit each issue that you needed to fix as a separate pull request (for example, one PR for fixing CMake, one PR for fixing a memory issue, etc.). Otherwise we risk that I will want to integrate some, but not all, of your changes, and then it will not be possible to merge :-) Thanks!
So I had some time today, and I pushed the modified source code to my forked repo. Here is the link: https://github.com/mbaharan/gemm_hls. This way, you can also see the changes I have made. I also configured and re-synthesized the code, this time for uint8_t, by running cmake with the following configuration:
cmake ../ -DMM_DATA_TYPE=uint8_t -DMM_SIZE_N=512 -DMM_SIZE_M=512 -DMM_PARALLELISM_N=32 -DMM_PARALLELISM_M=8 -DMM_MEMORY_TILE_SIZE_N=512 -DMM_MEMORY_TILE_SIZE_M=512 -DXRT_ROOT_DIR=$XRT_ROOT_DIR -DOpenCL_LIBRARIES=$SDKTARGETSYSROOT/usr/lib/ -DOpenCL_INCLUDE_DIRS=$SDKTARGETSYSROOT/usr/include/ -DCMAKE_SYSTEM_PROCESSOR=$CMAKE_SYSTEM_PROCESSOR -DCMAKE_SYSTEM_NAME=$CMAKE_SYSTEM_NAME
results.log is the output of the run. The first 16x512 block is wrong. I am not sure what is causing the mismatches; I am still working on it. I am also thinking about how memory is shared between the PL and the PS, since there is no dedicated device memory as on the U50 or your evaluation board. We have a solid working solution on the U50, but not for the ZCU102. Let me know what you think. Thanks.
This is strange indeed. If anything I would expect values at the end to be wrong, not values at the beginning. These types of errors are usually related to the memory copies to/from the device, not to the computation itself. Perhaps you can try verifying that all matrices have the values that you expect before the computation starts.
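For example, a minimal sketch of such a check, assuming hlslib's Buffer::CopyToHost(iterator), with the device buffer and host copy passed in by the caller:

#include <cstddef>
#include <cstdio>
#include <vector>

// Read an input buffer back from the device right after writing it, and
// diff it against the host copy to rule out a broken transfer.
template <typename Buffer>
bool TransferIntact(Buffer &device, const std::vector<float> &host) {
  std::vector<float> readback(host.size());
  device.CopyToHost(readback.begin());
  for (std::size_t i = 0; i < host.size(); ++i) {
    if (readback[i] != host[i]) {
      std::printf("Transfer corrupted at index %zu\n", i);
      return false;
    }
  }
  return true;
}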
512 is the memory tile size, but I'm unsure where 16 comes from: it is less than the transpose width of 64 bytes. Did you try any data types other than uint8_t?
Any news @mbaharan?
Sorry for the late response. No progress of any sort yet. What I am doing right now is developing a simple vector add based on hlslib, and specifically on DataPack, to understand the cause of the problem. I will definitely update you if I have any success. Thanks.
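For what it's worth, the kernel side of the test I have in mind looks roughly like this sketch (widths and names are arbitrary, and I am assuming hlslib's DataPack with Get/Set element access):

#include "hlslib/xilinx/DataPack.h"

constexpr int kWidth = 16;
using Pack_t = hlslib::DataPack<float, kWidth>;

// Minimal packed vector add: every memory access moves kWidth elements.
void VectorAdd(Pack_t const *a, Pack_t const *b, Pack_t *c, int n) {
  #pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem0
  #pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem1
  #pragma HLS INTERFACE m_axi port=c offset=slave bundle=gmem0
  #pragma HLS INTERFACE s_axilite port=n
  #pragma HLS INTERFACE s_axilite port=return
  for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
    const Pack_t packA = a[i];
    const Pack_t packB = b[i];
    Pack_t result;
    for (int w = 0; w < kWidth; ++w) {
      #pragma HLS UNROLL
      result.Set(w, packA.Get(w) + packB.Get(w));
    }
    c[i] = result;
  }
}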
Closing due to inactivity. Feel free to reopen if you have any updates.