Thanks for the great repo. I just wanted to see if there is an effort to support edge platforms such as the ZCU102? I am currently working on it, but I am not sure where to start. I have already changed the cmake files to redirect GCC to aarch64 and XRT to the edge platform, but I am getting the following error for compiling RunHardware.cpp, and this cmake configuration output:

As far as I know, the ZCU102 also has four banks for its DDR. Any help would be greatly appreciated.
Hi there. I don't have an embedded board to test on myself, but from the error you're getting, my guess is that a system OpenCL header is being included instead of the Xilinx OpenCL header, so the Xilinx components are missing.
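A quick way to confirm this is a compile-time probe; here is a minimal sketch, assuming only that the Xilinx cl_ext.h defines vendor extensions such as CL_MEM_EXT_PTR_XILINX, which a stock system header does not:

#include <CL/cl_ext.h>

// If the Xilinx header is the one being picked up, this macro is defined;
// hitting the #error means a system OpenCL header shadows it.
#ifndef CL_MEM_EXT_PTR_XILINX
#error "CL/cl_ext.h is not the Xilinx header -- check the include order"
#endif

int main() { return 0; }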
This might have been fixed by a newer version of hlslib, so I will update the hlslib version on Monday. In the meantime, you can try to:
git pull origin master
and run again, to see if this works out of the box. Otherwise, you can debug the include paths by running:

make VERBOSE=1
and making sure the Xilinx OpenCL headers are present in one of the included folders.

Thanks for the help. So I fixed the problem with XRT and cl_ext.h. For the embedded edge platform, the XRT root directory is different, so I modified FindVitis.cmake to fix it. Now I have the following linking error, caused by the architecture mismatch:
Scanning dependencies of target PrintSpecifications
[ 10%] Building CXX object CMakeFiles/PrintSpecifications.dir/src/PrintSpecifications.cpp.o
[ 20%] Linking CXX executable PrintSpecifications
[ 20%] Built target PrintSpecifications
Scanning dependencies of target mmkernel
[ 30%] Building CXX object CMakeFiles/mmkernel.dir/kernel/Compute.cpp.o
[ 40%] Building CXX object CMakeFiles/mmkernel.dir/kernel/Memory.cpp.o
[ 50%] Building CXX object CMakeFiles/mmkernel.dir/kernel/Top.cpp.o
[ 60%] Linking CXX static library libmmkernel.a
[ 60%] Built target mmkernel
Scanning dependencies of target RunHardware.exe
[ 70%] Building CXX object CMakeFiles/RunHardware.exe.dir/host/RunHardware.cpp.o
In file included from /mnt/500GB/home/mbaharan/gemm_hls/hlslib/include/hlslib/xilinx/SDAccel.h:60,
from /mnt/500GB/home/mbaharan/gemm_hls/include/Utility.h:13,
from /mnt/500GB/home/mbaharan/gemm_hls/host/RunHardware.cpp:11:
/mnt/500GB/home/mbaharan/gemm_hls/hlslib/include/hlslib/xilinx/../common/OpenCL.h: In function ‘cl_mem_flags hlslib::ocl::{anonymous}::BankToFlag(hlslib::ocl::MemoryBank, bool)’:
/mnt/500GB/home/mbaharan/gemm_hls/hlslib/include/hlslib/xilinx/../common/OpenCL.h:243:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
[ 80%] Linking CXX executable RunHardware.exe
/mnt/2TB/WorkingDir/FPGA/Vitis_Embedded_Platform_Source/Xilinx_Official_Platforms/zcu102_base/platform_repo/sysroot/sysroots/x86_64-petalinux-linux/usr/libexec/aarch64-xilinx-linux/gcc/aarch64-xilinx-linux/8.2.0/real-ld: skipping incompatible /tools/Xilinx/Vitis/2019.2/lnx64/tools/fpo_v7_0/libIp_floating_point_v7_0_bitacc_cmodel.so when searching for -lIp_floating_point_v7_0_bitacc_cmodel
/mnt/2TB/WorkingDir/FPGA/Vitis_Embedded_Platform_Source/Xilinx_Official_Platforms/zcu102_base/platform_repo/sysroot/sysroots/x86_64-petalinux-linux/usr/libexec/aarch64-xilinx-linux/gcc/aarch64-xilinx-linux/8.2.0/real-ld: cannot find -lIp_floating_point_v7_0_bitacc_cmodel
collect2: error: ld returned 1 exit status
CMakeFiles/RunHardware.exe.dir/build.make:86: recipe for target 'RunHardware.exe' failed
make[2]: *** [RunHardware.exe] Error 1
CMakeFiles/Makefile2:242: recipe for target 'CMakeFiles/RunHardware.exe.dir/all' failed
make[1]: *** [CMakeFiles/RunHardware.exe.dir/all] Error 2
Makefile:94: recipe for target 'all' failed
As you can see, the ARM ld is trying to link RunHardware.exe against libIp_floating_point_v7_0_bitacc_cmodel.so, and I am afraid the .so file does not exist for the ARM architecture. I am wondering how to fix this. My plan was to get everything running on the ZCU102 first, then change all the computation types to integer rather than floating-point.
So I fixed the previous problem, and I am able to run the program on the FPGA, but I am getting the following error:
root@xilinx-zcu102-2019_2:/mnt# ./RunHardware.exe hw
Initializing host memory... Done.
Initializing OpenCL context...
Programming device...
Initializing device memory...
XRT build version: 2.3.0
Build hash: 1eb61547b241c1a5a7aaee4645d6d481fb0f25d6
Build date: 2019-11-05 18:58:42
Git branch: devtool
PID: 2602
UID: 0
[Sun Jul 12 21:31:53 2020]
HOST: xilinx-zcu102-2019_2
EXE: /mnt/RunHardware.exe
[XRT] ERROR: std::bad_alloc
Execution failed with error: "Failed to initialize device memory.".
Any thoughts or ideas? Thanks
Have you pushed your changes to a fork so I can see what you needed to change to make it work? It would be good to integrate this into the main repository.
Regarding libIp_floating_point_v7_0_bitacc_cmodel.so: this is only required for half precision, but currently I indiscriminately link against it. I will change CMake to only link against it if the data type is half.
Regarding the device memory issue: I think I know what the problem is, let me take a look.
I found the bug for your latest error ("Failed to initialize device memory.") and have pushed a fix. Please let me know if this solves it.
Thanks for the update. Still facing the same problem. I am going to develop a simple vector add based on hlslib for the ZCU102, to see if we will face the same problem or not. I will update you ASAP.
Strange, the issue I found was that the code in host/RunHardware.cpp was allocating memory to two banks, even when MM_TWO_DIMMS was not set. You did not set this variable to true, right? Can you double check that the host code you are running is not specifying any memory banks?
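For reference, here is a minimal sketch of the difference, assuming hlslib's Context::MakeBuffer and Buffer::CopyFromHost APIs (kSize and the float data type are placeholders):

#include <cstddef>
#include <vector>

#include "hlslib/xilinx/OpenCL.h"

int main() {
  constexpr std::size_t kSize = 1024;
  hlslib::ocl::Context context;
  // Bank-agnostic allocation lets the runtime place the buffer, which is
  // what a platform without per-bank DDR access needs:
  auto buffer =
      context.MakeBuffer<float, hlslib::ocl::Access::readWrite>(kSize);
  // A bank-specific allocation (what MM_TWO_DIMMS enables) would instead
  // pin the buffer to a DDR bank:
  // auto pinned = context.MakeBuffer<float, hlslib::ocl::Access::readWrite>(
  //     hlslib::ocl::MemoryBank::bank0, kSize);
  std::vector<float> host(kSize, 1.0f);
  buffer.CopyFromHost(host.cbegin());
  return 0;
}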
Based on the default cmake parameters mentioned in README.md, the size of matrices A and B is 16 GB; correct me if I am wrong. The ZCU102 has only 4 GB of shared memory, and it doesn't have dedicated device memory. Although there is a 512 MB dedicated memory for the PL, the platform I am using doesn't support it. So I re-ran make with the following configuration, under which each matrix is only 1024 x 1024 x 4 B = 4 MiB:
cmake ../ -DMM_DATA_TYPE=float -DMM_SIZE_N=1024 -DMM_SIZE_M=1024 -DMM_PARALLELISM_N=32 -DMM_PARALLELISM_M=8 -DMM_MEMORY_TILE_SIZE_N=512 -DMM_MEMORY_TILE_SIZE_M=512
and I got the following output:
root@xilinx-zcu102-2019_2:/mnt# ./RunHardware.exe hw
Initializing host memory... Done.
Initializing OpenCL context...
Programming device...
Initializing device memory...
Memory is created...
Doing the rest of the things...
Copying memory to device...
Creating kernel...
Executing kernel...
Kernel executed in 0.0181169 seconds, corresponding to a performance of 59.2674 GOp/s.
Copying back result...
Running reference implementation...
WARNING: BLAS not available, so I'm falling back on a naive implementation. This will take a long time for large matrix sizes.
Verifying result...
Mismatch at (485, 560): 0 vs. 16382.2
I am not sure why there is a mismatch, but this is the next step I will work on. I need to re-read the paper on the MMM implementation; however, I have a question: is MM_MEMORY_BUS_WIDTH_N the AXI stream packet size, or is it the actual bit width of the memory? I need to double-check it for the ZCU102. I am going to fork your repo and add ZCU102 support to it. Once I have finalized and debugged the whole procedure, you can add it to your master repo.
So you solved the issue with "Failed to initialize device memory"?
This mismatch is indeed surprising, since it seems to be at a really random index. Usually problems are at the edges. Can you check how many zeros/mismatches are present in the full matrix?
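Something along these lines would do, as a rough sketch (result and reference are placeholder names for the device output and the naive reference, laid out row-major):

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Count zeros and relative mismatches between the two host-side buffers.
void CountMismatches(const std::vector<float> &result,
                     const std::vector<float> &reference,
                     std::size_t n, std::size_t m) {
  std::size_t zeros = 0, mismatches = 0;
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < m; ++j) {
      const float got = result[i * m + j];
      const float want = reference[i * m + j];
      if (got == 0) {
        ++zeros;
      }
      if (std::fabs(got - want) > 1e-4f * std::fabs(want)) {
        ++mismatches;
      }
    }
  }
  std::printf("%zu zeros and %zu mismatches out of %zu elements\n",
              zeros, mismatches, n * m);
}

Knowing whether the zeros form a contiguous block or are scattered would already narrow it down.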
MM_MEMORY_BUS_WIDTH_N is the width (in bytes) of the data bus to the AXI master interface, which is converted to the appropriate data width during runtime. The internal streaming interfaces have widths defined by kComputeTileSizeN and kComputeTileSizeM, the former of which is currently always 1. Generally I would recommend leaving this at 64 bytes for all interfaces, unless you are using something other than DDR4.
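To illustrate the relationship with a small sketch (64 bytes as recommended above, with float assumed as the data type):

#include "hlslib/xilinx/DataPack.h"

// A 64-byte AXI data bus carrying 4-byte floats moves 16 elements per
// transfer, so the memory-side stream is a 16-wide DataPack:
constexpr int kBusWidthBytes = 64;
constexpr int kElementsPerBus = kBusWidthBytes / sizeof(float);  // 16
using MemoryPack_t = hlslib::DataPack<float, kElementsPerBus>;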
Regarding forking: depending on how many changes are necessary, I would much prefer if you submit each issue that you needed to fix as a separate pull request (for example, one PR for fixing CMake, one PR for fixing a memory issue, etc.). Otherwise we risk that I will want to integrate some, but not all, of your changes, and then it will not be possible to merge :-) Thanks!
So I had some time today, and I pushed the modified source code to my forked repo. Here is the link: https://github.com/mbaharan/gemm_hls. This way, you can also see the changes I have made. I also configured and re-synthesized the code, this time for uint8_t, by running cmake with the following configuration:
cmake ../ -DMM_DATA_TYPE=uint8_t -DMM_SIZE_N=512 -DMM_SIZE_M=512 -DMM_PARALLELISM_N=32 -DMM_PARALLELISM_M=8 -DMM_MEMORY_TILE_SIZE_N=512 -DMM_MEMORY_TILE_SIZE_M=512 -DXRT_ROOT_DIR=$XRT_ROOT_DIR -DOpenCL_LIBRARIES=$SDKTARGETSYSROOT/usr/lib/ -DOpenCL_INCLUDE_DIRS=$SDKTARGETSYSROOT/usr/include/ -DCMAKE_SYSTEM_PROCESSOR=$CMAKE_SYSTEM_PROCESSOR -DCMAKE_SYSTEM_NAME=$CMAKE_SYSTEM_NAME
results.log is the output of the run. The first 16x512 block is wrong. I am not sure what is causing the mismatches; I am still working on it. I am also thinking about how memory is shared between the PL and the PS, since there is no dedicated device memory as on the U50 or your evaluation board. We have a solid working solution on the U50, but not for the ZCU102. Let me know what you think. Thanks.
This is strange indeed. If anything I would expect values at the end to be wrong, not values at the beginning. These types of errors are usually related to the memory copies to/from the device, not to the computation itself. Perhaps you can try verifying that all matrices have the values that you expect before the computation starts.
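For example, a minimal sketch of such a check, assuming hlslib's Buffer::CopyToHost(iterator), with the device buffer and host copy passed in by the caller:

#include <cstddef>
#include <cstdio>
#include <vector>

// Read an input buffer back from the device right after writing it, and
// diff it against the host copy to rule out a broken transfer.
template <typename Buffer>
bool TransferIntact(Buffer &device, const std::vector<float> &host) {
  std::vector<float> readback(host.size());
  device.CopyToHost(readback.begin());
  for (std::size_t i = 0; i < host.size(); ++i) {
    if (readback[i] != host[i]) {
      std::printf("Transfer corrupted at index %zu\n", i);
      return false;
    }
  }
  return true;
}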
512 is the memory tile size, but I'm unsure where 16 comes from: it is less than the transpose width of 64 bytes. Did you try any data types other than uint8_t?
Any news @mbaharan?
Sorry for the late response. No progress of any sort yet. What I am doing right now is developing a simple vector add based on hlslib, and specifically on DataPack, to understand the cause of the problem. I will definitely update you if I have any success. Thanks.
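For what it's worth, the kernel side of the test I have in mind looks roughly like this sketch (widths and names are arbitrary, and I am assuming hlslib's DataPack with Get/Set element access):

#include "hlslib/xilinx/DataPack.h"

constexpr int kWidth = 16;
using Pack_t = hlslib::DataPack<float, kWidth>;

// Minimal packed vector add: every memory access moves kWidth elements.
void VectorAdd(Pack_t const *a, Pack_t const *b, Pack_t *c, int n) {
  #pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem0
  #pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem1
  #pragma HLS INTERFACE m_axi port=c offset=slave bundle=gmem0
  #pragma HLS INTERFACE s_axilite port=n
  #pragma HLS INTERFACE s_axilite port=return
  for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
    const Pack_t packA = a[i];
    const Pack_t packB = b[i];
    Pack_t result;
    for (int w = 0; w < kWidth; ++w) {
      #pragma HLS UNROLL
      result.Set(w, packA.Get(w) + packB.Get(w));
    }
    c[i] = result;
  }
}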
Closing due to inactivity. Feel free to reopen if you have any updates.