cl.hpp missing - OpenCL version mismatch?

PietroGhg commented 3 years ago

Hello everyone, I was trying to run the GEMM benchmark. The cmake configuration succeeds, but the build fails with the error CL/cl.hpp, no such file or directory. There is no such file in my /usr/include/CL directory. Could this be due to a mismatch in our OpenCL versions? My cl2.hpp file mentions version 2.0.7. I have also tried setting USE_DEPRECATED_CPP_HEADER to false; in that case the build fails with a syntax error in cl2.hpp, line 7841: expected ; at the end of method declaration. Thanks, Pietro

Mellich commented 3 years ago

Hello Pietro,

The GEMM version in master has not been updated to the new header yet. You need to remove the old header in the GEMM host code and slightly change the kernel call to make it compatible. You can find the changes in #8. After the changes, I am able to compile and run the host code with:

cmake -DUSE_DEPRECATED_HPP_HEADER=No
make GEMM_intel

Please tell me if this resolves the issue for you.

Best Regards, Marius

PietroGhg commented 3 years ago

Hello, thanks for your reply. After your changes I still cannot build. I get the following error:

In file included from /usr/include/CL/cl.h:30:0,
                 from /usr/include/CL/opencl.h:42,
                 from /usr/include/CL/cl2.hpp:453,
                 from /home/fpga/pietro/hpccv2/HPCC_FPGA/shared/include/setup/fpga_setup.hpp:37,
                 from /home/fpga/pietro/hpccv2/HPCC_FPGA/shared/setup/fpga_setup.cpp:5:
/usr/include/CL/cl2.hpp:7841:30: error: expected ‘;’ at end of member declaration
     Event* event = NULL) CL_EXT_SUFFIX__VERSION_1_2_DEPRECATED const
                          ^
In file included from /home/fpga/pietro/hpccv2/HPCC_FPGA/shared/include/setup/fpga_setup.hpp:37:0,
                 from /home/fpga/pietro/hpccv2/HPCC_FPGA/shared/setup/fpga_setup.cpp:5:
/usr/include/CL/cl2.hpp:7842:5: error: expected unqualified-id before ‘{’ token
     {

Looking at cl2.hpp, the line that raises the error appears after #if defined(CL_USE_DEPRECATED_OPENCL_1_2_APIS)

Your project doesn't seem to set this flag, but by running grep -r CL_USE_DEPRECATED_OPENCL_1_2_APIS, I have found the following: build/GEMM/_deps/extern_hlslib-src/include/hlslib/xilinx/SDAccel.h:#define CL_USE_DEPRECATED_OPENCL_1_2_APIS.

At this point I don't know if it is something that you can fix on your side. I have managed to compile by manually adding the cl.hpp file to my include directory, so feel free to close the issue if you find it appropriate. Thanks

Mellich commented 3 years ago

Thank you for providing a temporary fix for this issue. I wasn't aware that you are using the Xilinx toolchain here. I will have an additional look at it.

PietroGhg commented 3 years ago

Hello, I managed to compile the benchmark with the fix you provided. I had some runtime problems (segfaults) when trying to execute it, since the Xilinx Runtime (XRT) apparently does not implement some functions of the OpenCL C API (such as clCreateCommandQueueWithProperties). So I ended up rewriting the calculate function using only C API functions that are supported by XRT. Right now execution terminates without any error, but the results are incorrect (all zeros). This is my version of the calculate function; it's my first time dealing with OpenCL, so maybe I have messed something up. calculate.txt

Mellich commented 3 years ago

Hello,

The results are all zeros because the kernel does not calculate anything. Kernel arguments 8 and 9 (counting from 1, i.e. indices 7 and 8 in clSetKernelArg) are the minimum and maximum column of blocks in the result matrix that are calculated by the kernel. Multiple kernels will divide the calculation of the result matrix column-wise! In your case you are only using a single kernel replication. Thus, argument 8 must be 0 and argument 9 must be the same as argument 7.

So in your code the last three kernel arguments should be:

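// Index 6: matrix size in blocks.
e = clSetKernelArg(kernel, 6, sizeof(size_in_blocks), &size_in_blocks);
// check_status(e);
// Index 7: minimum block column, 0 for a single kernel replication.
cl_uint nbpk = static_cast<cl_uint>(0);
e = clSetKernelArg(kernel, 7, sizeof(nbpk), &nbpk);
// check_status(e);
// Index 8: maximum block column, the same value as index 6.
e = clSetKernelArg(kernel, 8, sizeof(size_in_blocks), &size_in_blocks);
// check_status(e);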

This should work if you only need a single kernel replication. However, on many FPGAs you will need multiple kernels to better utilise the FPGA resources!
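As an illustration of how the minimum and maximum block columns could be chosen once more than one kernel replication is used, here is a minimal sketch of an even column-wise split. The helper name column_block_range and the splitting rule are assumptions made for this example; the actual host code in the repository may partition the columns differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper (not part of the benchmark sources): returns the
// range [first, last) of block columns assigned to kernel replication
// `replication` when `size_in_blocks` columns are spread as evenly as
// possible over `num_replications` kernels.
std::pair<std::uint32_t, std::uint32_t> column_block_range(
        std::uint32_t size_in_blocks,
        std::uint32_t num_replications,
        std::uint32_t replication) {
    assert(num_replications > 0 && replication < num_replications);
    const std::uint32_t base = size_in_blocks / num_replications;
    const std::uint32_t extra = size_in_blocks % num_replications;
    // The first `extra` replications take one additional column each.
    const std::uint32_t first =
        replication * base + std::min(replication, extra);
    const std::uint32_t count = base + (replication < extra ? 1u : 0u);
    return {first, first + count};
}
```

With a single replication this yields 0 and size_in_blocks, which matches the values suggested above for arguments 8 and 9.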

PietroGhg commented 3 years ago

Thanks. I removed the kernel replications in order to simplify the C++ to C transition. I'll try your suggestion and possibly re-implement kernel replication later in order to maximize performance.

PietroGhg commented 3 years ago

Hello, sorry for getting back to you this late. I think I managed to get the benchmark to run on the zcu106. I'll test a bit more and then either post the instructions here or make a pull request. I noticed that if I set very small values for the -m and -b command line options, the benchmark hangs: it doesn't raise any error, but execution doesn't terminate (e.g. -m 1 -b 8 and MATRIX_SIZE=2 and BLOCK_SIZE=256). Do you know why?

Mellich commented 3 years ago

Hey Pietro,

that sounds good! Thank you for considering sharing your work with the community.

The definition of the block sizes is unfortunately not uniform among all benchmarks. In LINPACK, you give the block size as log2 (8 if you want to have a block size of 256), whereas in the GEMM benchmark you have to use the total number directly (256). We are planning to unify the behavior to the log2 case in the future since it is less error-prone.

So if you have defined a block size of 256 for your design, you also need to use -b 256 with your host code (or leave it blank to use the default which should work fine). They have to match, otherwise, you will get undefined behavior. The option is intended for use cases where you want to run multiple designs with the same host code.
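The difference between the two conventions can be sketched as follows. Both helper names are hypothetical and not taken from the benchmark sources; the sketch only shows the conversion from the log2 form and the kind of consistency check a host program could run before launching the kernel.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not part of the benchmark sources.

// LINPACK-style option: the block size is given as its base-2 logarithm,
// so a value of 8 stands for a block size of 256.
std::uint32_t block_size_from_log2(std::uint32_t log2_block_size) {
    assert(log2_block_size < 32);
    return std::uint32_t{1} << log2_block_size;
}

// GEMM-style option: the block size is given directly and has to equal
// the value the kernel was synthesized with.
bool host_block_size_matches(std::uint32_t host_block_size,
                             std::uint32_t synthesized_block_size) {
    return host_block_size == synthesized_block_size;
}
```

Under this sketch, passing -b 8 to a GEMM design built with a block size of 256 would fail the check, while -b 256 would pass.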

I don't know why the execution hangs, but in your example, it will most likely read and write from unallocated global memory on the FPGA, so maybe this is a side effect.
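A rough sketch of the arithmetic behind this guess, under the assumptions that -m is the matrix width in blocks, -b is the block width in elements, and the host allocates one square matrix of (m * b) x (m * b) elements. The helper name matrix_elements is hypothetical, and the real allocation logic in the host code may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of elements in a square matrix that is
// `blocks_per_dim` blocks wide, with each block `block_size` elements wide.
std::uint64_t matrix_elements(std::uint64_t blocks_per_dim,
                              std::uint64_t block_size) {
    const std::uint64_t width = blocks_per_dim * block_size;
    return width * width;
}
```

Under these assumptions, -m 1 -b 8 gives a host matrix of 64 elements, whereas a single block of a kernel synthesized with a block size of 256 already spans 65536 elements, so the kernel would access memory far beyond the allocated buffers.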

PietroGhg commented 3 years ago

So if I want to change the matrix size, should I rebuild the benchmark and re-synthesize the kernel?

Mellich commented 3 years ago

No, the matrix size should be adjustable during runtime. So it also stalls if you set -m to 2 and keep -b at 256? Can you give more information about your configuration and at what point the execution is stalling exactly?

PietroGhg commented 3 years ago

Execution hangs after the kernel has been issued. But everything works fine if I keep -b at 256 (same as the cmake variable) and use -m to choose the matrix size.

To build the benchmark I have rewritten the calculate function, using clCreateCommandQueue to create a command queue, since at the moment XRT doesn't seem to implement clCreateCommandQueueWithProperties. calculate.txt

The zcu106 board is an embedded board with an ARM64 CPU, so you will need OpenCL libraries built for arm64. You will also need to configure the platform using Vivado and PetaLinux in order to have a platform .xpfm file and a sysroot. Then this is my cmake invocation: cmake_invo.txt

The first invocation will fail. You will need to edit the build/_deps/extern_hlslib-src/cmake/FindVitis.cmake file, changing line 97 to

if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86)|(X86)|(amd64)|(AMD64)|(ARM64)")

and line 192 to

set(Vitis_LIBRARIES $ENV{OpenCL_LIBRARIES} ${Vitis_LIBXILINXOPENCL})

Then you can re-run the cmake invocation and build. This is an example of the toolchain file I used to cross-compile: toolchain.txt

Also make sure to have the right settings in your .ini files. Thanks again for your patience and your support :)

Mellich commented 3 years ago

Thank you for providing your changes to get things running! I would also be very interested in the final configuration, resource utilization, and measurement results on this board since it is quite different to the usual targets. Could you please provide them to me as well?

pc2 / HPCC_FPGA

cl.hpp missing - OpenCL version mismatch? #7