triSYCL / sycl

SYCL for Vitis: Experimental fusion of triSYCL with Intel SYCL oneAPI DPC++ up-streaming effort into Clang/LLVM
Other
107 stars 19 forks source link

xilinx::pipeline not providing performance #136

Open ShanoToni opened 2 years ago

ShanoToni commented 2 years ago

Hello, I am interested in the performance provided by the xilinx::pipeline and xilinx::dataflow from the pipeline test single_task_vector_add_drt_dataflow_func_local_pipeline. I have modified it to produce timing data of the kernel execution and I am comparing it to the same test but with the pipeline and dataflow disabled. Both versions of the test seem to perform with the same timing. Is this due to some missing support for pipelining from the platform I am targeting? I used :

lforg37 commented 2 years ago

Hello,

I think on this example (I guess you speak of test single_task_vector_add.cpp ?), timing will probably not change a lot between pipelined or not pipelined kernel (the body of the loop is very simple so without pipelining it should not cost more than a few cycles, and as there is also few iterations the timing is probably dominated by the kernel setup time).

Besides, it is possible that the backend chooses to pipeline the kernel even without the annotation.

You can check on the report vxx_comp_report/kernel_name/hls_report/kernel_name_csynth.rpt if the kernel has been pipelined.

Lastly, the xilinx::pipeline has gone under modification recently. These modifications have not landed on sycl/unified/master yet and can be tested with the branch sycl/unified/next.

If the test is indeed the single_task_vector_add.cpp I don't think there is something to gain by using the dataflow annotation.

ShanoToni commented 2 years ago

Hello, Thank you for the quick reply, this is the test I ran. Is it safe to assume that it is too simple as well to produce a timing difference?

lforg37 commented 2 years ago

Hello,

Thank you for the precision. After a second read on the setting you say you are using, I am curious to know how exactly you managed to compile the test for ZCU106 ?

For the timing, you can find in the report I pointed the exact kernel execution time (latency x clock frequency). You can compare the measured execution time with it to get an idea of the overhead incurred by setting up the system.

ShanoToni commented 2 years ago

Hello,

Apologies for the delayed reply.

To get the test compiling I used the setup described in the docs. I used the sysroot, includes and libs from the petalinux project sysroot for the board. I also cross compiled the libsycl.so and libpi_opencl.so for aarch64 for running on the board's architecture. I encountered some unsupported functionality for aarch64 in sycl/source/detail/platform_util.cpp, which did not seem important for the tests so the #define body for aarch64 and ARM was temporarily removed for cross compiling the libraries.

I wanted to ask some follow up questions regarding the pipelining if possible.

If the xilinx backend can perform loop pipelining what is the purpose of adding the annotation? Also I wanted to ask for some more information regarding _ssdm_op_SpecDataflowPipeline? In what way does this function annotate the dataflow? Is it something recognised by the xilinx compiler to enable the dataflow optimization?

Thank you.

lforg37 commented 2 years ago

Hello,

If the xilinx backend can perform loop pipelining what is the purpose of adding the annotation?

Using the annotation you can force the backend to pipeline or forbid it to pipeline a code section, and avoid having to rely on the backend heuristic to choose if the code has to be pipelined or not.

Also I wanted to ask for some more information regarding _ssdm_op_SpecDataflowPipeline? In what way does this function annotate the dataflow? Is it something recognised by the xilinx compiler to enable the dataflow optimization?

I think it was a built-in recognized by the HLS backend. A PR (#132) is ongoing to change the annotation format for something similar to what is done for pipelining.

In order to help keeping issues tidy, can you check with the HLS report that the pipeline decoration was taken into account ? That way we can either correct the pipeline annotation or close the issue. Feel free to open a discussion for non-issue related topics.

keryell commented 2 years ago

This is quite interesting work! We have not tried SYCL on MPSoC for years because when we tried, all the software stack was too old for SYCL and it was a pain to get just some minimal stuff working. But now that the latest Ubuntu is supported https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/1413611532/Canonical+Ubuntu it looks like we should come back on it. It would be great to update our current SYCL recipe with your recipe, so other people can benefit from your experience! :-) At least you should try with latest Vitis 2021.1 and the latest XRT from master branch. https://github.com/triSYCL/sycl/pull/132 should land soon to improve the performance. The examples you looked at from https://github.com/triSYCL/triSYCL/blob/master/tests/pipeline are super old, targeting our first implementation with a lot of workarounds due to the limited implementation. Now we have switched to the oneAPI DPC++ SYCL device compiler, so you should look at the examples from https://github.com/triSYCL/sycl/tree/sycl/unified/master/sycl/test/on-device/xocc instead. Feel free to contribute examples too. :-) Thanks.