triSYCL / sycl

SYCL for Vitis: Experimental fusion of triSYCL with Intel SYCL oneAPI DPC++ up-streaming effort into Clang/LLVM

Routing error, congestion #133

Closed: gogo2 closed this issue 2 years ago

gogo2 commented 3 years ago

Regarding the project I mentioned in #123:

We've finished writing all the kernels, so I decided to check whether it still compiles for Alveo (I have 7 kernels and get 13 IPs generated). Unfortunately, I get routing errors related to congestion:

WARNING: [Route 35-3311] The design has high localized SLL routing demand.Router might not be able to find a suitable solution to route all SLR crossing nets.Use SSI placer directives or location constraints that reduce l$
[...]
ERROR: [Route 35-3339] The router is unable to resolve localized SLL routing demand. Use SSI placer directives or location constraints that reduce localized SLL routing congestion
ERROR: [Route 35-368] Router failed to resolve global congestion

Is there any way to deal with such issues at the SYCL level or, more likely, by tuning how Vitis is invoked (sycl_vxx.py?)?

I use the latest unified/next branch (c5821e3) and target xilinx_u250_gen3x16_xdma_3_1_202020_1 with the SPIR flow. hw_emu compiles and works correctly.

keryell commented 3 years ago

It is not yet possible to handle this with direct SYCL decorations, but it is at least possible to add specific Vitis options with the decoration shown in https://github.com/triSYCL/sycl/blob/sycl/unified/master/sycl/test/on-device/xocc/simple_tests/kernel_argument_forwarding.cpp if you have an idea of how this is usually solved with Vitis (which I do not). Otherwise, have you tried the HLS path instead of the legacy SPIR one? That is where we are focusing our efforts now.

gogo2 commented 3 years ago

it is at least possible to add specific Vitis options with the decoration [...] if you have an idea of how this is usually solved with Vitis (which I do not).

That should be helpful, I'll give it some tries, thanks!

Otherwise, have you tried the HLS path instead of the legacy SPIR one? That is where we are focusing our efforts now.

Yep, I've tried to compile the project with the fpga64_hls_hw target, but I get a bunch of "Pointer to pointer is not supported" errors during high-level synthesis, which is kind of strange...

Also, with the simple vec_add example (which compiles correctly) I get a runtime error:

[XRT] ERROR: ClEnqueueNDRangeKernel : CL_INVALID_WORK_GROUP_SIZE case 2
[...]
OpenCL 1.x and 2.0 requires to pass local size argument even if required work-group size was specified in the program source -54 (CL_INVALID_WORK_GROUP_SIZE)

(The same shows up with fpga64_hls_hw_emu.)
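
For reference, the explicit way to provide that local size from SYCL is an nd_range launch, which makes the runtime pass a work-group size to the OpenCL enqueue. A minimal vec_add-style sketch using the standard SYCL 2020 API (placeholder names; the header/namespace spelling and whether this fork's HLS flow accepts an nd_range launch at all are assumptions, see the parallel_for discussion below):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 1024;
  std::vector<int> a(n, 1), b(n, 2), c(n, 0);
  sycl::queue q;
  {
    sycl::buffer<int> ba{a}, bb{b}, bc{c};
    q.submit([&](sycl::handler &h) {
      sycl::accessor A{ba, h, sycl::read_only};
      sycl::accessor B{bb, h, sycl::read_only};
      sycl::accessor C{bc, h, sycl::write_only};
      // Global size n with an explicit local size of 16: the local size is
      // what the runtime forwards as the work-group size when enqueueing.
      h.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{16}},
                     [=](sycl::nd_item<1> it) {
                       auto i = it.get_global_id(0);
                       C[i] = A[i] + B[i];
                     });
    });
  } // Buffers go out of scope here and copy the results back to the host.
}
```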

gogo2 commented 3 years ago

Ok, I managed to pass additional arguments to the kernel compilation process, but:

sycl_vxx.py accepts --vitis_comp_argfile and --vitis_link_argfile arguments, but the files specified by the compiler have randomized names, live in the /tmp directory and are always empty, so I assume those options are not really used at the moment.

I would propose a solution where those configuration files are specified by the user through environment variables (and, once the whole toolchain becomes mature enough, this could also be done via CMake parameters).

I think that this also aligns better with SYCL as a "single-source heterogeneous programming model" than platform/vendor-specific functions/decorations like sycl::xilinx::kernel_param.

I can contribute such a solution to the sycl_vxx.py script, if you want.

lforg37 commented 3 years ago

Thank you for your feedback!

For some reason (which is unclear to me), for 6 out of 7 kernels the compiler creates an additional kernel with a slightly different name, and as far as I can tell, additional arguments aren't passed to the compilation of those kernels.

Are you using parallel_for to start your kernels? This is not yet supported by our tool (as we have no support for kernel duplication for now, there would be no difference between a parallel_for and a standard for-loop nest). As an alternative you can check #132, which provides native and templated loop unrolling. The native one needs the other changes of the PR to work, but the templated one should be usable as is.

It is probably also the reason for the runtime issue in vec_add with the HLS flow.
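
To make the suggested rewrite concrete, here is a sketch reusing the placeholder accessors A, B, C and size n from the vec_add snippet earlier in the thread (an illustration of the idea only, not code from the tool):

```cpp
// Before: a parallel_for over the index space, not yet supported by the flow.
// h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { C[i] = A[i] + B[i]; });

// After: a single work-item kernel iterating over the same index space,
// leaving a plain loop nest for HLS to pipeline or unroll.
h.single_task<class vadd_single>([=] {
  for (size_t i = 0; i < n; ++i)
    C[i] = A[i] + B[i];
});
```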

Yep, I've tried to compile the project with the fpga64_hls_hw target, but I get a bunch of "Pointer to pointer is not supported" errors during high-level synthesis, which is kind of strange...

We didn't face this issue: can you provide a code sample for which it appears?

sycl_vxx.py accepts --vitis_comp_argfile and --vitis_link_argfile arguments, but the files specified by the compiler have randomized names, live in the /tmp directory and are always empty, so I assume those options are not really used at the moment.

These files are populated when using a -Xsycl-target-backend or -Xsycl-target-linker option on the command line.

I would propose a solution where those configuration files are specified by the user through environment variables (and, once the whole toolchain becomes mature enough, this could also be done via CMake parameters). I think that this also aligns better with SYCL as a "single-source heterogeneous programming model" than platform/vendor-specific functions/decorations like sycl::xilinx::kernel_param.

I think it is a good idea to have both, but I don't think we can replace all sycl::xilinx properties with configuration files. Indeed, having the annotation on the C++ side avoids the need to identify which LLVM IR variable corresponds to which C++ variable.

I can contribute such a solution to the sycl_vxx.py script, if you want.

Feel free to open a PR on it.

gogo2 commented 3 years ago

Are you using parallel_for to start your kernels?

Yes, I'm invoking parallel_for. Previously my tests with parallel_for executed correctly in spite of this, but it increases hardware resource usage, so it may also be the cause of my congestion issues...

We didn't face this issue: can you provide a code sample for which it appears?

I will prepare a sample. Can this also be caused by parallel_for usage?

These files are populated when using a -Xsycl-target-backend or -Xsycl-target-linker option on the command line.

OK, so it isn't dead code and I shouldn't mess with the --vitis_comp_argfile and --vitis_link_argfile options?

I think it is a good idea to have both, but I don't think we can replace all sycl::xilinx properties with configuration files. Indeed, having the annotation on the C++ side avoids the need to identify which LLVM IR variable corresponds to which C++ variable.

Of course, I am not suggesting removing the in-source option; it may be the better solution in many cases.

lforg37 commented 3 years ago

I will prepare a sample.

Thanks!

Can this also be caused by parallel_for usage?

Maybe, but I think it is more related to an issue with memory allocation. The SPIR flow has a very aggressive inlining behaviour, so maybe the issue disappears with it due to the inlining and optimisation.

OK, so it isn't dead code and I shouldn't mess with the --vitis_comp_argfile and --vitis_link_argfile options?

No, it is not dead code: you can pass arguments to v++ from the clang invocation using the -Xsycl-target-backend and -Xsycl-target-linker options.
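
For example, adding something like -Xsycl-target-backend "--optimize 2" to the clang++ invocation should populate the file passed as --vitis_comp_argfile, while options given through -Xsycl-target-linker should land in the --vitis_link_argfile one (the exact mapping between the two options and the two argfiles is an assumption based on their names).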

gogo2 commented 2 years ago

Maybe, but I think it is more related to an issue with memory allocation.

It seems that this issue was caused by parallel_for usage. I prepared for-loop-wrapped single_task versions of the kernels and now all of them pass the high-level synthesis step in the HLS path. So no need for a code sample :)

As you suggested, the duplicate kernels were caused by parallel_for, so the single_task versions had significantly lower congestion and compiled successfully in both the SPIR and HLS paths. For each compilation I added only the --optimize 2 option for the kernel compilation, without other Vivado link properties.

However, I get wrong results from the HLS version, so something somewhere isn't computing correctly. I haven't yet discovered what exactly, but I'll try to look into it. I will also check whether it is caused by the --optimize flag.

Many thanks for your help. Really looking forward to your progress on the toolchain. I'm going to provide my tiny contribution(s) soon.

keryell commented 2 years ago

We definitely need to provide a parallel_for even with the HLS path. The problem you might hit with parallel_for on the SPIR path is that, if I remember correctly, the HLS compiler in OpenCL mode creates kernels with 4096 work-items by default; this might be bigger than what you need and in that case use too many resources, hence the congestion.

keryell commented 2 years ago

Otherwise, regarding the congestion problem, 2 colleagues of mine mentioned:

For this particular problem, the placer will somewhat decide which SLLs to use. Try placing with a different directive; the placer call needs to be place_design -directive SSI_SpreadLogic_high.

High SLL utilization on a U250 via Vitis indicates you have a kernel straddling an SLR boundary. If you have multiple small kernels, assign them to SLRs (there's a top-level flag to assign by kernel). If you have one large kernel that doesn't fit in an SLR, you might need to hand-partition it and pass an XDC file with the assignments.

On our side, we need to see how to provide this control at the SYCL level in a practical way.
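
In the meantime, at the raw Vitis level, the two suggestions above could be handed to v++ through a configuration file passed with --config (for instance via the -Xsycl-target-linker mechanism discussed earlier). A hedged sketch, with placeholder kernel and SLR names and property paths to be double-checked against the Vitis documentation:

```
# Ask the Vivado placer to spread logic across SLRs (first suggestion above).
[vivado]
prop=run.impl_1.STEPS.PLACE_DESIGN.ARGS.DIRECTIVE=SSI_SpreadLogic_high

# Pin each compute unit to an SLR (the "top-level flag to assign by kernel").
[connectivity]
slr=my_kernel_1:SLR0
slr=my_other_kernel_1:SLR1
```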