Open mpayrits opened 10 months ago
Is there any progress with this? I have a C++/CUDA project which uses the g++ and nvcc compilers, and I make heavy use of grid synchronization (`this_grid().sync()`) in the code. I was thinking about using circle in the project, but then found this thread...
Hi,
I was trying to get one of my favourite parts of the CUDA toolkit, the cooperative groups API, to work with circle the other day. This issue documents the errors I ran into and provisional CUDA-header-patching workarounds for them, as well as a few suggestions. It ended up being quite long, but I hope all the information I crammed in here comes across as helpful, which was certainly the intention. I'm also slightly wary of posting here, given the low number of replies, but it still seems like the best option, so here goes.
A lot of issues pop up when, say, compiling CUDA samples that use cooperative groups. Most of them are due to circle, but a few big ones seem to be on NVidia's side, since compilation also fails with nvcc's younger brother, nvc++. I've submitted a bug report and opened a forum topic with NVidia regarding those.
The first issue to work around with circle is the fact that using the CUDA 12.2 bundled with HPC SDK 23.7 causes the compilation of any `.cu` file to fail with the error message:

This is due to the removal of a check for whether `_NVHPC_CUDA` is defined in the `sm_32_atomic_functions.h` CUDA header. The check was present in a previous version of the toolkit, and its removal is the subject of my bug report to NVidia. It can be worked around without any adverse side effects by passing `-D__SM_32_ATOMIC_FUNCTIONS_H__` to circle for every GPU compilation.

I'm using circle build 200 on Kali Linux running in WSL2 with gcc 12.3.0-5 and libstdc++ 13.1.0-6 (somehow). I have a laptop GeForce RTX 2060 with SM level 7.5. I set the `CUDA_PATH` environment variable and used the following alias to compile CUDA samples:

Here's a chronologically ordered list of issues I ran into when compiling CG-related CUDA samples:
Compiling binaryPartitionCG immediately fails with a segmentation fault. The following change somehow resolves that:

The next error message that pops up is:

and a similar error message appears with `__type_blockIdx` in place of `__type_threadIdx`. `vec3_to_linear` expects its first parameter to be a `dim3`. The CUDA programming guide prescribes the type of `threadIdx` as `uint3`, and nvcc sees it as such. `dim3` has a non-explicit converting constructor from `uint3`, so passing `threadIdx` should work out of the box. However, circle sees its type as `__type_threadIdx`, which is distinct from `uint3` (somehow overriding the definition in `<device_launch_parameters.h>`) but is implicitly convertible to `uint3`. Unfortunately, implicit convertibility is not transitive, and `__type_threadIdx` is not implicitly convertible to `dim3`.

A workaround that allowed me to continue was to add the following implicitly converting constructors to `dim3` in `<vector_types.h>`:

A slightly more robust solution would be to add a conversion-to-`dim3` operator to `__type_threadIdx`. But whenever someone writes a class with a converting constructor from `uint3` and then wants to convert `threadIdx` to it, circle will break, so a different solution would be ideal.

Inspecting the output of
`strings circle` hints at circle implementing the `x`, `y`, `z` members of `__type_threadIdx` as properties (neat!) that delegate to function calls. Would it be possible to implement "namespace-level properties" that delegate global-variable accesses to function calls, and implement `threadIdx` as an actual `uint3` that way (and similarly for the other built-in variables)? Just an idea.

With this, `binaryPartitionCG` compiles and gives the same output as when compiled with nvcc. Next, compiling reductionMultiBlockCG yields:

It looks like the `__trap` intrinsic is missing entirely. Adding

to the top of the sample, before the includes, fixed compilation. There seem to be many more functions in `device_functions.h`, where `__trap` is declared, whose implementations are missing, including some surprising ones like `__expf` and the `__fsub` family. The commented-out ones in this test program are some (but perhaps not all) of them.

Yet another reduction sample, compiled with
first runs into a bunch of CUDA issues, resolved by the patch I posted here. An alternative to the patch is specifying `-D_CG_USER_PROVIDED_SHARED_MEMORY`. For SM levels `>= 8.0`, this define has to be specified regardless when compiling with circle, as one otherwise runs into the following error:

After that, the program compiles, but we get:

I'm assuming `warpSize` is seen as either 0 or some junk uninitialized value, because the program then crashes wildly with the message "Kernel execution failed : (700) an illegal memory access was encountered." But adding

before the includes in `reduction_kernel.cu` again fixes everything. It seems that `warpSize` is another internal symbol that circle needs to define.

An additional issue pops up when compiling something like `cooperative_groups::this_thread_block()` with CUDA 12.0 instead of 12.2:

The full referenced assembler statement is:
I'm no assembler expert, but I have the impression that circle expects GCC extended asm syntax in `asm` blocks, while CUDA asm syntax, though not really documented, appears to be less restrictive. Namely, it seems to allow leaving `%` characters unescaped when they're not followed by a digit, or by a single letter and a digit. The expressions `%start` and `%extended` in the referenced `asm` statement are like that and don't conform to the GCC syntax. After changing them in the header to `%%start` and `%%extended`, respectively, the code compiles and executes seemingly correctly with both circle and nvcc.

Rather than relaxing the `%` escaping rules in circle too, it's probably much better to just recommend using a newer version of the CUDA toolkit such as 12.2, where `%` characters seem to be escaped more consistently.

If it's useful to anyone, I'm attaching a diff of all the changes I had to make to the 12.2 CUDA headers to be able to use cooperative groups comfortably (for now). It extends the patch from here with circle-specific additions.
Cheers, Mat