seanbaxter / circle

CUDA cooperative groups support #185

Open · mpayrits opened this issue 10 months ago

mpayrits commented 10 months ago

Hi,

I was trying to get one of my favourite parts of the CUDA toolkit, the cooperative groups API, to work with circle the other day. This issue documents the errors I ran into and provisional CUDA-header-patching workarounds for them, as well as a few suggestions. It ended up being quite long, but I hope all the information I crammed in here comes across as helpful, which was certainly the intention. I'm also slightly wary of posting here, given the low number of replies, but it still seems like the best option, so here goes.

A lot of issues pop up when, say, compiling the CUDA samples that use cooperative groups. Most of them are due to circle, but a few big ones seem to be on NVIDIA's side: compilation also fails with nvcc's younger brother, nvc++. I've submitted a bug report and opened a forum topic with NVIDIA about those.

The first issue to work around with circle is the fact that using CUDA 12.2 bundled with HPC SDK 23.7 causes the compilation of any .cu file to fail with the error message:

error: /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/include/sm_32_atomic_functions.hpp:94:43
redefinition of function long long atomicMin(long long*, long long)
<more-lines>

This is due to the removal of a check for whether _NVHPC_CUDA is defined in the sm_32_atomic_functions.h CUDA header; the check was present in a previous version of the toolkit, and its removal is the subject of my bug report to NVIDIA. The problem can be worked around without any adverse side effects by passing -D__SM_32_ATOMIC_FUNCTIONS_H__ to circle for every GPU compilation.
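
The define works because it matches the header's include guard: with it predefined, the preprocessor skips the header body entirely. Schematically (a paraphrase of the guard structure, not a quote of the header):

#ifndef __SM_32_ATOMIC_FUNCTIONS_H__
#define __SM_32_ATOMIC_FUNCTIONS_H__
// ... pulls in sm_32_atomic_functions.hpp, whose definitions of
// atomicMin(long long*, long long) and friends clash without the
// _NVHPC_CUDA check ...
#endif
// predefining __SM_32_ATOMIC_FUNCTIONS_H__ on the command line makes the
// #ifndef above false, so the conflicting definitions are never compiled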

I'm using circle build 200 on Kali Linux running in WSL2 with gcc 12.3.0-5 and libstdc++ 13.1.0-6 (somehow). I have a laptop GeForce RTX 2060 with SM level 7.5. I set the CUDA_PATH environment variable and used the following alias to compile the CUDA samples:

alias mycirc='circle --std=c++20 -D__SM_32_ATOMIC_FUNCTIONS_H__ -I../../../Common --sm_75'

Here's a chronologically ordered list of issues I ran into when compiling CG-related CUDA samples:

  1. Compiling binaryPartitionCG immediately fails with a segmentation fault. The following change somehow resolves that

    diff --git b/include-old/cooperative_groups/details/reduce.h a/include/cooperative_groups/details/reduce.h
    index 3c06df6..d483728 100755
    --- b/include-old/cooperative_groups/details/reduce.h
    +++ a/include/cooperative_groups/details/reduce.h
    @@ -278,8 +278,9 @@ namespace details {
                     *warp_scratch_location =
                         details::reduce(warp, _CG_STL_NAMESPACE::forward<TyVal>(val), op);
             };
    +            using SubwarpType = details::internal_thread_block_tile<num_warps, warpType>;
             auto inter_warp_lambda =
    -                [&] (const details::internal_thread_block_tile<num_warps, warpType>& subwarp, TyRet* thread_scratch_location) {
    +                [&] (const SubwarpType& subwarp, TyRet* thread_scratch_location) {
                     *thread_scratch_location =
                         details::reduce(subwarp, *thread_scratch_location, _CG_STL_NAMESPACE::forward<TyFn>(op));
             };
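
    For context, the shape of the construct the diff touches is roughly the following (a hypothetical distillation in plain C++, not a verified minimal repro; tile is a stand-in for details::internal_thread_block_tile):

    template<unsigned N, typename W>
    struct tile {};   // stand-in for details::internal_thread_block_tile

    template<unsigned num_warps, typename warpType, typename TyRet>
    void reduce_shape() {
        // original form: the dependent template-id spelled directly in the
        // lambda's parameter list (the form that made circle segfault)
        auto lam_before = [&] (const tile<num_warps, warpType>& subwarp, TyRet* loc) {};

        // patched form: the same type hoisted into an alias first
        using SubwarpType = tile<num_warps, warpType>;
        auto lam_after = [&] (const SubwarpType& subwarp, TyRet* loc) {};
    }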
  2. The next error message that pops up is

    error: TyTrunc cooperative_groups::__v1::details::vec3_to_linear(dim3, dim3)
    failure during overload resolution for function TyTrunc cooperative_groups::__v1::details::vec3_to_linear(dim3, dim3)
    function declared at /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups/details/helpers.h:77:34
    /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups/details/helpers.h:97:48
            return vec3_to_linear<unsigned int>(threadIdx, blockDim); 
                                               ^
    error: /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups/details/helpers.h:97:48
    ... included from /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups.h:57:10
    ... included from CudaExample.cu:60:10
    cannot convert lvalue const __type_threadIdx to dim3
    __type_threadIdx declared at GPU implicit declarations:15:1
    dim3 declared at /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/vector_types.h:418:1
              return vec3_to_linear<unsigned int>(threadIdx, blockDim);

    and a similar error message appears with __type_blockIdx instead of __type_threadIdx. vec3_to_linear expects its first parameter to be dim3. The CUDA Programming Guide prescribes the type of threadIdx as uint3, and nvcc sees it as such. dim3 has a non-explicit converting constructor from uint3, so passing threadIdx should work out of the box. However, circle sees its type as __type_threadIdx, which is distinct from uint3 (somehow overriding the definition in <device_launch_parameters.h>) but is implicitly convertible to uint3. Unfortunately, implicit convertibility is not transitive (an implicit conversion sequence may contain at most one user-defined conversion), so __type_threadIdx is not implicitly convertible to dim3.
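
    To make the non-transitivity concrete, here's a standalone sketch with stand-in types (uint3_like and dim3_like are illustrative, not the real CUDA types):

    struct uint3_like { unsigned x, y, z; };

    struct A {                                 // stand-in for __type_threadIdx
        operator uint3_like() const { return {1, 2, 3}; }
    };

    struct dim3_like {                         // stand-in for dim3
        unsigned x, y, z;
        dim3_like(uint3_like v) : x(v.x), y(v.y), z(v.z) {}  // non-explicit, like dim3's
    };

    void take(dim3_like) {}

    void demo() {
        A a;
        uint3_like u = a;                      // OK: one user-defined conversion
        take(u);                               // OK: one user-defined conversion
        // take(a);                            // error: would need two user-defined conversions
        take(dim3_like(uint3_like(a)));        // OK when spelled out explicitly
        // giving A an 'operator dim3_like() const' would also make take(a) legal,
        // which is what the conversion-operator suggestion below amounts to
    }
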
    A workaround that allowed me to continue was to add the following implicitly converting constructors to dim3 in <vector_types.h>:

    #ifdef __circle_lang__
    __host__ __device__ constexpr dim3(__type_threadIdx v) : x(v.x), y(v.y), z(v.z) {}
    __host__ __device__ constexpr dim3(__type_blockIdx v) : x(v.x), y(v.y), z(v.z) {}
    #endif

    A slightly more robust solution would be to add a conversion-to-dim3 operator to __type_threadIdx. But circle will still break whenever someone writes a class with a converting constructor from uint3 and then tries to convert threadIdx to it, so a different solution would be ideal.
    Inspecting the output of running strings on the circle binary hints at circle implementing the x, y, z members of __type_threadIdx as properties (neat!) that delegate to function calls. Would it be possible to implement "namespace-level properties" that delegate global-variable accesses to function calls, and to implement threadIdx as an actual uint3 that way (and similarly for the other built-in variables)? Just an idea.

  3. With this, binaryPartitionCG compiles and gives the same output as when compiled with nvcc. Next, compiling reductionMultiBlockCG yields

    ptxas fatal   : Unresolved extern function '__trap'
    <more lines>

    Looks like the __trap intrinsic is missing entirely. Adding

    #ifdef __circle_lang__
    __host__ __device__ constexpr void __trap() {}
    #endif

    to the top of the sample, before the includes, fixed compilation. There seem to be many more functions in device_functions.h, where __trap is declared, whose implementations are missing, including some surprising ones like __expf and the __fsub family. The commented-out ones in this test program are some (but perhaps not all) of them.
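
    In the same spirit as the __trap stub above, the missing math intrinsics can be bridged with hypothetical fallback stubs that call the precise math functions (assuming circle accepts definitions placed before the includes, as with __trap). Note that they silently change semantics: the precise functions replace the fast approximations, and the __trap stub means the kernel no longer actually traps.

    #ifdef __circle_lang__
    __device__ inline float __expf(float x) { return expf(x); }            // precise expf instead of the fast intrinsic
    __device__ inline float __fsub_rn(float a, float b) { return a - b; }  // plain float subtraction already rounds to nearest
    #endif
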

  4. Yet another reduction sample, compiled with

    mycirc -c reduction.cpp -o reduction.circ.o
    mycirc -c reduction_kernel.cu -o reduction_kernel.circ.o
    circle "$CUDA_PATH/lib64/libcudart.so" reduction.circ.o reduction_kernel.circ.o -o c && ./c --kernel=9

    first runs into a bunch of CUDA issues, resolved by the patch I posted here. An alternative to the patch is specifying -D_CG_USER_PROVIDED_SHARED_MEMORY. For SM level >= 8.0, this define has to be specified regardless when compiling with circle, as one otherwise runs into the following error

    ptxas /tmp/circle-tmp-dir-gUUrk9/reduction_kernel-compute-80.ptx, line 5883; error   : Feature '%reserved_smem_offset_1' requires PTX ISA .version 7.6 or later
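
    For reference, the define just goes on the compile line, e.g. (reusing the mycirc alias from above):

    mycirc -D_CG_USER_PROVIDED_SHARED_MEMORY -c reduction_kernel.cu -o reduction_kernel.circ.o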
  5. After that, the program compiles, but we get

    ptxas warning : Unresolved extern variable 'warpSize' in whole program compilation, ignoring extern qualifier

    I'm assuming warpSize is seen as either 0 or some junk uninitialized value, because the program then crashes wildly with the message "Kernel execution failed : (700) an illegal memory access was encountered." But adding

    #ifdef __circle_lang__
    int const warpSize = 32;
    #endif

    before the includes in reduction_kernel.cu again fixes everything. It seems that warpSize is another internal symbol that circle needs to define.
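
    As a sanity check for the hard-coded constant, the warp size can be queried from the host with the standard runtime API (plain CUDA, nothing circle-specific):

    #include <cassert>
    #include <cuda_runtime.h>

    void check_warp_size() {
        int dev = 0, ws = 0;
        cudaGetDevice(&dev);
        cudaDeviceGetAttribute(&ws, cudaDevAttrWarpSize, dev);
        assert(ws == 32);   // every current NVIDIA GPU reports 32
    }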

  6. An additional issue pops up when compiling something like cooperative_groups::this_thread_block() with CUDA 12.0 instead of 12.2.

    error: /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups/details/memory.h:64:14
    ... included from /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups.h:58:10
    ... included from CudaExample.cu:60:10
    %s must be followed by an operand digit
          asm ("{\n\t" 
               ^

    The full referenced assembler statement is

        asm ("{\n\t"
             " .reg .u32 %start;\n\t"
             " .reg .u64 %extended;\n\t"
             " mov.u32 %start, %%reserved_smem_offset_1;\n\t"
             " cvt.u64.u32 %extended, %start;\n\t"
             " cvta.shared.u64 %0, %extended;\n\t"
             "}"
             : "=" _CG_ASM_PTR_CONSTRAINT(ptr));

    I'm no assembler expert, but I have the impression that circle expects GCC extended asm syntax in asm blocks, while CUDA's asm syntax, though not really documented, appears to be less restrictive. Namely, it seems to allow unescaped % characters when they're not followed by a digit (or by a single letter and a digit). The expressions %start and %extended in the asm statement above are like that and don't conform to the GCC syntax. After changing them in the header to %%start and %%extended, respectively, the code compiles and executes seemingly correctly with both circle and nvcc.
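
    For comparison, here's a self-contained asm block that follows the GCC escaping rules circle appears to enforce (my own sketch, not from the CUDA headers): %% emits a literal % into the PTX, while %0 refers to operand 0.

    __device__ unsigned lane_id() {
        unsigned ret;
        asm("{\n\t"
            " .reg .u32 %%tmp;\n\t"
            " mov.u32 %%tmp, %%laneid;\n\t"   // %laneid is a PTX special register
            " mov.u32 %0, %%tmp;\n\t"
            "}" : "=r"(ret));
        return ret;
    }
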
    Rather than relaxing the % escaping rules in circle to match, it's probably much better to just recommend using a newer version of the CUDA toolkit, such as 12.2, where % characters seem to be escaped more consistently.

If it's useful to anyone, I'm attaching a diff of all the changes I had to make to the 12.2 CUDA headers to be able to use cooperative groups comfortably (for now). It extends the patch from here with circle-specific additions.

Cheers, Mat

AlexanderZvyagin commented 7 months ago

Is there any progress on this? I have a C++/CUDA project that uses the g++ and nvcc compilers. I make heavy use of grid synchronization (this_grid().sync()) in the code. I was thinking about using circle in the project, but then I found this thread...
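
For context, the pattern in question looks roughly like this (a minimal sketch against the standard CUDA runtime API; the kernel and names are illustrative):

#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void step_kernel(float* data, int n) {
    cg::grid_group grid = cg::this_grid();
    // ... phase 1: every block works on its slice of data ...
    grid.sync();   // grid-wide barrier; only valid under a cooperative launch
    // ... phase 2: safe to read results written by any block in phase 1 ...
}

void launch(float* data, int n, dim3 grid_dim, dim3 block_dim) {
    // cooperative kernels must be launched with cudaLaunchCooperativeKernel,
    // and grid_dim must not exceed what can be co-resident on the device
    void* args[] = { &data, &n };
    cudaLaunchCooperativeKernel((void*)step_kernel, grid_dim, block_dim, args, 0, 0);
}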