yuanming-hu opened this issue 2 years ago
Extension: add warp size query and control. Warp-level intrinsics exist in Vulkan and Metal, and on those platforms some devices use a warp size different from 32; some devices even allow custom warp sizes (subgroup size control & subgroup operations).
@bobcao3 Can't agree more! :-)
Changes I would like to see: in addition to using CUDA's warp-level primitives, we should look into adding higher-level intrinsics directly, such as subgroup add, subgroup scan, etc. These are supported in Vulkan, and the device driver will provide an optimal implementation depending on the device. On CUDA we can provide our own wrappers for these higher-level primitives. Reference: https://www.khronos.org/blog/vulkan-subgroup-tutorial
Would love to see this! Btw Metal has pretty good warp intrinsics support as well (they call it SIMD-group). See table 6.13 in https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf
One addition to this proposal: warp intrinsics are a great add-on, but in the meantime we also need a design to formalize our parallelization strategy. Right now it's quite vague to users how a Taichi for-loop iteration is mapped to a GPU thread (TL;DR: it's backend-dependent). I think we need to offer an explicit spec on this (cc @strongoier).
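For concreteness, a minimal sketch of how the mapping currently surfaces to users on the CUDA backend; the `ti.loop_config(block_dim=...)` hint below is an assumption about the available API (older versions used `ti.block_dim`), and the guarantees behind it are exactly what such a spec would pin down:

```python
import taichi as ti

ti.init(arch=ti.cuda)

x = ti.field(ti.f32, shape=1024)

@ti.kernel
def scale():
    # On the CUDA backend, iterations of this outermost range-for are
    # distributed across GPU threads; the hint below requests 128 threads
    # per block. The exact mapping rules are what the spec would define.
    ti.loop_config(block_dim=128)
    for i in range(1024):
        x[i] *= 2.0
```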
Should we remove the mask part from the intrinsics? It seems like only CUDA and AVX512 support these masks.
I want to take care of the `__ballot_sync` intrinsic!
Continuing discussions on @bobcao3's question:
Should we remove the mask part from the intrinsics? It seems like only CUDA and AVX512 support these masks.
We can hard-code it to all for now, but due to the complexity of scheduling and the non-guaranteed lock-step execution, using the right mask probably needs the compiler to figure out whether there can be divergence or not (when there's divergence, we need to run `int mask = match_any_sync(activemask(), data);` to get the right mask). I think handing masks over to the user may make it significantly harder to code, while also breaking compatibility with non-CUDA devices.
My opinion: I agree exposing masks can be extra trouble for users, and can harm portability. Does anyone know a frequent CUDA use case where explicitly specifying the masks is helpful? If not then maybe we should not expose masks.
I agree exposing masks can be extra trouble for users, and can harm portability.
Also vote for hiding the masks beneath Taichi's interface.
The masks are extremely troublesome and hard to understand, especially in Taichi, since we have hidden a great many parallelization details in favor of elegant parallel programming. The prerequisite for exposing masks is a set of more direct APIs to manipulate parallelization.
Does anyone know a frequent CUDA use case where explicitly specifying the masks is helpful?
Special stencil patterns covering specific near neighbors (star stencil etc.) might need special masks, but such optimizations can be handled internally in Taichi. We can also quickly add the mask APIs when needed.
According to the CUDA API, the masking behavior is really unexpected. If an active thread executes an instruction while it is not in the mask, the result is unexpected behavior; the mask is only a convergence requirement. Now comes the tricky part: there is no explicit convergence requirement in CUDA, thus the mask must be queried every time we've taken a branch. Using the ALL mask in divergent control flow can result in a GPU hang, while using `__activemask()` does not guarantee reconvergence after branching. Thus we should definitely hide the mask, but it also seems quite tricky to implement masks internally. I would say we need to maintain a mask variable once we encounter an `IfStmt`.
Masks in vector processing, such as AVX512 or RISC-V Vector, are very different from CUDA's.
I would like to take care of the `__shfl_xor_sync` intrinsic!
I've got a naive question: if I want to implement a task in this issue or another open issue, how do I know whether somebody is already doing the same work as me?
I've got a naive question: if I want to implement a task in this issue or another open issue, how do I know whether somebody is already doing the same work as me?
Good question. As long as nobody says "I'll take this task" and the issue has no assignee, you are safe to assume that nobody is working on it. Before you start coding, it would be nice to leave a comment "let me implement XXXX" so that people know you are working on it :-)
Changes I would like to see: in addition to using CUDA's warp-level primitives, we should look into adding higher-level intrinsics directly, such as subgroup add, subgroup scan, etc. These are supported in Vulkan, and the device driver will provide an optimal implementation depending on the device. On CUDA we can provide our own wrappers for these higher-level primitives.
Recent NVIDIA GPUs (Ampere and later) also support the `reduce_sync` variant of intrinsics: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-reduce-functions
In particular, slide 47 of this deck says the `__reduce_op_sync` warp intrinsics are faster than a warp-shuffle-based implementation by 10x: https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21170-cuda-on-nvidia-ampere-gpu-architecture-taking-your-algorithms-to-the-next-level-of-performance.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczpcL1wvd3d3Lmdvb2dsZS5jb21cLyIsIm5jaWQiOiJlbS1hbm5vLTkyMTMzOS12dDIwIn0
Recent NVIDIA GPUs (Ampere and later) also support the `reduce_sync` variant of intrinsics: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-reduce-functions
In particular, slide 47 of this deck says the `__reduce_op_sync` warp intrinsics are faster than a warp-shuffle-based implementation by 10x: https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21170-cuda-on-nvidia-ampere-gpu-architecture-taking-your-algorithms-to-the-next-level-of-performance.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczpcL1wvd3d3Lmdvb2dsZS5jb21cLyIsIm5jaWQiOiJlbS1hbm5vLTkyMTMzOS12dDIwIn0
Wow, that sounds quite attractive. Thanks for pointing this out. We need to dispatch the code according to compute capability. One place to look at: https://github.com/taichi-dev/taichi/blob/d82ea9045792a1f14a04a03c0b9292bb4c7235c5/taichi/runtime/llvm/locked_task.h#L28
@qiao-bo Could you add this to the feature list and coordinate its development? Many thanks!
@yuanming-hu @masahi It turns out to be a bit difficult to support the new reduce warp intrinsics at this moment. For example, `__reduce_add_sync (i32)` will need to be mapped to `redux.sync.add.s32`. This new `redux` instruction is only supported since LLVM 13 (https://github.com/llvm/llvm-project/blob/release/13.x/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td). I also tried bypassing NVVM and using PTX asm in our runtime, but LLVM 10 wouldn't let us because of the PTX JIT compilation.
The migration to LLVM 12 is on our roadmap. Nevertheless, it may still lack support for this warp reduce ;). For the purpose of this issue, I suggest moving this feature proposal to another issue for later work. WDYT?
Sounds good - we probably need to postpone the implementation until we have LLVM >= 13.
(If someone insists on implementing that, they can also consider using inline PTX assembly.)
I will take care of the `__syncwarp` intrinsic.
I'll take care of `__uni_sync`.
I will take care of the `__syncwarp` intrinsic.
I'm working on `match_all`.
Update: Since we are approaching the v1.1.0 release, I would like to give an intermediate summary of this issue.
Thanks to our contributors, the list of warp-level intrinsics has been fully implemented. The milestone has also been achieved, namely using the intrinsics to implement a parallel scan (https://github.com/taichi-dev/taichi_benchmark/blob/main/pbf/src/taichi/scan.py), thanks to @YuCrazing.
As the next step, the following related tasks are planned:
- `raw_kernel` support
- In the long term, we plan to provide high-level primitives that are backend-agnostic and can abstract over CUDA warp intrinsics, Vulkan subgroups, Metal SIMD groups, CPU vectorization, etc.
Since this issue is meant to address CUDA warp-level intrinsics, maybe we can use another issue to track the progress of the mentioned tasks?
Hi, I wanted to know if anyone is working on adding support for the `subgroup*` operations listed above? I can add support for some of the simple ones (`shuffle*`), but it'd be great if someone could look into the ballot ops (supported by `GL_KHR_shader_subgroup_ballot`), as I'm not sure how to implement them (the return type is a `uvec4`) and I need to use them for a project.
Hi, I wanted to know if anyone is working on adding support for the `subgroup*` operations listed above? I can add support for some of the simple ones (`shuffle*`), but it'd be great if someone could look into the ballot ops (supported by `GL_KHR_shader_subgroup_ballot`), as I'm not sure how to implement them (the return type is a `uvec4`) and I need to use them for a project.
Maybe you can use a structure similar to how TextureStmt returns vec4...
Maybe you can use a structure similar to how TextureStmt returns vec4...
Can you share the link to it? I can't find `TextureStmt` while searching.
(For people who are familiar with CUDA/LLVM, this is a good starting issue. For most intrinsics, you will only need to write < 10 LoC to implement the API and < 50 LoC to test it. Come join us! :-))
Intro
There has been an increasing Taichi user need for writing high-performance SIMT kernels. For these use cases, it is fine to sacrifice a certain level of portability.
Currently, when running on CUDA, Taichi already follows the SIMT execution model. However, it lacks support for warp-level and block-level intrinsics (e.g., `__ballot_sync` and `__syncthreads`) that are often needed in fancy SIMT kernels.
Implementation plan
Support `__syncthreads` and add explicit shared memory support. We may even consider TensorCore and ray-tracing intrinsics.
List of CUDA warp-level intrinsics
We plan to implement all of the following warp-level intrinsics:
- `__all_sync` (should be named `all_nonzero` in our API to avoid conflict with `all` in Python) (by @varinic, https://github.com/taichi-dev/taichi/pull/4718)
- `__any_sync` (should be named `any_nonzero` to avoid conflict with `any` in Python) (by @varinic, https://github.com/taichi-dev/taichi/pull/4719)
- `__uni_sync` (should be named `unique`) (by @0xzhang, https://github.com/taichi-dev/taichi/pull/4927#event-6570449057)
- `__ballot_sync` (by @Wimacs, https://github.com/taichi-dev/taichi/pull/4641)
- `__shfl_sync (i32)` (by @varinic, https://github.com/taichi-dev/taichi/pull/4717)
- `__shfl_sync (f32)` (by @varinic, https://github.com/taichi-dev/taichi/pull/4717)
- `__shfl_up_sync (i32)` (by @YuCrazing, https://github.com/taichi-dev/taichi/pull/4632)
- `__shfl_up_sync (f32)` (by @YuCrazing, https://github.com/taichi-dev/taichi/pull/4632)
- `__shfl_down_sync (i32)` (by @yuanming-hu, https://github.com/taichi-dev/taichi/pull/4616)
- `__shfl_down_sync (f32)` (by @caic99, https://github.com/taichi-dev/taichi/pull/4819)
- `__shfl_xor_sync` (by @varinic, WIP, https://github.com/taichi-dev/taichi/pull/4642)
- `__match_any_sync` (by @GaleSeLee, https://github.com/taichi-dev/taichi/pull/4921)
- `__match_all_sync` (by @GaleSeLee, https://github.com/taichi-dev/taichi/pull/4961)
- `__activemask` (by @GaleSeLee, https://github.com/taichi-dev/taichi/pull/4918)
- `__syncwarp` (by @GaleSeLee, https://github.com/taichi-dev/taichi/pull/4917)

See here and the CUDA doc for more details :-)
API
We may pick one of the following API formats, depending on whether warp-level and block-level intrinsics should be put under the same namespace:
- `ti.simt.X`, such as `ti.simt.ballot()` and `ti.simt.warp_sync()`
- `ti.simt.warp.X`, such as `ti.simt.warp.ballot()` and `ti.simt.warp.sync()`
Please let me know which one you guys prefer :-)
Example
Computing the sum of all values in a warp using `shfl_down`:
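A minimal sketch of the idea, assuming a `ti.simt.warp.shfl_down_i32(mask, value, offset)` wrapper as discussed in this issue (the shipped signature may differ):

```python
import taichi as ti

ti.init(arch=ti.cuda)

a = ti.field(ti.i32, shape=32)

@ti.kernel
def warp_sum():
    ti.loop_config(block_dim=32)
    for i in range(32):
        val = a[i]
        # Tree reduction: each step folds the value from `offset` lanes above
        # onto the current lane; after 5 steps lane 0 holds the warp-wide sum.
        offset = 16
        while offset > 0:
            val += ti.simt.warp.shfl_down_i32(ti.u32(0xFFFFFFFF), val, offset)
            offset //= 2
        if i == 0:
            a[0] = val
```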
Steps and how we collaborate
1. `InternalFuncCallExpression` and `InternalFuncStmt`: one issue is that in the LLVM codegen the generated function takes `RuntimeContext *`, which is not needed. We need to make that optional. (Update: this is done in https://github.com/taichi-dev/taichi/pull/4616)
2. Implement the intrinsics as `ti.simt.warp.X`.

Currently we are at step 2. For everyone who wants to contribute to this, please take one single intrinsic function to implement in a PR. That would simplify review and testing.
Please leave a comment (e.g., "I'll take care of `ti.simt.warp.shfl`!") in this issue, so that other community members know that you are working on it and we avoid duplicated work.
For example, if you wish to implement `ballot`, fill in
https://github.com/taichi-dev/taichi/blob/84973201e488bfcce1fa980457fe74e9141cefb3/python/taichi/lang/simt.py#L20-L22
and
https://github.com/taichi-dev/taichi/blob/84973201e488bfcce1fa980457fe74e9141cefb3/tests/python/test_simt.py#L23-L26
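For a rough idea of the shape of such a test (a sketch only, not the contents of the linked file; it assumes the wrapper lands as `ti.simt.warp.ballot(predicate)` with the mask hidden, which may not match the final API):

```python
import taichi as ti

ti.init(arch=ti.cuda)

def test_ballot():
    a = ti.field(ti.u32, shape=32)

    @ti.kernel
    def run():
        ti.loop_config(block_dim=32)
        for i in range(32):
            # Every lane of the fully active warp votes "true", so each lane
            # should read back a full 32-bit mask.
            a[i] = ti.simt.warp.ballot(1)

    run()
    for i in range(32):
        assert a[i] == 0xFFFFFFFF
```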
An example PR: https://github.com/taichi-dev/taichi/pull/4632
What we already have
Scaffold code and `shfl_down_i32`
I went ahead and implemented https://github.com/taichi-dev/taichi/pull/4616
LLVM -> NVVM -> PTX code path
We already have a bunch of functions that wrap most of these intrinsics: https://github.com/taichi-dev/taichi/blob/bee97d50335dd1038bd5e3de9d9385da56a0744f/taichi/llvm/llvm_context.cpp#L355-L369
Therefore, in most cases the intrinsics can be implemented in just 3-4 lines of code (plus tests) by calling these functions. For example,
https://github.com/taichi-dev/taichi/blob/22d189519ea1cbef965461618643a38784963bdd/python/taichi/lang/simt/warp.py#L81-L88
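For instance, such a wrapper can be as small as the sketch below; the internal symbol name `cuda_shfl_xor_i32` and the `with_runtime_context` keyword are assumptions modeled on the pattern in the linked snippet, so check `llvm_context.cpp` and `warp.py` for the real names:

```python
from taichi.lang import impl


def shfl_xor_i32(mask, value, offset):
    # Forwards to the runtime wrapper bound in llvm_context.cpp, which in turn
    # lowers to CUDA's __shfl_xor_sync. Names here are illustrative only.
    return impl.call_internal("cuda_shfl_xor_i32", mask, value, offset,
                              with_runtime_context=False)
```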
Milestone
Implement GPU parallel scan (prefix sum)? That would be very useful in particle simulations. Ideas are welcome!
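One possible building block, sketched under the assumption that `ti.simt.warp.shfl_up_i32(mask, value, offset)` is available as listed above: a warp-level inclusive scan, from which block- and grid-level prefix sums can be composed.

```python
import taichi as ti

ti.init(arch=ti.cuda)

x = ti.field(ti.i32, shape=32)

@ti.kernel
def warp_inclusive_scan():
    ti.loop_config(block_dim=32)
    for i in range(32):
        lane = i % 32
        val = x[i]
        # Kogge-Stone scan: each step pulls the value from `offset` lanes below
        # and adds it, doubling the reach every iteration (5 steps for 32 lanes).
        offset = 1
        while offset < 32:
            other = ti.simt.warp.shfl_up_i32(ti.u32(0xFFFFFFFF), val, offset)
            if lane >= offset:
                val += other
            offset *= 2
        x[i] = val
```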
Future steps: making Taichi (kind of) a superset of CUDA!
- `__syncthreads`, `__threadfence`, etc.
- `ti.raw_kernel`, something that provides a 1:1 mapping to a `__global__` CUDA kernel
Appendix: List of higher-level primitives (in Vulkan, Metal, etc., implemented as helpers in CUDA)
Some of these exist in CUDA directly; however, the scope of execution (i.e., the mask) is not involved and `sync` behavior is guaranteed, so they cannot be mapped 1:1 to CUDA and helper functions are needed. (Reference: https://www.youtube.com/watch?v=fP1Af0u097o where NVIDIA talked about implementing these in the drivers)
- `subgroupBarrier`: execution barrier
- `subgroupMemoryBarrier`: memory fence
- `subgroupElect`: elect a single invocation as leader (very useful in atomic reduction)
- `subgroupAll`
- `subgroupAny`
- `subgroupAllEqual`
- `subgroupBroadcast` (might be tricky, as the `id` that is broadcast from is a compile-time constant)
- `subgroupBroadcastFirst` (uses the lowest-id active invocation)
- `ballot` operations (`GL_KHR_shader_subgroup_ballot`)
- `subgroupAdd`
- `subgroupMul`
- `subgroupMin`
- `subgroupMax`
- `subgroupAnd`
- `subgroupOr`
- `subgroupXor`
- `subgroupShuffle`
- `subgroupShuffleXor`
- `subgroupShuffleUp`
- `subgroupShuffleDown`
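To illustrate the "implemented as helpers in CUDA" part, a subgroup add could be emulated with the warp shuffles listed earlier. A sketch, assuming 32-wide warps and the `ti.simt.warp.shfl_xor_i32(mask, value, offset)` wrapper from this issue:

```python
import taichi as ti

ti.init(arch=ti.cuda)

out = ti.field(ti.i32, shape=32)

@ti.func
def subgroup_add_i32(value):
    # Butterfly (XOR) reduction: after log2(32) = 5 steps every lane holds the
    # sum over all 32 lanes, matching subgroupAdd's "every invocation gets the
    # result" semantics. Assumes a full, converged warp.
    total = value
    offset = 16
    while offset > 0:
        total += ti.simt.warp.shfl_xor_i32(ti.u32(0xFFFFFFFF), total, offset)
        offset //= 2
    return total

@ti.kernel
def demo():
    ti.loop_config(block_dim=32)
    for i in range(32):
        out[i] = subgroup_add_i32(i)  # every lane gets 0 + 1 + ... + 31 = 496
```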