tgaddair closed 5 months ago
cc @abcdabcd987
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (07a40b9) 43.27% compared to head (2a32749) 43.27%. Report is 1 commit behind head on master.
Thanks!
@yzh119 Can you take a look?
When there is a large imbalance (>= 65 elements in the batch) in the size of two or more segments in a batch, it can lead to deadlocks in the `sgmv_shrink` kernel.

The crux of the issue was that each grid block can execute a dynamic number of steps depending on the size of its segment `(s_end - s_start)`. However, during each step the block will call `grid.sync()`. If one block executes more steps than another, it will call `grid.sync()` a different number of times, leading to a deadlock.

The solution presented here is to compute the max number of steps from the largest segment, and then call `grid.sync()` at the end of the kernel for the difference between the max steps and the current block's steps.

Because the length of the `s` vector is generally very small (< batch size), the loop here should not introduce noticeable latency. However, it may be worth exploring more optimized solutions to this problem in a follow-up.

Note that this issue only occurs when using cooperative groups.
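To illustrate the sync-count mismatch and the padding fix without a GPU, here is a minimal host-side sketch in Python. It is not the kernel code from this PR: each thread stands in for one grid block, `threading.Barrier` stands in for `grid.sync()`, and a barrier timeout stands in for the deadlock. The names `STEP`, `steps_for`, and `run_blocks` are hypothetical, and the chunk size of 64 rows per step is an assumption made only for this example.

```python
import threading

STEP = 64  # assumed rows processed per step; illustrative only


def steps_for(seg_len, chunk=STEP):
    """Number of loop iterations a block needs for a segment of seg_len rows."""
    return (seg_len + chunk - 1) // chunk


def run_blocks(s, pad, timeout=0.5):
    """Simulate one block per segment of the offsets vector s.

    Each block hits the shared barrier once per step. With pad=True it
    then hits the barrier (max_steps - own_steps) more times, so every
    block reaches the barrier the same number of times.
    Returns True if all blocks finish, False if any block times out.
    """
    steps = [steps_for(s[i + 1] - s[i]) for i in range(len(s) - 1)]
    max_steps = max(steps)  # loop over s; cheap because s is short
    barrier = threading.Barrier(len(steps))
    finished = [False] * len(steps)

    def block(i):
        try:
            for _ in range(steps[i]):
                barrier.wait(timeout)  # stand-in for grid.sync() in the main loop
            if pad:
                for _ in range(max_steps - steps[i]):
                    barrier.wait(timeout)  # extra syncs to match the slowest block
            finished[i] = True
        except threading.BrokenBarrierError:
            pass  # a timeout here models the hang seen on the GPU

    threads = [threading.Thread(target=block, args=(i,)) for i in range(len(steps))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(finished)


if __name__ == "__main__":
    offsets = [0, 1, 66]  # segment sizes 1 and 65: 1 step vs 2 steps
    print("without padding:", run_blocks(offsets, pad=False))
    print("with padding:   ", run_blocks(offsets, pad=True))
```

With segment sizes of 1 and 65, the two simulated blocks need 1 and 2 steps respectively under the assumed chunk size. Without padding, the second block waits alone at the barrier until the timeout fires. With padding, both blocks reach the barrier twice and finish.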
Related: