Hi,

I'm implementing a tensor product operation in Halide that involves gathering inputs and scattering the final output on a GPU. I'm aiming to optimize shared memory usage for better performance, but I'm encountering some challenges.

Objective:
I want to achieve the following optimizations on the GPU:
1. Accumulate the product in a 4x1 (p0 x m0) register block.
2. Load gather_weight into shared memory at the outer reduction loop (c1) in product.
Inside c1, gather_weight requires m1(32) x c0(16) = 512 elements.
Per GPU block, there are m1(32) x p1(32) GPU threads. Since the gather_weight tile (indexed by c0 and m1) is independent of the p1 dimension (the y-axis of GPU threads), we can reuse it across those threads by loading it into shared memory.
Ideal Scenario:
Shared Memory Allocation: Allocate only m1(32) x c0(16) = 512 elements.
Data Loading: Use only a subset of GPU threads, such as m1(32) x (p1/2)(16), to load gather_weight into shared memory.
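In Halide terms, here's a minimal sketch of that intent (hypothetical: the function signature and the names product, gather_weight, c1 are my shorthand for pieces of my real generator, and this expresses what I want rather than something that currently works):

#include "Halide.h"
using namespace Halide;

// Hypothetical sketch of the ideal staging, not working code. `product` is the
// per-thread accumulator, `gather_weight` the weight gather, and `c1` the
// outer RVar after splitting product's reduction axis into c1 x c0(16).
void ideal_staging(Func product, Func gather_weight, RVar c1) {
    Var c = gather_weight.args()[0];  // reduction axis: extent 16 per c1 step
    Var m = gather_weight.args()[1];  // output-column axis: extent 32 per block

    Func gw = gather_weight.in();     // wrapper Func to stage the weights
    gw.compute_at(product, c1)        // refresh once per outer reduction step
      .store_in(MemoryType::GPUShared)  // keep only 32 x 16 = 512 floats live
      .gpu_threads(m, c);             // 32 x 16 loader threads; the other half
                                      // of the 32 x 32 block would sit out
}

Since gather_weight never depends on p, all 32 p1 threads would then read the same 512-element tile out of shared memory.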
Issue Encountered:
Current Observation:
The actual shared memory allocation is significantly larger than expected (16384 elements, where 512 would suffice).
Only the x-axis of GPU threads is used for loading gather_weight into shared memory.
Hypothesis:
Halide might not recognize that gather_weight can be reused across the independent thread variable p1, so it sizes the shared allocation as m1 x c0 x p1 = 32 x 16 x 32 = 16384 elements.
Attempted Solution:
I tried adjusting the schedule to compute gather_weight at output instead of at product.
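Reconstructed as a sketch, the scheduling calls looked roughly like this (not my verbatim generator code; `r` stands for the scatter RDom over positions, r39 in the Stmt below, `c` for product's reduction RVar, r28, and all split names are guesses):

#include "Halide.h"
using namespace Halide;

// Reconstructed sketch of the attempted schedule; all names are assumptions.
void attempted_schedule(Func output, Func product, Func gather_weight,
                        Var m, RDom r, RVar c) {
    Var mo("mo"), mi("mi");
    RVar ryo("ryo"), ryi("ryi"), p1("p1"), p0("p0"), c1("c1"), c0("c0");

    // r.x (weight.extent.2) stays a serial outer loop; r.y walks positions.
    output.update()
        .atomic()                     // scatter through omap needs atomics
        .split(r.y, ryo, ryi, 128)    // 128 output positions per GPU block
        .split(ryi, p1, p0, 4)        // 4x1 register block per thread
        .split(m, mo, mi, 32)
        .gpu_blocks(mo, ryo)          // block_id_x, block_id_y
        .gpu_threads(mi, p1);         // 32 x 32 threads per block

    product.compute_at(output, mi);   // per thread; the p0 extent of 4 ends
                                      // up in registers
    product.update().split(c, c1, c0, 16);

    // Stage the weights at the block level so they land in shared memory,
    // loaded by only a 32 x 16 subset of the threads.
    Func gw = gather_weight.in();
    Var cw = gw.args()[0], mw = gw.args()[1];
    Var wc1("wc1"), wc0("wc0");
    gw.compute_at(output, mo)
      .store_in(MemoryType::GPUShared)
      .split(cw, wc1, wc0, 16)
      .gpu_threads(mw, wc0);          // thread_id_x = m, thread_id_y < 16
}

The resulting loop nest: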
produce output:
  gpu_block o<Default_GPU>:
    gpu_thread m<Default_GPU>:
      output(...) = ...
  for r39:
    gpu_block r39.r39<Default_GPU>:
      gpu_block m.m<Default_GPU>:
        produce GatherWeight:
          for c.wc1:
            gpu_thread c.wc0 in [0, 15]<Default_GPU>:
              gpu_thread m.m0 in [0, 31]<Default_GPU>:
                GatherWeight(...) = ...
        consume GatherWeight:
          gpu_thread r39.p1.p1 in [0, 31]<Default_GPU>:
            gpu_thread m.m1.m1 in [0, 31]<Default_GPU>:
              produce Product:
                for p:
                  Product(...) = ...
                for r28.c1:
                  for p:
                    for r28.c0 in [0, 15]:
                      Product(...) = ...
              consume Product:
                for r39.p1.p0 in [0, 3]:
                  output(...) = ...
And the conceptual Stmt:

let t160 = (maxPos + 127)/128
let t161 = (output.extent.0 + 31)/32
let t163 = output.min.1*output.stride.1
let t162 = (input.min.1*input.stride.1) + input.min.0
for (output.s1.r39$x, 0, weight.extent.2) {
  let t166 = ((output.s1.r39$x - omap.min.1)*omap.stride.1) - omap.min.0
  let t165 = ((output.s1.r39$x - imap.min.1)*imap.stride.1) - imap.min.0
  let t164 = (output.s1.r39$x*weight.stride.2) + output.min.0
  gpu_block<CUDA> (output.s1.r39$y.r39$y.block_id_y, 0, t160) {
    gpu_block<CUDA> (output.s1.m.m.block_id_x, 0, t161) {
      allocate GatherWeight.0[float32 * 2048] in GPUShared
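      // 2048 = m0(32) x c0(16) x c1(4): a whole reduction's worth of weights, not the 512 I hoped for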
      gpu_thread<CUDA> (.thread_id_y, 0, 32) {
        gpu_thread<CUDA> (.thread_id_x, 0, 32) {
          allocate Product.0[float32 * 4] in Register
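          // only the thread_id_y < 16 half of the 32 x 32 block issues the shared loads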
          if (.thread_id_y < 16) {
            produce GatherWeight {
              let t143.s = (output.s1.m.m.block_id_x*32) + t164
              let t167 = .thread_id_x + t143.s
              for (GatherWeight.s0.c.wc1, 0, 4) {
                let t158 = (GatherWeight.s0.c.wc1*16) + .thread_id_y
                GatherWeight.0[(t158*32) + .thread_id_x] = weight[(t158*weight.stride.1) + t167]
              }
            }
          }
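          // barrier so every thread sees the staged weights before reading them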
          gpu_thread_barrier(2)
          consume GatherWeight {
            produce Product {
              let Product.s0.p.loop_extent.s = (maxPos - (output.s1.r39$y.r39$y.block_id_y*128)) - (.thread_id_y*4)
              let t168 = min(Product.s0.p.loop_extent.s, 4)
              for (Product.s0.p.rebased, 0, t168) {
                Product.0[Product.s0.p.rebased] = 0.000000f
              }
              let t169 = min(Product.s0.p.loop_extent.s, 4)
              let t170 = (((output.s1.r39$y.r39$y.block_id_y*32) + .thread_id_y)*4) + t165
              for (Product.s1.r28$x.c1, 0, 4) {
                let t148 = (Product.s1.r28$x.c1*16) - t162
                let t171 = Product.s1.r28$x.c1*16
                for (Product.s1.p.rebased, 0, t169) {
                  let t151 = Product.s1.p.rebased + t170
                  for (Product.s1.r28$x.c0, 0, 16) {
                    Product.0[Product.s1.p.rebased] = Product.0[Product.s1.p.rebased] + (input[((imap[t151]*input.stride.1) + t148) + Product.s1.r28$x.c0]*GatherWeight.0[((Product.s1.r28$x.c0 + t171)*32) + .thread_id_x])
                  }
                }
              }
            }
            consume Product {
              let output.s1.r39$y.p1.p0.epilogue.s = maxPos - (((output.s1.r39$y.r39$y.block_id_y*32) + .thread_id_y)*4)
              let t154.s = (output.s1.m.m.block_id_x*32) - t163
              let t172 = max(min(output.s1.r39$y.p1.p0.epilogue.s, 4), 0)
              let t174 = (((output.s1.r39$y.r39$y.block_id_y*32) + .thread_id_y)*4) + t166
              let t173 = .thread_id_x + t154.s
              for (output.s1.r39$y.p1.p0, 0, t172) {
                let t111 = (omap[output.s1.r39$y.p1.p0 + t174]*output.stride.1) + t173
                let t112 = Product.0[output.s1.r39$y.p1.p0]
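                // atomic: omap can send several p positions to the same output row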
                atomic (output) {
                  output[t111] = output[t111] + t112
                }
              }
            }
            free Product.0
          }
        }
      }
      free GatherWeight.0
    }
  }
}
}
}
Result:
The loop nest now appears closer to the desired structure.
Only half of the thread_id_y range (16 of the 32 threads) is involved in loading data into shared memory.
Remaining Issue:
The shared memory allocation still covers an entire m0 x c0 x c1 = 32 x 16 x 4 = 2048-element block, which is larger than necessary (m0 x c0 = 512).
Ideally, I want the shared memory loading to happen within the loop over c1, loading only a block of size m0 x c0 = 512 elements.
Is there a way to adjust the Halide schedule to achieve this shared memory usage? I think .compute_at(product, c1) is necessary at some point, but I don't know how to bring these shared memory loads inside c1 while meeting my requirements. I feel I'm almost there, or is this the kind of loop nest Halide was never designed to express?