fpgaminer opened this issue 1 year ago
Yeah, on-chip indexing through shared memory isn't supported yet. It's on the roadmap, but it's a pretty advanced feature so we haven't come up with a specific timeline yet.
Looks like we might see indexing support in the future
Hello, I would like to confirm: when is this feature expected to be supported?
bumping this
Advanced tensor indexing feature wanted!
When you divide the indices offs_k[:, None] // 8, you actually end up with interleaved indices. Loading is pretty fast with this approach on some devices, such as Ada GPUs like the 4090 / A6000 Ada, but I noticed loading is pretty slow on the A100 / H100.
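For context, the index pattern being discussed looks roughly like this (a fragment from inside a kernel; BLOCK_K, the strides, and offs_n are placeholders, not taken from any specific kernel):

```python
offs_k = tl.arange(0, BLOCK_K)        # 0, 1, 2, 3, 4, 5, 6, 7, 8, ...
packed_k = offs_k[:, None] // 8       # 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
# Every group of 8 K-offsets points at the same packed int32, so the load
# behaves like a repeat_interleave but re-fetches each word 8 times.
b_ptrs = b_ptr + packed_k * stride_bk + offs_n[None, :] * stride_bn
b_packed = tl.load(b_ptrs)
```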
Reading the small chunk and then interleaving it (a "repeat_interleave") with something like this is actually even worse:
b = tl.load(b_ptrs).trans()      # load the packed block, transpose so K is the last dim
b = tl.interleave(b, b)          # each interleave(b, b) doubles the last dimension
b = tl.interleave(b, b)
b = tl.interleave(b, b).trans()  # three doublings = 8x repeat, then transpose back
I reported a similar issue here: https://github.com/triton-lang/triton/issues/4906
I'm working on a Triton kernel to compute matmuls on quantized linear layers. In particular, ones where more than one parameter is packed into a single value of an int32 Tensor.
The issue is that I could not find a way to "unpack" such Tensors in Triton. For example, imagine I have an int32 Tensor of size [1, N//8], where each int32 represents eight 4-bit parameters. Inside a Triton kernel, how do I expand this into a [1, N] Tensor?
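For reference, here is roughly what I mean in plain PyTorch (just an illustration of the packing, assuming eight little-endian 4-bit nibbles per int32; the names are made up):

```python
import torch

# Illustration only: eight 4-bit values packed little-endian into each int32.
packed = torch.randint(0, 2**31 - 1, (1, 16), dtype=torch.int32)   # shape [1, N // 8]
shifts = (torch.arange(8, dtype=torch.int32) * 4).repeat(packed.shape[1])
expanded = packed.repeat_interleave(8, dim=1)                       # shape [1, N]
unpacked = (expanded >> shifts) & 0xF                               # eight 4-bit values per word
```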
Something like PyTorch's repeat_interleave would work, as it would allow one to unroll the packed tensor. From there one can apply shifting and masking to get the correct values unpacked at each index. My current hack is as follows.
This is based on the matmul tutorial code. The major difference is that I divide the b_ptrs indexes by // 8. This causes them to repeat along the K axis, so I'm basically making tl.load act like repeat_interleave for me. Then I can finish unpacking the values like normal. The downside is that, as far as I'm aware, this results in 8x as many loads compared to fetching the packed Tensor directly, which is 8x smaller.
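Roughly, the pointer setup and inner loop look like this (a simplified sketch adapted from the matmul tutorial, not my exact kernel; boundary masks and scales/zero-points are omitted, and the 4-bit little-endian layout along K is an assumption):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_int4_kernel(
    a_ptr, b_ptr, c_ptr,   # A: fp16 [M, K]; B: int32 [K // 8, N], eight 4-bit values per word
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    # The `// 8` makes every group of 8 K-offsets point at the same packed int32,
    # so tl.load acts like a repeat_interleave (at the cost of 8x the loads).
    b_ptrs = b_ptr + (offs_k[:, None] // 8) * stride_bk + offs_n[None, :] * stride_bn
    shifts = (offs_k % 8)[:, None] * 4        # which 4-bit field each row extracts

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)
        b_packed = tl.load(b_ptrs)                           # each int32 fetched 8 times
        b = ((b_packed >> shifts) & 0xF).to(tl.float16)      # shift + mask to unpack
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += (BLOCK_K // 8) * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))
```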
Having a built-in similar to repeat_interleave would allow me to unpack those values in SRAM and save the bandwidth. Or maybe a way to index a Tensor? Then I could build an interleaved index and do b[indexes]. But I didn't see any examples of indexing Tensors like that, so I assumed it wasn't possible in the language.

Does this functionality already exist? Is there a better implementation? Or should this be a feature request?
Thank you!