In GEMV, the vector is reused heavily. If the vector is small enough, I would like to pin it in shared memory and share it across warps. However, it seems that `tl.load` cannot do this. Are there any other tricks?
I don't think we support `tl.load` into shared memory. Shared memory is currently managed by compiler passes rather than exposed directly to the user. You can pass an eviction policy to `tl.load` (e.g. `eviction_policy="evict_last"`) to try to keep the vector resident in cache.
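For illustration, here is a minimal sketch of what that could look like. It assumes a row-major float32 matrix whose last dimension is contiguous; the names `gemv_kernel`, `gemv`, `BLOCK_M`, and `BLOCK_N` are made up for this example. The eviction policy is only a cache hint, so it does not guarantee that the vector stays resident.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def gemv_kernel(a_ptr, x_ptr, y_ptr, M, N, stride_am,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Each program computes BLOCK_M rows of y = A @ x.
    pid = tl.program_id(0)
    rows = pid * BLOCK_M + tl.arange(0, BLOCK_M)
    row_mask = rows < M
    acc = tl.zeros((BLOCK_M,), dtype=tl.float32)
    for n0 in range(0, N, BLOCK_N):
        cols = n0 + tl.arange(0, BLOCK_N)
        col_mask = cols < N
        # x is re-read by every program, so hint the cache to keep it around.
        x = tl.load(x_ptr + cols, mask=col_mask, other=0.0,
                    eviction_policy="evict_last")
        # Each tile of A is read only once, so mark it as first to evict.
        a = tl.load(a_ptr + rows[:, None] * stride_am + cols[None, :],
                    mask=row_mask[:, None] & col_mask[None, :], other=0.0,
                    eviction_policy="evict_first")
        acc += tl.sum(a * x[None, :], axis=1)
    tl.store(y_ptr + rows, acc, mask=row_mask)


def gemv(a, x, BLOCK_M=32, BLOCK_N=128):
    # Hypothetical host wrapper; assumes a.stride(1) == 1 and float32 inputs.
    M, N = a.shape
    y = torch.empty(M, device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, BLOCK_M),)
    gemv_kernel[grid](a, x, y, M, N, a.stride(0),
                      BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
    return y
```

Calling `gemv(a, x)` with CUDA tensors should match `a @ x` up to floating-point rounding. Whether the hint actually helps depends on the GPU and the sizes of `a` and `x`, so it is worth benchmarking with and without it.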