TC-SpMM currently still relies on the wmma abstractions for tensorization, which are inflexible and consume too much shared memory. We can switch to MMA intrinsics instead, loading non-contiguous global memory directly into warp-level memory and bypassing the fragment abstraction.
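
To make the idea concrete, here is a minimal host-side sketch of the per-lane index math that a register-level MMA path would need. It assumes the `m16n8k16` shape with f16 inputs and f32 accumulators, and follows the fragment layout for that shape as I understand it from the PTX ISA documentation, so it should be checked against the docs before being relied on. `emulate_warp_mma`, `reference_tile`, `row_ids`, and `ldx` are hypothetical names, not part of the existing code base. `float` stands in for `half`, and the actual `mma.sync` instruction is replaced by a plain loop. The point is only that each lane can gather its operand elements from arbitrary rows of the dense matrix without staging a contiguous tile in shared memory first.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Tile shape of mma.sync.aligned.m16n8k16 (f16 inputs, f32 accumulators).
constexpr int M = 16, N = 8, K = 16, WARP = 32;

struct Coord { int row, col; };

// Assumed fragment layouts (PTX ISA, m16n8k16 with .f16 inputs).
// g = lane / 4 ("groupID"), t = lane % 4 ("threadID_in_group").
inline Coord a_coord(int lane, int i) {  // A is M x K, 8 elements per lane
    int g = lane >> 2, t = lane & 3;
    return {g + (((i >> 1) & 1) ? 8 : 0), 2 * t + (i & 1) + (i >= 4 ? 8 : 0)};
}
inline Coord b_coord(int lane, int i) {  // B is K x N, 4 elements per lane
    int g = lane >> 2, t = lane & 3;
    return {2 * t + (i & 1) + (i >= 2 ? 8 : 0), g};
}
inline Coord d_coord(int lane, int i) {  // C/D is M x N, 4 elements per lane
    int g = lane >> 2, t = lane & 3;
    return {g + (i >= 2 ? 8 : 0), 2 * t + (i & 1)};
}

// Registers held by one lane; float stands in for half in this emulation.
struct LaneRegs { float a[8]; float b[4]; float d[4]; };

// Hypothetical helper: one emulated warp computes D = A * X[row_ids, 0:N].
//   A       : condensed M x K tile, row-major
//   X       : dense matrix with leading dimension ldx
//   row_ids : K row indices into X, not necessarily contiguous
std::vector<float> emulate_warp_mma(const float* A, const float* X, int ldx,
                                    const int* row_ids) {
    std::array<LaneRegs, WARP> regs{};

    // Step 1: every lane gathers its own operands straight into registers.
    // The B operand is read through row_ids, so the rows may be scattered.
    for (int lane = 0; lane < WARP; ++lane) {
        for (int i = 0; i < 8; ++i) {
            Coord c = a_coord(lane, i);
            regs[lane].a[i] = A[c.row * K + c.col];
        }
        for (int i = 0; i < 4; ++i) {
            Coord c = b_coord(lane, i);
            regs[lane].b[i] = X[row_ids[c.row] * ldx + c.col];
        }
    }

    // Step 2: stand-in for mma.sync. On hardware the instruction combines
    // registers across the warp; here the tiles are rebuilt and multiplied.
    float At[M][K], Bt[K][N];
    for (int lane = 0; lane < WARP; ++lane) {
        for (int i = 0; i < 8; ++i) {
            Coord c = a_coord(lane, i);
            At[c.row][c.col] = regs[lane].a[i];
        }
        for (int i = 0; i < 4; ++i) {
            Coord c = b_coord(lane, i);
            Bt[c.row][c.col] = regs[lane].b[i];
        }
    }
    for (int lane = 0; lane < WARP; ++lane) {
        for (int i = 0; i < 4; ++i) {
            Coord c = d_coord(lane, i);
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += At[c.row][k] * Bt[k][c.col];
            regs[lane].d[i] = acc;
        }
    }

    // Step 3: every lane writes its accumulators to the output tile.
    std::vector<float> D(M * N, 0.0f);
    for (int lane = 0; lane < WARP; ++lane) {
        for (int i = 0; i < 4; ++i) {
            Coord c = d_coord(lane, i);
            D[c.row * N + c.col] = regs[lane].d[i];
        }
    }
    return D;
}

// Straightforward reference for the same tile product, used for checking.
std::vector<float> reference_tile(const float* A, const float* X, int ldx,
                                  const int* row_ids) {
    std::vector<float> D(M * N, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)
                D[m * N + n] += A[m * K + k] * X[row_ids[k] * ldx + n];
    return D;
}
```

In a real kernel, step 1 would become per-thread global loads packed into 32-bit registers, and step 2 would be a single `mma.sync` issued through inline PTX. Whether this actually reduces shared memory pressure for TC-SpMM would need to be measured.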