tlc-pack / tvm-tensorir


[DISCUSS] Narrowing with opaque buffer access #358

Open spectrometerHBH opened 3 years ago

spectrometerHBH commented 3 years ago

We start with the pipeline of TensorCore tensorization.

Currently, the TensorIntrin for fill_fragment is

@tvm.script.tir
def wmma_fill_desc(c: ty.handle) -> None:
    C = tir.match_buffer(c, (16, 16), "float32", align=128, offset_factor=256, scope="wmma.accumulator")
    with tir.block([16, 16], "root") as [vi, vj]:
        tir.bind(vi, 0)
        tir.bind(vj, 0)
        for i, j in tir.grid(16, 16):
            with tir.block([16, 16], "init") as [vii, vjj]:
                tir.bind(vii, vi + i)
                tir.bind(vjj, vj + j)
                C[vii, vjj] = tir.float32(0)

@tvm.script.tir
def wmma_fill_impl(c: ty.handle) -> None:
    C = tir.match_buffer(c, (16, 16), "float32", align=128, offset_factor=256, scope="wmma.accumulator")
    with tir.block([16, 16], "root") as [vi, vj]:
        tir.bind(vi, 0)
        tir.bind(vj, 0)
        tir.reads([])
        tir.writes(C[0: 16, 0: 16])
        tir.evaluate(tir.tvm_fill_fragment(C.data, 16, 16, 16, C.elem_offset // 256, tir.float32(0), dtype="handle"))

In fact, this wmma_fill_impl does not work for a non-packed layout. The semantics of wmma_fill_desc are: given a starting position [vi, vj], fill C[vi: vi+16, vj: vj+16] with 0. Hence a semantically equivalent wmma_fill_impl should be

@tvm.script.tir
def wmma_fill_impl(c: ty.handle) -> None:
    C = tir.match_buffer(c, (16, 16), "float32", align=128, offset_factor=256, scope="wmma.accumulator")
    with tir.block([16, 16], "root") as [vi, vj]:
        tir.bind(vi, 0)
        tir.bind(vj, 0)
        tir.reads([])
        tir.writes(C[vi: vi + 16, vj: vj + 16])
        tir.evaluate(tir.tvm_fill_fragment(C.data, 16, 16, 16, vi // 16 * C.shape[-1] // 16 + vj // 16, tir.float32(0), dtype="handle"))

Note that the 5th argument is the index of the warp buffer. In the high-level programming model we operate on 16x16 sub-regions of one complete, large buffer, but in the low-level programming model the compiler cuts that buffer into separate 16x16 warp buffers, so we need an index to designate which piece we are operating on.
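
As a concrete illustration (the buffer shape below is an assumption for illustration, not taken from the intrinsic): if the full accumulator C had shape (32, 64), it would be cut into a 2x4 grid of 16x16 warp buffers, and the expression above picks one of them:

# Sketch only: how the 5th argument selects a 16x16 warp buffer inside a larger C.
# Assumes C has shape (32, 64), i.e. a 2 x 4 grid of 16x16 fragments.
def fragment_index(vi, vj, last_dim=64):
    # fragment row * fragments per row + fragment column
    return vi // 16 * (last_dim // 16) + vj // 16

assert fragment_index(0, 0) == 0     # top-left fragment
assert fragment_index(0, 48) == 3    # top-right fragment
assert fragment_index(16, 32) == 6   # second row, third column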

This causes trouble for narrowing: narrowing changes the shape of C, and therefore requires recalculating the index argument of tir.tvm_fill_fragment.

The problem is that we don't know how to rewrite the expression vi // 16 * C.shape[-1] // 16 + vj // 16. Suppose the starting position of the narrowed buffer C' within C is [i0, j0]. The correct expression after the rewrite should be (vi - i0) // 16 * C'.shape[-1] // 16 + (vj - j0) // 16.
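
Continuing the assumed (32, 64) example from the sketch above, a minimal sketch of the rewrite we would like narrowing to perform (the narrowed shape and origin are also assumptions):

# Sketch: the fragment index must be recomputed against the narrowed buffer.
# Assume narrowing yields C_prime of shape (16, 32) whose origin in C is (i0, j0) = (16, 32).
def index_in_full(vi, vj):
    return vi // 16 * (64 // 16) + vj // 16                  # against C

def index_in_narrowed(vi, vj, i0=16, j0=32):
    return (vi - i0) // 16 * (32 // 16) + (vj - j0) // 16    # against C_prime

# The block that touched fragment 6 of C must now touch fragment 0 of C_prime.
assert index_in_full(16, 32) == 6
assert index_in_narrowed(16, 32) == 0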

I propose two methods for this problem:

M0. Using MatchSubRegion

#130 proposes MatchSubRegion, which aims to deal with the same problem, but it mainly handles opaque accesses of buffer fields such as C.elem_offset. vi and vj are also variables that need to be recalculated according to the new starting point.

We need to give the compiler hints about which parts of the body will be affected by narrowing. We can use new TIR ops such as tir.relative(vi, 0) and tir.relative(vj, 1).
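
A rough sketch of how such a hint could look inside wmma_fill_impl (tir.relative is a proposed op, not an existing TIR intrinsic; its lowering is an assumption of this proposal):

# Hypothetical: mark vi and vj as relative to C's starting point, so narrowing knows
# to rewrite them to (vi - i0) and (vj - j0) when the buffer's origin moves.
tir.evaluate(tir.tvm_fill_fragment(
    C.data, 16, 16, 16,
    tir.relative(vi, 0) // 16 * C.shape[-1] // 16 + tir.relative(vj, 1) // 16,
    tir.float32(0), dtype="handle"))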

M1. Using new tir Ops

This method directly introduces new TIR ops. We can do this in two different ways, both sketched below:

M1.1 Introduce tir.tile_index(vi, vj) to directly represent the whole expression.

M1.2 Similar to M0, we introduce tir.relative(vi, 0), but we use tir.get_shape_dim(C, dim=-1) in place of C.shape.
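
For concreteness, the index expression under each variant might read as follows (all of these op names are proposed, none exist in TIR today):

# M1.1: one proposed op stands for the whole fragment-index expression,
# so narrowing only has to rewrite this single op.
index = tir.tile_index(vi, vj)

# M1.2: proposed fine-grained ops; narrowing rewrites tir.relative to subtract the new
# origin and tir.get_shape_dim to read the narrowed buffer's shape.
index = tir.relative(vi, 0) // 16 * tir.get_shape_dim(C, dim=-1) // 16 + tir.relative(vj, 1) // 16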

cc @tqchen @vinx13 @Hzfengsy @junrushao1994

spectrometerHBH commented 3 years ago

Also, I want to point out another problem here, which is related to the checks needed for tensorization.

In the above TensorCore case, the checks we want to do are:

C0. The shape of the wmma buffer after narrowing is divisible by 16.
C1. The starting position of the wmma operation is divisible by 16.

These two checks ensure that the compiler can successfully break the whole wmma buffer into 16x16 fragments.
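
In plain Python the two checks amount to something like the following (a sketch only; the function and argument names are placeholders, not a TensorIntrin API):

# Sketch: C0 / C1 for a narrowed wmma buffer whose origin in the original buffer is (i0, j0).
def can_tensorize(narrowed_shape, i0, j0):
    shape_ok = all(dim % 16 == 0 for dim in narrowed_shape)   # C0: splits evenly into 16x16 fragments
    origin_ok = i0 % 16 == 0 and j0 % 16 == 0                 # C1: operation starts on a fragment boundary
    return shape_ok and origin_ok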

I have no clear idea how to state these two checks in TensorIntrin, if we want general TensorIntrin support.

tqchen commented 3 years ago

In this particular case, there is a mapping from the 2D index (vi, vj) into a one-dimensional index space. It would be great if our schedule templates did the mapping instead of relying on the tensorizer, since mapping a two-dimensional index into a single dimension is not something the hardware provides; the hardware only exposes a one-dimensional index.

// Use this layout for compact compute that works on tensorization
=> Ccache[floordiv(i,16)][floordiv(j, 16)][i%16][j%16]  => C[i][j] 
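
Read as a plain index mapping, the suggested cache layout is (a sketch; Ccache is the name from the snippet above, not an existing buffer):

# Sketch: each 16x16 fragment of C becomes a contiguous innermost (16, 16) slab of Ccache,
# so the fragment index is just the first two coordinates (i // 16, j // 16).
def to_cache_index(i, j):
    return (i // 16, j // 16, i % 16, j % 16)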

vinx13 commented 3 years ago

An example we discussed today: to use the intrinsic llvm.amdgcn.mfma.f32.16x16x16f16 (A: half4, B: half4, acc: half4) -> half4 (this intrinsic computes a 16x16x16 matmul using 64 threads), each thread needs to load a half4 from A and a half4 from B. We will use tir.store and tir.load to perform the vectorized loads. Since tir.store/tir.load use flattened one-dimensional access, we need to use MatchSubRegion for A and B (A and B are both buffer sub-regions of shape [4,]) so that the correct offset is added during buffer flattening.
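
A rough sketch of the offset concern (the shapes and the sub-region origin below are assumptions for illustration):

# Sketch: why a [4]-shaped sub-region needs MatchSubRegion once accesses are flattened to 1D.
# Assume A has shape (16, 16) and a thread loads the sub-region A[row, col:col+4].
def flat_offset(row, col, a_last_dim=16):
    # flattened 1D offset of the sub-region's first element inside A
    return row * a_last_dim + col

# The vectorized load of 4 halves must start at this offset rather than 0;
# buffer flattening has to add it, which is what MatchSubRegion is meant to track.
assert flat_offset(3, 4) == 52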

junrushao commented 2 years ago

I assume it's done on mainline? @Hzfengsy