tlc-pack / tvm-tensorir


[TIR][Schedule] affine binding and more #432

Open spectrometerHBH opened 3 years ago

spectrometerHBH commented 3 years ago

Recently @Hzfengsy brought up a question regarding affine binding and related schedule primitives. After a brief discussion, I am putting my thoughts here for further discussion.

Intro case

for i in 16:
    with tir.block([16]) as vi:
        tir.bind(vi, i)
        A[vi] = tir.max(B[vi], B[vi + 1])

The above code is a simplified pooling operator. If we apply the following schedule transformations, we get

# split
# cache_read
for i in 17:
    with tir.block([17]) as vi:
        tir.bind(vi, i)
        B_shared[vi] = B[vi]

for io in 4:
    for ii in 4:
        with tir.block([16]) as vi:
            tir.bind(vi, io * 4 + ii)
            A[vi] = tir.max(B_shared[vi], B_shared[vi + 1])

# compute_at
for io in 4:
    for ii in 5:
        with tir.block([17]) as vi:
            tir.bind(vi, io * 4 + ii)
            B_shared[vi] = B[vi]
    for ii in 4:
        with tir.block([16]) as vi:
            tir.bind(vi, io * 4 + ii)
            A[vi] = tir.max(B_shared[vi], B_shared[vi + 1])
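
For reference, the state above can be reproduced with schedule primitives roughly as follows. This is a sketch against the mainline tvm.tir.Schedule API (this repo's spelling may differ); the block name "pool", the read-buffer index, and mod are assumptions.

import tvm

sch = tvm.tir.Schedule(mod)                    # mod: IRModule containing the pooling PrimFunc (assumed)
block = sch.get_block("pool")                  # the A[vi] = max(B[vi], B[vi + 1]) block
i, = sch.get_loops(block)
io, ii = sch.split(i, factors=[4, 4])          # split
B_shared = sch.cache_read(block, 0, "shared")  # cache_read: read buffer index 0 is B (assumed)
sch.compute_at(B_shared, io)                   # compute_at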

Note that after compute_at the cache read block's binding is no longer affine. But we may still want to

  1. parallelize the outer loop io
  2. tensorize the cache read block

both of which bring problems to the current schedule transformations, e.g. via the calls sketched below.
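
Concretely, the follow-up steps would look something like this (same assumed mainline tvm.tir.Schedule API as above; the tensor intrinsic name is a placeholder, not a registered intrinsic):

sch.parallel(io)                          # 1. parallelize the shared outer loop
cache_ii = sch.get_loops(B_shared)[-1]    # innermost loop of the cache read block
sch.tensorize(cache_ii, "copy_intrin")    # 2. "copy_intrin" is a placeholder name, for illustration only

Both steps hit the non-affine binding of the cache read block: parallel has to reason about two blocks whose bindings overlap across io, and tensorize/blockize needs to divide that binding into outer and inner parts.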

Affine binding and parallelization

A clear motivation for affine binding is shown below

for io in 4:
    for ii in 5:
        with tir.block([17]) as vi:
            tir.bind(vi, io * 4 + ii)
            C[vi] = B[vi]
            C[vi] = C[vi] + 1

The block above is indeed a complete block, but it is incorrect to apply parallel(io). The current parallel algorithm doesn't reject such cases. If we add C to its read buffers, then the complete block check will work, but I still think it is worth discussing whether it is OK to do so.
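
The root cause is that vi = io * 4 + ii with ii in range(5) is surjective but not injective: C[4], C[8] and C[12] are each written by two different io iterations, so parallelizing io races on them (an unlucky interleaving of C[vi] = B[vi] and C[vi] = C[vi] + 1 across two threads can leave C[vi] = B[vi] + 2). A small plain-Python check of the overlap:

from collections import defaultdict

# For every output element, record which io iterations touch it.
producers = defaultdict(set)
for io in range(4):
    for ii in range(5):
        producers[io * 4 + ii].add(io)

shared = {vi: ios for vi, ios in producers.items() if len(ios) > 1}
print(shared)   # {4: {0, 1}, 8: {1, 2}, 12: {2, 3}} -> these elements race if io is parallel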

Affine binding and reordering

for io in 4:
    for ii in 5:
        for k in 10:
            with tir.block([17, tir.reduce(10)]) as vi, vk:
                tir.bind(vi, io * 4 + ii)
                tir.bind(vk, k)
                with tir.init():
                    C[vi] = 0.0
                C[vi] += A[vi, vk]

It's incorrect to reorder k, io, ii.
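
The problem is again the overlap: vi = 4, 8 and 12 are each covered by two (io, ii) pairs. In the original order every duplicate re-initializes C[vi] and redoes the full reduction, so the final value is still correct; with k as the outer loop, the duplicates accumulate into the same element and double-count it. A scalar simulation in plain Python/NumPy (modeling tir.init() as running at k == 0):

import numpy as np

A = np.arange(17 * 10, dtype="float64").reshape(17, 10)

def run(order):
    C = np.zeros(17)
    iters = [(io, ii, k) for io in range(4) for ii in range(5) for k in range(10)]
    if order == "k_outer":
        iters = [(io, ii, k) for k in range(10) for io in range(4) for ii in range(5)]
    for io, ii, k in iters:
        vi = io * 4 + ii
        if k == 0:             # tir.init()
            C[vi] = 0.0
        C[vi] += A[vi, k]      # reduction update
    return C

print(np.nonzero(run("original") != run("k_outer"))[0])   # [ 4  8 12]: the overlapping vi go wrong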

Affine binding and blockization

for io in 4:
    for ii in 5:
        with tir.block([17]) as vi:
            tir.bind(vi, io * 4 + ii)
            B_shared[vi] = B[vi]

Previously, I implemented subspace division for affine bindings to do blockization, since we need to generate reasonable bindings for the resulting outer and inner blocks. If we face the above tensorization need, affine binding doesn't work anymore.

A somewhat ad-hoc fix I can come up with is to generalize subspace division to work under vi = affine(outer loops) + some_inner_loop.
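
For concreteness, the result such a generalized division would have to produce for the cache read block might look roughly like the sketch below. This is hand-written target IR, not what the current blockize emits, and the tir.read/tir.write annotations on the outer block are only sketched.

for io in 4:
    with tir.block([4]) as vio:
        tir.bind(vio, io)
        tir.read(B[vio * 4 : vio * 4 + 5])
        tir.write(B_shared[vio * 4 : vio * 4 + 5])
        for ii in 5:
            with tir.block([17]) as vi:
                tir.bind(vi, vio * 4 + ii)
                B_shared[vi] = B[vi]

The outer block is what tensorize would target, and the inner binding keeps exactly the vi = affine(outer loops) + inner_loop shape that the division would need to recognize.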

Would be great to hear your opinions.

cc @tqchen @Hzfengsy @junrushao1994 @MasterJH5574 @jinhongyii @yzh119

tqchen commented 3 years ago

I agree that clarifying part of the iterator space as affine might be something we want to think a bit more about.

Hzfengsy commented 3 years ago

Some additional thoughts on correctness.

Affine binding alone is not enough for correctness

# Cannot parallel i: iteration i writes both A[vi] and A[vi + 1],
# so neighboring iterations write the same element.
for i in tir.grid(8):
    with tir.block([8]) as [vi]:
        tir.bind(vi, i)
        tir.write(A[vi : vi + 2])
        A[vi] = vi
        A[vi + 1] = vi

# Cannot reorder i, j: iteration (i, j) reads B[vi - 1, vj + 1],
# which is written by iteration (i - 1, j + 1).
for i, j in tir.grid(8, 8):
    with tir.block([8, 8]) as [vi, vj]:
        tir.bind(vi, i)
        tir.bind(vj, j)
        A[vi, vj] = B[vi - 1, vj + 1]
        B[vi, vj] = 1
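
A quick sanity check of the second case in plain Python/NumPy, with B padded by one element on each side so the border accesses stay in bounds:

import numpy as np

def run(order):
    A = np.zeros((8, 8))
    B = np.zeros((10, 10))                      # padded B; logical (i, j) lives at (i + 1, j + 1)
    iters = [(i, j) for i in range(8) for j in range(8)]
    if order == "ji":                           # the reordered j-outer, i-inner execution
        iters = [(i, j) for j in range(8) for i in range(8)]
    for i, j in iters:
        A[i, j] = B[(i - 1) + 1, (j + 1) + 1]   # A[vi, vj] = B[vi - 1, vj + 1]
        B[i + 1, j + 1] = 1                     # B[vi, vj] = 1
    return A

print((run("ij") != run("ji")).any())           # True: the two loop orders disagree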

Block isolation is not strong enough

We designed Block and expect it to provide a level of isolation, but it may not work so well. Here are some cases:

  1. TensorCore intrinsic

    with tir.block([1, 1]):
        wmma.sync(A, B, C)

    TensorCore is expected to be a warp-level operation: it requires 32 threads (threadIdx) working together. However, by looking at the block signature, we cannot know this constraint, and the missing information may affect further scheduling.

This does not only happen with TensorCore. Most opaque intrinsics have their own unique constraints.

  2. Another scalar case

    # Cannot reorder i, k
    for i, k in tir.grid(...):
        with tir.block([8, tir.reduce(8)]) as [vi, vk]:
            tir.bind(vi, i)
            tir.bind(vk, k)
            A[vi] += B[vi, vk]
            C[vi] = A[vi]

Some thoughts

  1. Why do other black-box kernels (e.g., cuBLAS, cuDNN) work well? They cannot be parallelized, and there are many libraries that cannot run in parallel within one process.
  2. Should we take responsibility for all such corner cases? It is always a trade-off between correctness and schedule space, but I don't know where the mid-point is.
  3. What will happen during auto-tuning / auto-scheduling if we cannot prove correctness? Something terrible.
  4. Create a new mode that turns off some or all of the checks, while warning users that it allows dangerous behavior.
  5. Introduce some checks based on iter type, or at least let users know what data_par means so they do not use it without careful thought.

tqchen commented 3 years ago

Great point. One thing we need to keep in mind is our use case. For example, it would be great to have clear block isolation for wmma, because this seems to be the key need for our major use case.

We can, however, tighten up some of the corner cases as long as we keep the things we need in the search space.