tlc-pack / tvm-tensorir


[ACTION] RFC About TensorIR in the upstream #107

Closed tqchen closed 4 years ago

tqchen commented 4 years ago

As we discussed last week, it might be helpful to start an RFC about the basic design of TensorIR in the community, describing the overall project. This is also a good chance to provide insights/code examples about some transformation primitives.

Of course, this also means that the design of TensorIR itself will become public, and we will need to be prepared to write a clear paper and upstream the code in a timely fashion.

tqchen commented 4 years ago

Let us draft an RFC and post it here this week.

cc @Hzfengsy @spectrometerHBH @junrushao1994

Hzfengsy commented 4 years ago

Background and Motivation

TVM is an end-to-end deep learning compiler with two levels of IR and optimization. TVM translates models from popular DL frameworks into Relay and optimizes the computation graph, after which it lowers each graph node into Tensor Expression (TE), performs another round of function-level optimization, and finally lowers it into TIR and generates backend code. In brief, the current workflow is

TF/PyTorch/ONNX -> Relay -> TE (based on TE schedule) -> TIR -> C++/CUDA

Currently, low-level optimization is done through TE scheduling, which has several limitations.

Introduction

TensorIR is a brand new low-level IR with full scheduling support. Here are some key features and novel techniques.

Core data structure: Block and BlockRealize

To generalize the high-dimensional tensor expression, we introduce a new statement structure, Block. A block can wrap any part of the IR to provide isolation. A block is the minimal unit for scheduling and tensorization; it stores all the fundamental information of a computation, including the block iteration vars, the regions the block reads and writes, the buffers allocated inside the block, and the critical body statement. A block declares the computation with its iteration vars and their types.
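
For intuition, here is a minimal block written in the same hybrid syntax as the workload example below (A and B are assumed to be 128x128 buffers declared with tir.match_buffer; the reads/writes comments show information the block carries, derived from its body):

with tir.block([128, 128], "B") as [vi, vj]:
    # block iteration vars: vi, vj (both spatial, extent 128)
    # reads : A[vi, vj]
    # writes: B[vi, vj]
    B[vi, vj] = A[vi, vj] * 2.0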

Workload Example

  1. First of all, we define the workload (GEMM in this example) with hybrid script. Note that the hybrid script directly generates a TIR program rather than TE stages. (We can auto-complete the loop nesting from the block definition by default.)

    @tvm.hybrid.script
    def matmul(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
        C = tir.match_buffer(c, (1024, 1024), "float32")
        A = tir.match_buffer(a, (1024, 1024), "float32")
        B = tir.match_buffer(b, (1024, 1024), "float32")
        reducer = tir.comm_reducer(lambda x, y: x + y, tir.float32(0))

        with tir.block([1024, 1024, tir.reduce_axis(0, 1024)], "C") as [vi, vj, vk]:
            reducer.step(C[vi, vj], A[vi, vk] * B[vk, vj])
  2. Then create a schedule from the TIR function and tile the loops using primitives (bn is the tile size; it is 32 in the result below).

    s = tir.create_schedule(matmul)
    update = s.get_block("C")
    i, j, k = s.get_axes(update)
    i_o, i_i = s.split(i, bn)
    j_o, j_i = s.split(j, bn)
    k_o, k_i = s.split(k, 4)
    s.reorder(i_o, j_o, k_o, k_i, i_i, j_i)

    Result

    def func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
        # function attr dict
        B = tir.match_buffer(b, [1024, 1024])
        A = tir.match_buffer(a, [1024, 1024])
        C = tir.match_buffer(c, [1024, 1024])
        reducer = tir.comm_reducer(lambda x, y: x + y, tir.float32(0))
        # body
        for i0_outer, i1_outer, i2_outer, i2_inner, i0_inner, i1_inner in tir.grid(32, 32, 256, 4, 32, 32):
            with tir.block([1024, 1024, tir.reduce_axis(0, 1024)], "C") as [vi, vj, vk]:
                tir.bind(vi, ((i0_outer*32) + i0_inner))
                tir.bind(vj, ((i1_outer*32) + i1_inner))
                tir.bind(vk, ((i2_outer*4) + i2_inner))
                reducer.step(C[vi, vj], (A[vi, vk]*B[vk, vj]))
  3. Vectorize and decompose reduction

    s.vectorize(j_i)
    s.decompose_reduction(update, j_o)

    Result

    @tvm.hybrid.script
    def func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
        # function attr dict
        B = tir.match_buffer(b, [1024, 1024])
        A = tir.match_buffer(a, [1024, 1024])
        C = tir.match_buffer(c, [1024, 1024])
        # body
        for i0_outer in range(0, 32):
            for i1_outer_init, i0_inner_init in tir.grid(32, 32):
                for i1_inner_init in range(0, 32, annotation = {"loop_type":"vectorize"}):
                    with tir.block([1024, 1024], "C_init") as [vi_init, vj_init]:
                        tir.bind(vi_init, ((i0_outer*32) + i0_inner_init))
                        tir.bind(vj_init, ((i1_outer_init*32) + i1_inner_init))
                        C[vi_init, vj_init] = tir.float32(0)
            for i1_outer, i2_outer, i2_inner, i0_inner in tir.grid(32, 256, 4, 32):
                for i1_inner in range(0, 32, annotation = {"loop_type":"vectorize"}):
                    with tir.block([1024, 1024, tir.reduce_axis(0, 1024)], "C_update") as [vi, vj, vk]:
                        tir.bind(vi, ((i0_outer*32) + i0_inner))
                        tir.bind(vj, ((i1_outer*32) + i1_inner))
                        tir.bind(vk, ((i2_outer*4) + i2_inner))
                        C[vi, vj] = (C[vi, vj] + (A[vi, vk]*B[vk, vj]))
  4. Finally, build the function and run it.

    build_func = tvm.build(s.func, target=target)
    build_func(a, b, c)
    tvm.testing.assert_allclose(c.asnumpy(), np.matmul(a.asnumpy(), b.asnumpy()), rtol=1e-5)
    evaluator = build_func.time_evaluator(build_func.entry_name, ctx, number=1)

In this example, we are imperatively changing the IR rather than waiting until the end, as in TE. It is very important for users and developers to see directly what happens during scheduling. Also, at every stage of the schedule we can get a verifiable IR, which is the major improvement for both user experience and correctness proofs.
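
As a concrete illustration of this point, one can inspect the IR after every single primitive; the printer name tvm.hybrid.ashybrid below is a placeholder for whatever routine round-trips TIR back to hybrid script:

s = tir.create_schedule(matmul)
update = s.get_block("C")
i, j, k = s.get_axes(update)
i_o, i_i = s.split(i, 32)
print(tvm.hybrid.ashybrid(s.func))  # valid, verifiable TIR right after one split
j_o, j_i = s.split(j, 32)
print(tvm.hybrid.ashybrid(s.func))  # and again after the next primitive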

Key Features

Independent scheduling based on IR itself

Different from the TE schedule, TensorIR has a complete set of scheduling algorithms that needs neither a schedule tree nor any extra data structure. We will introduce a brand-new set of schedule primitives with full backward compatibility with the TE schedule. This simplifies the compilation workflow and its concepts:

TF/PyTorch/ONNX -> Relay -> TIR -> schedule -> TIR -> schedule -> TIR -> C++/CUDA

Now, there are no stages during scheduling. Rather than lowering a schedule into TIR, we directly mutate the TIR itself. This also enables sequential scheduling (scheduling a single workload several times).
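
A sketch of sequential scheduling with the prototype API used in the example above: because the output of a schedule is ordinary TIR, a second schedule can be created directly from the already-scheduled function.

s1 = tir.create_schedule(matmul)
blk = s1.get_block("C")
i, j, k = s1.get_axes(blk)
i_o, i_i = s1.split(i, 32)         # first round of scheduling

s2 = tir.create_schedule(s1.func)  # schedule the result of the first schedule
blk = s2.get_block("C")            # and keep transforming the same workload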

Stronger Expressiveness and Optimization Ability

TE has limited expressiveness since each stage is defined by Stage = te.compute(lambda expr), while TensorIR is a full C++-like IR. We can write any program we want in TensorIR. Although not every program can be scheduled, many more workloads can be optimized with TensorIR.

One of the improved tasks is concatenation:

TE

B = te.compute((30,), lambda i: tvm.tir.if_then_else(i < 10, A0[i], tvm.tir.if_then_else(i < 20, A1[i - 10], A2[i - 20])))

TensorIR:

with tir.block([10]) as [vi]:
    B[vi] = A0[vi]
with tir.block([10]) as [vi]:
    B[vi + 10] = A1[vi]
with tir.block([10]) as [vi]:
    B[vi + 20] = A2[vi]

The critical improvement is performance: in TIR we optimize the program by eliminating the if branches, which is impossible in the TE schedule.
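
For illustration, after default loop generation the three blocks could lower to three plain loop nests, with no conditional left in any loop body (a sketch, not exact printer output):

for i in range(0, 10):
    B[i] = A0[i]
for i in range(0, 10):
    B[i + 10] = A1[i]
for i in range(0, 10):
    B[i + 20] = A2[i]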

Memory and Execution Scope

Hardware accelerators, led by GPUs, increasingly use hierarchical architectures, including a memory hierarchy (global, shared, local/wmma on NVIDIA GPUs) and an execution hierarchy (SM, warp, thread on NVIDIA GPUs). TVM defines the memory hierarchy, and TensorIR provides corresponding native hierarchical blocks and execution scopes. Different execution scopes can access different memory scopes.

TensorIR natively supports hierarchy checks. We check memory accesses and thread bindings, including warp-level instruction (wmma) validation, during the schedule. The following is an example of the GPU hierarchy.

for bx in range(0, 32, annotation = {"loop_type":"blockIdx.x"}):
    with block(exec_scope="gpu_block"):
        for ty in range(0, 32, annotation = {"loop_type":"threadIdx.y"}):
            with block(exec_scope="gpu_warp"):
                for ty in range(0, 32, annotation = {"loop_type":"threadIdx.x"}):
                    with block(exec_scope="gpu_thread"):
                        A[i] = B[i] + 1

High-dimension Scheduling and Tensorization

As more and more backends provide tensor operators and instructions, the limitations of the TE schedule become apparent: it is hard to tensorize a complex stage. TensorIR chooses a block (a wrapper of high-dimensional computation) as the minimum schedule unit, so we can natively tensorize a sub-program and even keep scheduling after tensorization.
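
As a sketch on a fresh schedule of the matmul workload above: the tensorize primitive and the wmma_intrin intrinsic description below are illustrative names, not a fixed API.

s = tir.create_schedule(matmul)
blk = s.get_block("C")
i, j, k = s.get_axes(blk)
# split out a 16x16x16 tile and map it onto a tensor intrinsic
i_o, i_i = s.split(i, 16)
j_o, j_i = s.split(j, 16)
k_o, k_i = s.split(k, 16)
s.reorder(i_o, j_o, k_o, i_i, j_i, k_i)
s.tensorize(i_i, wmma_intrin)  # hypothetical intrinsic description object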

Decouple Primitives and Correctness Proof

Since every primitive directly rewrites the AST (it works like a StmtMutator), primitives are easily decoupled from each other. Developers can add a schedule primitive as easily as adding a pass. It is also easy to ensure that the original and scheduled programs behave the same, by proving the correctness of each primitive and checking the intermediate IR during the schedule.
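
A toy model (plain Python, not TVM code) of this idea: a primitive is a standalone rewrite of the statement tree plus its own validity check, so primitives compose without sharing global schedule state.

from dataclasses import dataclass
from typing import List

@dataclass
class Loop:                      # toy stand-in for a TIR loop node
    var: str
    extent: int
    body: List["Loop"]

def split(loop: Loop, factor: int) -> Loop:
    # per-primitive correctness check, independent of other primitives
    assert loop.extent % factor == 0, "split factor must divide the extent"
    # rewrite just this sub-tree; nothing else in the IR is touched
    return Loop(loop.var + "_o", loop.extent // factor,
                [Loop(loop.var + "_i", factor, loop.body)])

print(split(Loop("i", 1024, []), 32))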

Round-trip Python Hybrid Syntax (Already upstream)

Hybrid script enables developers to write TensorIR in Python syntax, one of the most popular languages. It also provides a way to store the IR after or during a schedule. Please see the details at https://discuss.tvm.apache.org/t/rfc-hybrid-script-support-for-tir/7516
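
A sketch of the round trip (the printer name tvm.hybrid.ashybrid is an assumption; the parsing direction is simply the @tvm.hybrid.script decorator already shown above):

script = tvm.hybrid.ashybrid(s.func)         # hypothetical printer name
with open("matmul_scheduled.py", "w") as f:  # store the IR at this point of the schedule
    f.write(script)
# Re-importing the saved file runs @tvm.hybrid.script again, which parses
# the text back into the same TIR function.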

Migration Plan

  1. Upstream the TensorIR data structure.
  2. Upstream the TensorIR schedule primitives.
  3. Support AutoTVM/Ansor on TensorIR.

tqchen commented 4 years ago

Let us change Halide => TE, since there are also primitives that Halide does not support.

Hzfengsy commented 4 years ago

This is a draft of the RFC. I'm not sure how much detail we should include in the RFC. Do we need to talk about the design and data structures?

tqchen commented 4 years ago

@Hzfengsy The current draft is a bit too abstract, and readers without the background may still be confused. We do not need to talk about the design and data structures. For a first RFC it is more important to talk about "what it is" (from the users' perspective) and why.

I would recommend instead adding a few minimal examples (of scheduling on TIR, with the input and output) to show what we mean by each of the points.

Also, it might be useful to label all the key points, so others can refer to them by label.

Hzfengsy commented 4 years ago

I have updated the RFC, please take another look @tqchen @spectrometerHBH @junrushao1994

spectrometerHBH commented 4 years ago

Overall LGTM. One point I would like to mention: maybe we should stress why tensorize and auto-tensorize lead to the design of TIR, and how TIR helps solve the tensorize/auto-tensorize problem. The remaining benefits are more like byproducts of this effort.

tqchen commented 4 years ago

Let us only mention the migration plan up to enabling Ansor on TensorIR.

I do think the example could be improved further. In particular, it would be great to see an example that lets users see the overall flow of IR-to-IR transformation.

Hzfengsy commented 4 years ago

Both Bohan and I have updated the RFC. cc @tqchen @junrushao1994

tqchen commented 4 years ago

It would be good to print the result from an intermediate step, e.g. what happens right after you run a single split, to make sure people understand that the IR changes gradually.

To make things simpler, we can even use a vector add example.

It is important to show that we are imperatively changing the IR, rather than waiting until the end to change it.

tqchen commented 4 years ago

Buffer bind needs to change to match_buffer.

tqchen commented 4 years ago

I think we can post after the above example changes (show the IR at each step of the transformation, and consider starting with a vector add example).

junrushao commented 4 years ago

Right now in the motivation section, our logic is “TE is not great at XXX”, which IMHO could be more explicit via this kind of statement:

junrushao commented 4 years ago

Hmmm sorry I was late....