Closed tqchen closed 4 years ago
let us draft an RFC and post here this week
cc @Hzfengsy @spectrometerHBH @junrushao1994
TVM is an end-to-end deep learning compiler with two levels of IR and optimization. TVM translates popular DL frameworks into Relay and optimizes the computation graph, after which it lowers each graph node into Tensor Expression(TE) and does another function-level optimization before finally lowering it into TIR and generating backend code. In brief, the current workflow is
TF/PyTorch/ONNX -> Relay -> TE (based on TE schedule) -> TIR -> C++/CUDA
Currently, low-level optimization is done through TE scheduling, which has several limitations:
TensorIR is a brand new low-level IR with full scheduling support. Here are some key features and novel techniques.
To generalize the high-dimension tensor expression, we introduce a new statement structure Block. A block can wrap any part of the IR to provide isolation. A block is the minimal unit for scheduling and tensorization, which stores all the fundamental information of computation including block iteration vars, the region that the block reads and writes, the buffer which allocated inside the block, and the critical body statement. Block declare the expression computation with its iteration vars and types.
First of all, we will define the workload(gemm in the example) with hybrid. Note that the hybrid scipt direct generate a TIR program rather than TE stages. (We can auto-complete the loop nesting with block definition by default)
@tvm.hybrid.script
def matmul(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
C = tir.match_buffer(c, (1024, 1024), "float32")
A = tir.match_buffer(a, (1024, 1024), "float32")
B = tir.match_buffer(b, (1024, 1024), "float32")
reducer = tir.comm_reducer(lambda x, y: x + y, tir.float32(0))
with tir.block([1024, 1024, tir.reduce_axis(0, 1024)], "C") as [vi, vj, vk]:
reducer.step(C[vi, vj], A[vi, vk] * B[vk, vj])
Then create the schedule from TIR and tile the loops using primitives.
s = tir.create_schedule(matmul)
update = s.get_block("C")
i, j, k = s.get_axes(update)
i_o, i_i = s.split(i, bn)
j_o, j_i = s.split(j, bn)
k_o, k_i = s.split(k, 4)
s.reorder(i_o, j_o, k_o, k_i, i_i, j_i)
Result
def func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
# function attr dict
B = tir.match_buffer(b, [1024, 1024])
A = tir.match_buffer(a, [1024, 1024])
C = tir.match_buffer(c, [1024, 1024])
reducer = tir.comm_reducer(lambda x, y: x + y, tir.float32(0))
# body
for i0_outer, i1_outer, i2_outer, i2_inner, i0_inner, i1_inner in tir.grid(32, 32, 256, 4, 32, 32):
with tir.block([1024, 1024, tir.reduce_axis(0, 1024)], "C") as [vi, vj, vk]:
tir.bind(vi, ((i0_outer*32) + i0_inner))
tir.bind(vj, ((i1_outer*32) + i1_inner))
tir.bind(vk, ((i2_outer*4) + i2_inner))
reducer.step(C[vi, vj], (A[vi, vk]*B[vk, vj]))
Vectorize and decompose reduction
s.vectorize(j_i)
s.decompose_reduction(update, j_o)
Result
@tvm.hybrid.script
def func(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
# function attr dict
B = tir.match_buffer(b, [1024, 1024])
A = tir.match_buffer(a, [1024, 1024])
C = tir.match_buffer(c, [1024, 1024])
# body
for i0_outer in, i1_outer_init, i0_inner_init in tir.grid(32, 32, 32):
for i1_inner_init in range(0, 32, annotation = {"loop_type":"vectorize"}):
with tir.block([1024, 1024], "C_init") as [vi_init, vj_init]:
tir.bind(vi_init, ((i0_outer*32) + i0_inner_init))
tir.bind(vj_init, ((i1_outer_init*32) + i1_inner_init))
C[vi_init, vj_init] = tir.float32(0)
for i1_outer, i2_outer, i2_inner, i0_inner in tir.grid(32, 256, 4, 32):
for i1_inner in range(0, 32, annotation = {"loop_type":"vectorize"}):
with tir.block([1024, 1024, tir.reduce_axis(0, 1024)], "C_update") as [vi, vj, vk]:
tir.bind(vi, ((i0_outer*32) + i0_inner))
tir.bind(vj, ((i1_outer*32) + i1_inner))
tir.bind(vk, ((i2_outer*4) + i2_inner))
C[vi, vj] = (C[vi, vj] + (A[vi, vk]*B[vk, vj]))
Print it and run
build_func = tvm.build(s.func, target=target)
build_func(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), np.matmul(a.asnumpy(), b.asnumpy()), rtol=1e-5)
evaluator = build_func.time_evaluator(build_func.entry_name, ctx, number=1)
In this example, we are imperatively changing the IR rather than waiting until the end to change(TE). It is very important for users and developers to directly see what happens during the scheduling. Also, at every stage during the schedule, we can get a verifiable IR, that's the major improvement for both user experience and correctness proof.
Different from the TE schedule, TensorIR has a complete set of schedule algorithms, which does not need a schedule tree or any extra data structure. We will introduce a brand new set of schedule primitives and it has full backward compatibility for the TE schedule. We simplify the compiling workload and conception.
TF/PyTorch/ONNX -> Relay -> TIR -> schedule -> TIR -> scheudle -> TIR -> C++/CUDA
Now, there is no stage during the schedule. Rather than lowering the schedule into TIR, we directly mutate the TIR itself. Also, it enables the sequential schedule (schedule several times for a single workload).
TE has limited expressiveness since each stage is defined by Stage = te.compute(lambda expr)
, while TensorIR is a full c++-like IR. We can write any program with TensorIR as you want. Although not all programs can be scheduled, there are still more workloads that can be optimized by TensorIR.
One of the improved tasks is concatenating:
B = te.compute(i, tvm.tir.if_then_else(i < 10, A0[i], tvm.tir.if_then_else(i < 20, A1[i - 10], A2[i - 20])
with tir.block([10]) as vi:
B[vi] = A0[vi]
with tir.block([10]) as vi:
B[vi + 10] = A1[vi]
with tir.block([10]) as vi:
B[vi + 20] = A2[vi]
The critical improvement is performance. In TIR we optimize the program by deprecating the if
branch, which is impossible in the TE schedule.
Hardware accelerators led by GPUs are increasingly using hierarchical architecture, including memory hierarchy(global, shared, local/wmma in NV-GPU) and execution hierarchy(SM, warp, thread in NV-GPU). TVM defines the memory hierarchy and TensorIR provides corresponding native hierarchical block and execution scope. Different execution scope can access different memory scope.
TensorIR natively supports hierarchy checks. We will check the memory access and thread binding, including warp level instruction(wmma) validation during the schedule. Following is an example of the GPU hierarchy.
for bx in range(0, 32, annotation = {"loop_type":"blockIdx.x"}):
with block(exec_scope="gpu_block"):
for ty in range(0, 32, annotation = {"loop_type":"threadIdx.y"}):
with block(exec_scope="gpu_warp"):
for ty in range(0, 32, annotation = {"loop_type":"threadIdx.x"}):
with block(exec_scope="gpu_thread"):
A[i] = B[i] + 1
With more and more backend provides tensor operator and instruction, the limitation of the TE schedule shows off. It is hard to tensorize a complex stage. TensorIR chooses a block (a wrap of high-dimension computation) as the minimum schedule unit. We can natively tensorize a sub-program and even do schedule after tensorization.
Since every primitive directly rewrite the AST (it works like a StmtMutator
), we easily decouple primitives. Developers can add a schedule primitive as easy as add a pass. Also, it is easy to ensure the program has the same behavior between origin one and scheduled one by proving the correctness of each primitive and checking intermedia IR during the schedule.
Hybrid enables developers to write TensorIR through python syntax, one of the most popular languages. Also, it provides a way to store the IR after or during the schedule. Please see the detail by https://discuss.tvm.apache.org/t/rfc-hybrid-script-support-for-tir/7516
Let us change Halide => TE, since there are also primitives that halide does not support
It is the draft of the RFC. I'm not sure how detail should we write in the RFC. Do we need to talk about the design and data structure?
@Hzfengsy The current draft is a bit too abstract, and as a reader who do not have background they may still be confused. We do not need to talk about the design and data structure. As a first RFC it is more important to talk about "what it is"(from the users' perspective) and why
I would recommend instead add a few minimum examples(of scheduling on TIR and the input output) to show what do we mean by each of the points.
Also it might be useful to label all the key points, so others can refer to them by labels
I have updated the RFC, please take another look @tqchen @spectrometerHBH @junrushao1994
Overall LGTM. One point I would like to mention is that maybe we should stress why tensorize&auto-Tensorize leads to the design of TIR and how TIR helps solve tensorize&auto-Tensorize problem. The rest benefits are more like byproducts of this effort.
Let us only mention migration plan up to the enabling ansor on tensorIR.
I do think the example could be improved further. In particular, it would be great to see an example that:
So users see the overall flow of IR to IR transformation
Both Bohan and I have updated the RFC. cc @tqchen @junrushao1994
Would be good to print the result from a intermediate step. Eg what happens right after you run a single split. To make sure people get that it is a gradual change in IR.
To make things simpler, we can even use a vector add example.
It is important to show that we are imperatively changing the IR. Rather than wait until the end to change.
Buffer bind need to change to match buffer.
I think we can post after the above example changes(show Ir in each step of transformation, consider start with a vector add example)
Right now in the motivation section, our logic is “TE is not great at XXX”, which IMHO could be more explicit via this kind of statement:
Hmmm sorry I was late....
As we discussed last week, it might be helpful to start an RFC about the basic design of TensorIR in the cpmmunity, describing the overall project. This is also a good chance to provide insights/code examples about some transformation primitives.
Of course this would also mean that the TensorIR itself's design will become public, and we will need to be prepared for writing a clear paper and upstream in a timely fashion