tlc-pack / relax


[DISCUSS] Multi-backend Dispatching in Relax #46

Open ZihengJiang opened 2 years ago

ZihengJiang commented 2 years ago

Motivation

To run a deep learning model, the mainstream approach in TVM is to do code generation with an auto-tuner, which makes the performance optimization process easier for developers without domain knowledge. But the auto-tuner approach also comes with some limitations:

For these reasons, an automatic dispatching mechanism for multiple backends (code generation, vendor libraries, manually written kernels) is required in TVM, given that those approaches need to coexist in the short term.

Existing Approaches

There are already some efforts to bridge the gap between auto-tuner and other approaches in TVM:

Those approaches have demonstrated their feasibility on different applications, but lack a global mechanism to dispatch between different backends automatically for different devices. For example, although BOLT allows users to specify which part should be computed by CUTLASS, from a user's perspective it is friendlier to let the framework decide whether to use code generation or CUTLASS.

Design Goals

In summary, there are two design goals: (P1) a way to specify which graph patterns can be converted to which specific backends, and (P2) automatic evaluation and dispatching among those backends.

Here are some examples (target and pattern would be complicated data structures; I use strings here for simplification):
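As a minimal sketch of P1, assuming a hypothetical register_pattern helper backed by a plain dictionary (the real target and pattern objects would be richer, and none of these names are an existing TVM API):

# Hypothetical P1 registration sketch; register_pattern and the string
# patterns are illustrative stand-ins, not an existing TVM interface.
PATTERN_TABLE = {}

def register_pattern(backend, target, pattern):
    # group registered patterns by (backend, target) pair
    PATTERN_TABLE.setdefault((backend, target), []).append(pattern)

register_pattern("cutlass", "cuda", "matmul -> gelu")
register_pattern("tensorrt", "cuda", "conv2d -> bias_add -> relu")
register_pattern("dnnl", "llvm", "conv2d -> relu")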

P2: Automatic Evaluation and Dispatching

With P1, we can specify many graph patterns that can be converted to specific backends. The next question is how to choose the best backend for the final model.

P2A: priority configuration for different backends

First, we can allow users to specify the priority for different backends. For example:

priority_config = {
  "cutlass": 3,
  "tensorrt": 2,
}

The framework will first use the high-priority backend (if it is enabled on the target hardware) to replace patterns in the model, then try the lower-priority backends. This is also useful when we want to forcefully lower some patterns to an accelerator.
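As a rough illustration, the dispatch loop could be sketched as below; dispatch_by_priority, rewrites, and enabled are hypothetical names, not an existing API:

# Hypothetical P2A sketch: try backends from highest to lowest priority;
# each rewrite pass replaces the patterns it supports with calls into
# that backend, and untouched parts fall through to the next backend.
def dispatch_by_priority(mod, target, priority_config, rewrites, enabled):
    for backend in sorted(priority_config, key=priority_config.get, reverse=True):
        if backend in enabled.get(target, set()):  # enabled on this hardware?
            mod = rewrites[backend](mod)
    return mod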

P2B: evaluation for same priority backends

If there are multiple backends that hold the same priority, and some pattern can be converted to both, the framework will evaluate the generated packed functions, then choose the better one according to their measured performance.
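A minimal sketch of this tie-breaking step, assuming each candidate has already been built into something runnable and run_once is a user-provided measurement helper:

import time

# Hypothetical P2B sketch: benchmark each same-priority candidate and
# keep the fastest; candidates maps backend name -> runnable function.
def pick_fastest(candidates, run_once):
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        start = time.perf_counter()
        run_once(fn)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name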

Benefits from Relax's New Design

The proposal can benefit from features brought by Relax's new design: the unified abstraction for cross-layer optimizations, and incremental compilation.

These two new features make it possible to lower some parts of the graph to a specific backend (represented by a call_packed function) while we can still lower the other parts with a different backend like AutoTIR.

Blocker

cc @Laurawly @xqdan @comaniac @junrushao1994 @tqchen

junrushao commented 2 years ago

CC: @sunggg

comaniac commented 2 years ago

We adopted an approach similar to P1 at Meta and do have an approach similar to P2 on our roadmap, although we haven't dived into the detailed plan for P2 due to the lack of urgent requirements or obvious performance gains.

One challenge for the patterns in P1 is variants. For example, we have a case where TVM GeLU can only run in FP32 due to the accuracy issue of erf. As a result, the IR under AMP may look like:

%1 = matmul(...); // output FP16
%2 = cast(%1, "float32");
%3 = gelu(%2); // output FP32

The above IR cannot match the pattern specified for CUTLASS that only considers matmul+gelu, even though CUTLASS GeLU does support FP16. As a result, when writing the CUTLASS pattern, the developer has to consider such cases and make the corresponding implementations, which could be tedious and easy to overlook.

Other than that, the overall design looks good to me. As I mentioned, we have a very similar design already implemented at Meta.

ZihengJiang commented 2 years ago

@comaniac Thanks for the comment!

For the AMP case, I think it might be addressable by enhancing the pattern matcher to support more general patterns, like conv + an arbitrary number of element-wise operators, just like a regex.
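As a sketch (not a committed design), Relay's existing dataflow pattern language already offers a dominator pattern that can express "conv2d followed by any chain of element-wise ops", which is close to the regex-like matching described above; the AMP cast case would still need dedicated handling:

from tvm.relay.dataflow_pattern import dominates, is_op, wildcard

K_ELEMWISE = 0  # the kElemWise op-pattern level

# conv2d dominating a chain of element-wise ops: "conv (elemwise)*"
is_conv = is_op("nn.conv2d")(wildcard(), wildcard())
elemwise_path = (wildcard().has_attr({"TOpPattern": K_ELEMWISE}))(wildcard())
pattern = dominates(is_conv, elemwise_path, wildcard())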

Glad to hear that many parties are adopting the same approach. I also just learnt that @sunggg has explored a similar approach before. I do think it is a good opportunity to converge on a global solution in relax with everyone's efforts together.

sunggg commented 2 years ago

Hi, @ZihengJiang. I'm also glad to see that we are looking in the same direction. My collaborators and I developed an automated solution to handle such problems and observed decent improvements compared to conventional approaches. It would be great if we could investigate this direction with relax together!

YuchenJin commented 2 years ago

Thanks @ZihengJiang! We proposed a minimum build pipeline design in https://github.com/tlc-pack/relax/issues/49. This minimum build could enable flexible and customized compilation pipelines, and help explore ideas such as the one you suggested in P2B.

For example, users can write feedback loops in Python to explore search-based customized optimization, which requires measuring the execution cost of candidates during the exploration. The custom rewrite pass is outside of the minimum build, so users can write a Python script as below:


# mod: input relax program (computational graph)
# SomeCustomRewritePass and measure_perf are user-provided placeholders
def search_based_opt(mod, target, search_budget=10):
    best_candidate, best_time = None, float("inf")
    for _ in range(search_budget):
        # produce a new candidate with a custom rewrite pass
        new_mod = relax.transform.SomeCustomRewritePass()(mod)

        # build the candidate and measure its execution cost
        rt_mod = tvm.build(new_mod, target)
        vm = relax.VirtualMachine(rt_mod, tvm.cpu())
        execution_time = measure_perf(vm.run())
        if execution_time < best_time:
            best_candidate, best_time = new_mod, execution_time
    return best_candidate

xqdan commented 2 years ago

Thanks @ZihengJiang! The proposal is helpful for us; with it, it's possible to do heterogeneous graph planning and execution across CPU, GPU, and NPU. BTW, some input on our NPU requirements: we need to do buffer fusion for very large subgraphs as much as possible, which means our NPU needs more flexibility in compiling the whole graph. The design here provides a fine-grained heterogeneous solution; do we plan to have a coarser-grained one?

ZihengJiang commented 2 years ago

Hi @sunggg and @YuchenJin

Thanks for the comments. I had not come up with a concrete proposal for the automatic searching part before; here is a proposal for that part after thinking more based on your comments.

The general question can be expressed as:

Given the constraints (the patterns supported by different backends, considering priority), how can we find the optimal graph partition strategy to achieve the best performance for the whole graph?

There are two design points worth considering in terms of the compilation infrastructure:

The pseudo code looks like the for-loop proposed by Yuchen, but one unclear decision is whether we should put passes like TensorRTRewrite and CUTLASSRewrite in the normal build pipeline.

ZihengJiang commented 2 years ago

Some thoughts:

the fcodegen might not need to be restricted to returning a packed function. We can change it to return a CallNode instead, so that we can use call_dps to call packed functions while remaining general enough to represent more graph transformation patterns.
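To make the idea concrete, a hedged sketch under the assumption of a CUTLASS-style backend; compile_with_cutlass is a hypothetical helper, and the exact relax binding of call_dps may differ from what is shown here:

# Hypothetical sketch: fcodegen returns a CallNode rather than a bare
# packed function, so the rewrite stays an Expr => Expr transformation.
def fcodegen(subgraph, target):
    # compile the matched subgraph with the backend (hypothetical helper)
    func_name = compile_with_cutlass(subgraph, target)
    # wrap the compiled kernel in a destination-passing-style call node
    return relax.call_dps(subgraph.ret_shape, relax.ExternFunc(func_name),
                          subgraph.args)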

tqchen commented 2 years ago

I agree that overall it is useful to have a pattern language infrastructure that matches and enumerates the matched possibilities.

The particular codegen signature (e.g. fcodegen) could use a bit more thinking. E.g. ideally we want to preserve the IRModule=>IRModule property to make it composable (as a rewrite of Expr=>Expr), but still have proper follow-up benchmark and feedback information available to guide such a decision.

My guess is that we will end up wanting to enable a collection of different strategies. So focusing on the foundational infrastructure (pattern matching and enumeration, automated benchmarking of subgraph decisions) would be a solid step towards that, which can then be used to quickly prototype the strategies we talked about a bit.

electriclilies commented 2 years ago

Hey Ziheng, thanks for putting up this proposal! I have a few questions:

  1. Have you considered ways to make this more modular/extensible? It seems to me that we will end up with lots and lots of registrations, and I imagine that many of those registrations could be quite similar. For example, similar targets may have the same set of patterns. As we add more and more backends, I could see us ending up in a situation where we have hundreds of these registrations which are all slightly different and require lots of maintenance.

  2. It is cool that the fcodegen function can do tuning itself. However, for the average user adding a new pattern, manually adding that tuning infrastructure to every fcodegen function would be super repetitive. What tuning infrastructure will you provide to make taking advantage of tuning easier?

  3. Are you just going through the backends in the priority config in order and matching those patterns? Do you have plans to make the selection of patterns smarter in the future?

  4. Will you be using the Relay pattern matcher to identify patterns? The Relay pattern matcher can’t do backtracking, so if you do want to use search to identify what to partition in the future, you will either have to extend the pattern matcher or implement a different matching algorithm. @mbs-octoml, @mikepapadim, @sunggg and I discussed the feasibility of extending the pattern matcher to do backtracking for the purpose of doing operator fusion, and we concluded that it would probably be easier (and more maintainable) to write a standalone algorithm, because we'd have to over-specialize the pattern matcher to the behavior of the operator fuser. (Happy to expand on this more if you want, but this response is already getting a bit long.)

Also, just curious, could you explain a bit about the motivation for doing this in Relax instead of in main? To me, your proposal seems not very different in method from what we have in main today (though the implementation is different), so I'm wondering if we could make the change in main. Additionally, I'm sure our collaborators at ARM would like to be involved in the discussion of this design, and that can better happen in main (my understanding is that they also have an interest in operator fusion). Thanks!

ZihengJiang commented 2 years ago

Hey @electriclilies . Thanks for the comment!

For Q1, I would expect that the supported patterns are limited, as we see in the contrib module (https://github.com/apache/tvm/tree/main/src/runtime/contrib) and the external codegen module (https://github.com/apache/tvm/tree/main/src/relay/backend/contrib) in current TVM. We can organize them in separate folders and reduce repetitive code with shared functions.

For Q2, the need for tuning comes up in some backends in real deployment. For example, in the CUTLASS BYOC, some parameters of the CUTLASS kernels are also tunable, and they usually have their own internal tuning function in the framework. We can also consider making our tuner more general so that it can be reused by other projects easily.

For Q3, yes, we definitely want to make this part automatic and smart. You can refer to the discussion above that @sunggg and @YuchenJin have raised. But our first goal is to build the underlying infra as a playground where we can experiment with different search algorithms easily in the future.

For Q4, yes, I will reuse Relay's pattern matcher. And I would love to discuss this with you more when we have some progress!

I prefer to do this on the relax development branch since: 1. we can utilize new features in relax's design, like the unified abstraction for cross-layer optimizations, incremental compilation, and support for dynamic shape; 2. on the other hand, this work can also inspire improvements to relax's design, which is easier to update than upstream.