tlc-pack / relax


[DISCUSS] Multi-backend Dispatching in Relax #46

Open ZihengJiang opened 2 years ago

ZihengJiang commented 2 years ago

Motivation

To run a deep learning model, the mainstream approach in TVM is to do code generation with an auto-tuner, which makes the performance optimization process easier for developers without domain knowledge. But the auto-tuner approach also comes with some limitations:

For these reasons, an automatic dispatching mechanism for multiple backends (code generation, vendor libraries, manually written kernels) is required in TVM, given that those approaches need to coexist in the short term.

Existing Approaches

There are already some efforts to bridge the gap between auto-tuner and other approaches in TVM:

Those approaches have demonstrated their feasibility on different applications, but lack a global mechanism to dispatch between different backends automatically for different devices. For example, although BOLT allows users to specify which part should be computed by CUTLASS, from a user's perspective it is friendlier to let the framework decide whether to use code generation or CUTLASS.

Design Goals

In summary, there are two design goals: (P1) a way to specify which graph patterns can be converted to which specific backends, and (P2) automatic evaluation and dispatching among those backends.

Here are some examples (target and pattern would be complicated data structures; I use strings here for simplification):
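As a minimal sketch of P1, assuming a hypothetical register_pattern helper backed by a plain dictionary (the real target and pattern objects would be richer, and none of these names are an existing TVM API):

# Hypothetical P1 registration sketch; register_pattern and the string
# patterns are illustrative stand-ins, not an existing TVM interface.
PATTERN_TABLE = {}

def register_pattern(backend, target, pattern):
    # group registered patterns by (backend, target) pair
    PATTERN_TABLE.setdefault((backend, target), []).append(pattern)

register_pattern("cutlass", "cuda", "matmul -> gelu")
register_pattern("tensorrt", "cuda", "conv2d -> bias_add -> relu")
register_pattern("dnnl", "llvm", "conv2d -> relu")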

P2: Automatic Evaluation and Dispatching

With P1, we can specify many graph patterns that can be converted to specific backends. The next question is how to choose the best backend for the final model.

P2A: priority configuration for different backends

First, we can allow users to specify the priority for different backends. For example:

priority_config = {
  "cutlass": 3,
  "tensorrt": 2,
}

The framework will first use the high-priority backend (if it is enabled on the target hardware) to replace patterns in the model, then try the lower-priority backends. This is also useful when we want to forcefully lower some patterns to an accelerator.
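As a rough illustration, the dispatch loop could be sketched as below; dispatch_by_priority, rewrites, and enabled are hypothetical names, not an existing API:

# Hypothetical P2A sketch: try backends from highest to lowest priority;
# each rewrite pass replaces the patterns it supports with calls into
# that backend, and untouched parts fall through to the next backend.
def dispatch_by_priority(mod, target, priority_config, rewrites, enabled):
    for backend in sorted(priority_config, key=priority_config.get, reverse=True):
        if backend in enabled.get(target, set()):  # enabled on this hardware?
            mod = rewrites[backend](mod)
    return mod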

P2B: evaluation for same priority backends

If there are multiple backends that hold the same priority, and some pattern can be converted to both, the framework will evaluate the generated packed functions, then choose the better one according to their measured performance.
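A minimal sketch of this tie-breaking step, assuming each candidate has already been built into something runnable and run_once is a user-provided measurement helper:

import time

# Hypothetical P2B sketch: benchmark each same-priority candidate and
# keep the fastest; candidates maps backend name -> runnable function.
def pick_fastest(candidates, run_once):
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        start = time.perf_counter()
        run_once(fn)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name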

Benefits from Relax's New Design

The proposal can benefit from features brought by Relax's new design: the unified abstraction for cross-layer optimizations, and incremental compilation.

These two new features make it possible to lower some parts of the graph to a specific backend (represented by a call_packed function) while we can still lower the other parts with a different backend like AutoTIR.

Blocker

cc @Laurawly @xqdan @comaniac @junrushao1994 @tqchen

junrushao commented 2 years ago

CC: @sunggg

comaniac commented 2 years ago

We adopted an approach similar to P1 at Meta and do have an approach similar to P2 on our roadmap, although we haven't dived into the detailed plan for P2 due to the lack of urgent requirements or obvious performance gains.

One challenge for the patterns in P1 is variants. For example, we have a case where TVM GeLU can only run in FP32 due to the accuracy issue of erf. As a result, the IR under AMP may look like:

%1 = matmul(...); // output FP16
%2 = cast(%1, "float32");
%3 = gelu(%2); // output FP32

The above IR cannot match the pattern specified for CUTLASS that only considers matmul+gelu, even though CUTLASS GeLU does support FP16. As a result, when writing the CUTLASS pattern, the developer has to consider such cases and make the corresponding implementations, which could be tedious and easy to overlook.

Other than that, the overall design looks good to me. As I mentioned, we have a very similar design already implemented at Meta.

ZihengJiang commented 2 years ago

@comaniac Thanks for the comment!

For the AMP case, I think it might be addressable by enhancing the pattern matcher to support more general patterns, like conv + an arbitrary number of element-wise operators, just like a regex.
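As a sketch (not a committed design), Relay's existing dataflow pattern language already offers a dominator pattern that can express "conv2d followed by any chain of element-wise ops", which is close to the regex-like matching described above; the AMP cast case would still need dedicated handling:

from tvm.relay.dataflow_pattern import dominates, is_op, wildcard

K_ELEMWISE = 0  # the kElemWise op-pattern level

# conv2d dominating a chain of element-wise ops: "conv (elemwise)*"
is_conv = is_op("nn.conv2d")(wildcard(), wildcard())
elemwise_path = (wildcard().has_attr({"TOpPattern": K_ELEMWISE}))(wildcard())
pattern = dominates(is_conv, elemwise_path, wildcard())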

Glad to hear that many parties are adopting the same approach. I also just learnt that @sunggg has explored a similar approach before. I do think it is a good opportunity to converge on a global solution in relax with everyone's efforts together.

sunggg commented 2 years ago

Hi, @ZihengJiang. I'm also glad to see that we are looking in the same direction. My collaborators and I developed an automated solution to handle such problems and observed decent improvements compared to conventional approaches. It would be great if we could investigate this direction with relax together!

YuchenJin commented 2 years ago

Thanks @ZihengJiang! We proposed a minimum build pipeline design in https://github.com/tlc-pack/relax/issues/49. This minimum build could enable flexible and customized compilation pipelines, and help explore ideas such as the one you suggested in P2B.

For example, users can write feedback loops in Python to explore search-based customized optimization, which requires measuring the execution cost of candidates during the exploration. The custom rewrite pass is outside of the minimum build, so users can write a Python script as below:


# mod: input relax program (computational graph)
# SomeCustomRewritePass and measure_perf are user-provided placeholders
def search_based_opt(mod, target, search_budget=10):
    best_candidate, best_time = None, float("inf")
    for _ in range(search_budget):
        # produce a new candidate with a custom rewrite pass
        new_mod = relax.transform.SomeCustomRewritePass()(mod)

        # build the candidate and measure its execution cost
        rt_mod = tvm.build(new_mod, target)
        vm = relax.VirtualMachine(rt_mod, tvm.cpu())
        execution_time = measure_perf(vm.run())
        if execution_time < best_time:
            best_candidate, best_time = new_mod, execution_time
    return best_candidate

xqdan commented 2 years ago

Thanks @ZihengJiang! The proposal is helpful for us; with it, it's possible to do heterogeneous graph planning and execution across CPU, GPU, and NPU. BTW, some input on our NPU requirements: we need to do buffer fusion for very large subgraphs as much as possible, which means our NPU needs more flexibility in compiling the whole graph. The design here provides a fine-grained heterogeneous solution; do we plan to have a coarser-grained one?

ZihengJiang commented 2 years ago

Hi @sunggg and @YuchenJin

Thanks for the comments. I had not come up with a concrete proposal for the automatic searching part before; here is a proposal for that part after thinking more based on your comments.

The general question can be expressed as:

Given the constraints (the patterns supported by different backends, considering priority), how can we find the optimal graph partition strategy to achieve the best performance for the whole graph?

There are two design points worth considering in terms of the compilation infrastructure:

The pseudo code looks like the for-loop proposed by Yuchen, but one unclear decision is whether we should put passes like TensorRTRewrite and CUTLASSRewrite in the normal build pipeline.

ZihengJiang commented 2 years ago

Some thoughts:

the fcodegen might not need to be restricted to returning a packed function. We can change it to return a CallNode instead, so that we can use call_dps to call packed functions while remaining general enough to represent more graph transformation patterns.
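To make the idea concrete, a hedged sketch under the assumption of a CUTLASS-style backend; compile_with_cutlass is a hypothetical helper, and the exact relax binding of call_dps may differ from what is shown here:

# Hypothetical sketch: fcodegen returns a CallNode rather than a bare
# packed function, so the rewrite stays an Expr => Expr transformation.
def fcodegen(subgraph, target):
    # compile the matched subgraph with the backend (hypothetical helper)
    func_name = compile_with_cutlass(subgraph, target)
    # wrap the compiled kernel in a destination-passing-style call node
    return relax.call_dps(subgraph.ret_shape, relax.ExternFunc(func_name),
                          subgraph.args)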

tqchen commented 2 years ago

I agree that overall it is useful to have a pattern language infrastructure that matches and enumerates the matched possibilities.

The particular codegen signature (e.g. fcodegen) could use a bit more thinking. E.g. ideally we want to preserve the IRModule=>IRModule property to make it composable (as a rewrite of Expr=>Expr), but still have proper follow-up benchmark and feedback information available to guide such a decision.

My guess is that we will end up wanting to enable a collection of different strategies. So focusing on the foundational infrastructure (pattern matching and enumeration, automated benchmarking of subgraph decisions) would be a solid step towards that, which can then be used to quickly prototype the strategies we talked about a bit.

electriclilies commented 2 years ago

Hey Ziheng, thanks for putting up this proposal! I have a few questions:

  1. Have you considered ways to make this more modular/extensible? It seems to me that we will end up with lots and lots of registrations, and I imagine that many of those registrations could be quite similar. For example, similar targets may have the same set of patterns. As we add more and more backends, I could see us ending up in a situation where we have hundreds of these registrations which are all slightly different and require lots of maintenance.

  2. It is cool that the fcodegen function can do tuning itself. However, for the average user adding a new pattern, manually adding that tuning infrastructure to every fcodegen function would be super repetitive. What tuning infrastructure will you provide to make taking advantage of tuning easier?

  3. Are you just going through the backends in the priority config in order and matching those patterns? Do you have plans to make the selection of patterns smarter in the future?

  4. Will you be using the Relay pattern matcher to identify patterns? The Relay pattern matcher can’t do backtracking, so if you do want to use search to identify what to partition in the future, you will either have to extend the pattern matcher or implement a different matching algorithm. @mbs-octoml, @mikepapadim, @sunggg and I discussed the feasibility of extending the pattern matcher to do backtracking for the purpose of doing operator fusion, and we concluded that it would probably be easier (and more maintainable) to write a standalone algorithm, because we'd have to over-specialize the pattern matcher to the behavior of the operator fuser. (Happy to expand on this more if you want, but this response is already getting a bit long.)

Also, just curious, could you explain a bit about the motivation for doing this in Relax instead of in main? To me, your proposal seems not very different in method from what we have in main today (though the implementation is different), so I'm wondering if we could make the change in main. Additionally, I'm sure our collaborators at ARM would like to be involved in the discussion of this design, and that can better happen in main (my understanding is that they also have an interest in operator fusion). Thanks!

ZihengJiang commented 2 years ago

Hey @electriclilies . Thanks for the comment!

For Q1, I would expect that the supported patterns are limited, as we see in the contrib module (https://github.com/apache/tvm/tree/main/src/runtime/contrib) and the external codegen module (https://github.com/apache/tvm/tree/main/src/relay/backend/contrib) in current TVM. We can organize them in separate folders and reduce repetitive code with shared functions.

For Q2, the need for tuning comes up in some backends in real deployment. For example, in the CUTLASS BYOC, some parameters of the CUTLASS kernels are also tunable, and they usually have their own internal tuning function in the framework. We can also consider making our tuner more general so that it can be reused by other projects easily.

For Q3, yes, we definitely want to make this part automatic and smart. You can refer to the discussion above that @sunggg and @YuchenJin have raised. But our first goal is to build the underlying infra as a playground where we can experiment with different search algorithms easily in the future.

For Q4, yes, I will reuse Relay's pattern matcher. And I would love to discuss this with you more when we have some progress!

I prefer to do this on the relax development branch since: 1. we can utilize new features in relax's design, like the unified abstraction for cross-layer optimizations, incremental compilation, and support for dynamic shape; 2. on the other hand, this work can also inspire improvements to relax's design, which is easier to update than upstream.