tlc-pack / relax

Apache License 2.0

[BYOC] Add CUTLASS backend #380

Closed masahi closed 1 year ago

masahi commented 1 year ago

A part of https://github.com/tlc-pack/relax/issues/364

I did some refactoring on the Relay BYOC side to share as much codegen utilities as possible between Relay and Relax CUTLASS backends.

See the test case for the whole flow. Previously, I wanted to keep the BYOC story in Relax as simple as possible by always requiring MergeCompositeFunctions to run, even if a backend does not benefit from receiving larger subgraphs. But I realized that using MergeCompositeFunctions for CUTLASS is a bad idea for several reasons. Since the output of FuseOpsByPattern alone is not quite ready for offloading, I added a new option to FuseOpsByPattern that enables an extra post-processing step on the created composite functions. This step basically turns

@R.function
def fused_relax_nn_conv2d_relax_nn_relu_dnnl(
    data2: R.Tensor((1, 64, 56, 56), dtype="float32"),
    weight12: R.Tensor((64, 64, 3, 3), dtype="float32"),
) -> R.Tensor((1, 64, 56, 56), dtype="float32"):
    R.func_attr({"Primitive": 1, "Composite": "dnnl.conv2d_relu"})
    with R.dataflow():
        lv: R.Tensor((1, 64, 56, 56), dtype="float32") = R.nn.conv2d(
            data2,
            weight12,
        )
        gv2: R.Tensor((1, 64, 56, 56), dtype="float32") = R.nn.relu(lv)
        R.output(gv2)
    return gv2

into

@R.function
def fused_relax_nn_conv2d_relax_nn_relu_dnnl(
    data1: R.Tensor((1, 64, 56, 56), dtype="float32"),
    weight11: R.Tensor((64, 64, 3, 3), dtype="float32"),
) -> R.Tensor((1, 64, 56, 56), dtype="float32"):
    R.func_attr(
        {"Codegen": "dnnl", "global_symbol": "fused_relax_nn_conv2d_relax_nn_relu_dnnl"}
    )

    @R.function
    def gv1(
        data2: R.Tensor((1, 64, 56, 56), dtype="float32"),
        weight12: R.Tensor((64, 64, 3, 3), dtype="float32"),
    ) -> R.Tensor((1, 64, 56, 56), dtype="float32"):
        R.func_attr({"Primitive": 1, "Composite": "dnnl.conv2d_relu"})
        with R.dataflow():
            lv: R.Tensor((1, 64, 56, 56), dtype="float32") = R.nn.conv2d(
                data2,
                weight12,
            )
            gv2: R.Tensor((1, 64, 56, 56), dtype="float32") = R.nn.relu(lv)
            R.output(gv2)
        return gv2

    gv11: R.Tensor((1, 64, 56, 56), dtype="float32") = gv1(data1, weight11)
    return gv11

This is a trivial transformation, but a backend expects only the latter form when compiling composite functions. We could make this step a standalone pass to be run after FuseOpsByPattern, for backends that don't want to run MergeCompositeFunctions. But that felt like overkill, so I updated FuseOpsByPattern instead.
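The wrapping step above can be sketched independently of TVM. In this illustration, plain Python dictionaries stand in for Relax functions, and all names (`wrap_composite`, the dict keys) are hypothetical; the real pass operates on Relax IR inside FuseOpsByPattern:

```python
def wrap_composite(name, composite_fn):
    """Wrap a "Composite" function in an outer function carrying the
    attributes a BYOC backend expects: "Codegen" (the backend name,
    derived from the pattern prefix) and "global_symbol"."""
    # "dnnl.conv2d_relu" -> backend "dnnl"
    backend = composite_fn["attrs"]["Composite"].split(".")[0]
    return {
        "attrs": {"Codegen": backend, "global_symbol": name},
        # The original composite function becomes a nested inner function;
        # the outer body simply calls it with the outer parameters.
        "body": {"inner_func": composite_fn, "call_with_outer_params": True},
    }

# A stand-in for the composite function produced by FuseOpsByPattern.
fused = {
    "attrs": {"Primitive": 1, "Composite": "dnnl.conv2d_relu"},
    "body": "conv2d -> relu",
}
wrapped = wrap_composite("fused_relax_nn_conv2d_relax_nn_relu_dnnl", fused)
```

After wrapping, the outer function carries `"Codegen": "dnnl"` while the unchanged composite function sits nested inside, mirroring the before/after IR shown above.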

cc @sunggg @psrivas2 @mbaret @gigiblender @mikepapadim

sunggg commented 1 year ago

Thank you, @masahi, for bringing in CUTLASS! Before taking a deeper look, would you elaborate on what you meant here?

But I realized that using MergeCompositeFunctions for CUTLASS is a bad idea for several reasons.

Ideally, it would be nice to have a unified interface across the BYOC backends, so I'm wondering if this is an issue we can fix.

masahi commented 1 year ago

There are two reasons:

masahi commented 1 year ago

Also, when I tried to run the test locally, I faced the following error

@sunggg Please use the copy of cutlass under 3rdparty (no need to install manually). If you set USE_CUTLASS, it should be automatically detected and used.

tqchen commented 1 year ago

CUTLASS is closer to a library, in which case merging consecutive regions is less useful. Ideally, we can rewrite the graph into something like

@tvm.script.ir_module
class Module:
    @R.function
    def main(x, w0, w1):
        v0 = R.call_dps_packed("cutlass_gemm_relu", x, w0, R.Tensor((1024, 1024), "float32"))
        v1 = R.call_dps_packed("cutlass_gemm_relu", v0, w1, R.Tensor((1024, 1024), "float32"))
        return v1

// c source module
void _GEMM(NDArray A, NDArray B, NDArray C) {
   ...
}

TVM_DLL_EXPORT_TYPED_FUNC(cutlass_gemm_relu, _GEMM);

Where cutlass_gemm_relu is a library function as being exposed by the generated code.
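The `call_dps_packed` convention above is destination-passing style (DPS): the callee receives a caller-allocated output buffer and writes into it, rather than allocating and returning a new one. A minimal Python sketch of the contract such a library function follows (pure Python lists stand in for NDArrays, and a naive loop stands in for the CUTLASS kernel; this is an illustration, not the actual generated code):

```python
def gemm_relu(A, B, C):
    """Destination-passing style GEMM + fused ReLU: writes relu(A @ B)
    into the caller-allocated output C instead of returning a result."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    for i in range(n):
        for j in range(m):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = max(acc, 0.0)  # fused ReLU, written in place

A = [[1.0, -2.0], [0.0, 3.0]]
B = [[1.0, 0.0], [0.0, 1.0]]   # identity, so A @ B == A
C = [[0.0, 0.0], [0.0, 0.0]]   # caller allocates the destination
gemm_relu(A, B, C)             # C now holds relu(A @ B)
```

Because the destination is an explicit argument, the compiler can plan all allocations ahead of time and chain such calls (as `main` does with `v0` feeding the second call) without the library needing to know anything about TVM's memory management.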

This is also how Relax BYOC might go one step further compared to the existing Relay BYOC. I think this is what is being produced as part of run_codegen, where the source module is attached to the final runtime.Module.

To make things modular, we also want the ability to attach a .o file to runtime.Module. So the BYOC attachment would do the following things:

masahi commented 1 year ago

@tqchen Yes, we are doing everything that you mentioned, in exactly the same way.

Any thought on the issue discussed in https://github.com/tlc-pack/relax/pull/380#discussion_r1092786444?

masahi commented 1 year ago

Before merging, I want to investigate the implications of potential API breakage introduced by CUTLASS version 3. The current implementation is based on v2.11 and aims to maximize code sharing with the Relay BYOC.

According to https://github.com/NVIDIA/cutlass/blob/master/media/docs/cutlass_3x_backwards_compatibility.md, the old API would continue to work. I'll try v3 with this PR today.

masahi commented 1 year ago

Before merging, I want to investigate implications for potential API breakage introduced by cutlass version 3

Fortunately, to be compatible with v3, we only have to enable C++17 when compiling kernels during tuning. Relay tests also continue to work (modulo known accuracy issues).

Updated the cutlass submodule hash to https://github.com/NVIDIA/cutlass/commit/add4ba622f1cdebc145d1df0e9620c3c84c00a52, the latest commit as of today.