tlc-pack / relax

Apache License 2.0

[BYOC] Add CUTLASS backend #380

Closed masahi closed 1 year ago

masahi commented 1 year ago

A part of https://github.com/tlc-pack/relax/issues/364

I did some refactoring on the Relay BYOC side to share as much codegen utilities as possible between Relay and Relax CUTLASS backends.

See the test case for the whole flow. Previously, I wanted to keep the BYOC story in Relax as simple as possible by always requiring MergeCompositeFunctions to run, even if a backend does not benefit from receiving larger subgraphs. But I realized that using MergeCompositeFunctions for CUTLASS is a bad idea for several reasons. Since the output of FuseOpsByPattern alone is not quite ready for offloading, I added a new option to FuseOpsByPattern that enables an extra post-processing step on the created composite functions. This step basically turns

@R.function
def fused_relax_nn_conv2d_relax_nn_relu_dnnl(
    data2: R.Tensor((1, 64, 56, 56), dtype="float32"),
    weight12: R.Tensor((64, 64, 3, 3), dtype="float32"),
) -> R.Tensor((1, 64, 56, 56), dtype="float32"):
    R.func_attr({"Primitive": 1, "Composite": "dnnl.conv2d_relu"})
    with R.dataflow():
        lv: R.Tensor((1, 64, 56, 56), dtype="float32") = R.nn.conv2d(
            data2,
            weight12,
        )
        gv2: R.Tensor((1, 64, 56, 56), dtype="float32") = R.nn.relu(lv)
        R.output(gv2)
    return gv2

into

@R.function
def fused_relax_nn_conv2d_relax_nn_relu_dnnl(
    data1: R.Tensor((1, 64, 56, 56), dtype="float32"),
    weight11: R.Tensor((64, 64, 3, 3), dtype="float32"),
) -> R.Tensor((1, 64, 56, 56), dtype="float32"):
    R.func_attr(
        {"Codegen": "dnnl", "global_symbol": "fused_relax_nn_conv2d_relax_nn_relu_dnnl"}
    )

    @R.function
    def gv1(
        data2: R.Tensor((1, 64, 56, 56), dtype="float32"),
        weight12: R.Tensor((64, 64, 3, 3), dtype="float32"),
    ) -> R.Tensor((1, 64, 56, 56), dtype="float32"):
        R.func_attr({"Primitive": 1, "Composite": "dnnl.conv2d_relu"})
        with R.dataflow():
            lv: R.Tensor((1, 64, 56, 56), dtype="float32") = R.nn.conv2d(
                data2,
                weight12,
            )
            gv2: R.Tensor((1, 64, 56, 56), dtype="float32") = R.nn.relu(lv)
            R.output(gv2)
        return gv2

    gv11: R.Tensor((1, 64, 56, 56), dtype="float32") = gv1(data1, weight11)
    return gv11

This is a trivial transformation, but a backend expects only the latter form when compiling composite functions. We could make this step a standalone pass to be run after FuseOpsByPattern, for backends that don't want to run MergeCompositeFunctions. But that felt like overkill, so I updated FuseOpsByPattern instead.
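The wrapping step above can be sketched independently of TVM. In this illustration, plain Python dictionaries stand in for Relax functions, and all names (`wrap_composite`, the dict keys) are hypothetical; the real pass operates on Relax IR inside FuseOpsByPattern:

```python
def wrap_composite(name, composite_fn):
    """Wrap a "Composite" function in an outer function carrying the
    attributes a BYOC backend expects: "Codegen" (the backend name,
    derived from the pattern prefix) and "global_symbol"."""
    # "dnnl.conv2d_relu" -> backend "dnnl"
    backend = composite_fn["attrs"]["Composite"].split(".")[0]
    return {
        "attrs": {"Codegen": backend, "global_symbol": name},
        # The original composite function becomes a nested inner function;
        # the outer body simply calls it with the outer parameters.
        "body": {"inner_func": composite_fn, "call_with_outer_params": True},
    }

# A stand-in for the composite function produced by FuseOpsByPattern.
fused = {
    "attrs": {"Primitive": 1, "Composite": "dnnl.conv2d_relu"},
    "body": "conv2d -> relu",
}
wrapped = wrap_composite("fused_relax_nn_conv2d_relax_nn_relu_dnnl", fused)
```

After wrapping, the outer function carries `"Codegen": "dnnl"` while the unchanged composite function sits nested inside, mirroring the before/after IR shown above.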

cc @sunggg @psrivas2 @mbaret @gigiblender @mikepapadim

sunggg commented 1 year ago

Thank you, @masahi, for bringing in CUTLASS! Before taking a deeper look, would you elaborate on what you meant here?

But I realized that using MergeCompositeFunctions for CUTLASS is a bad idea for several reasons.

Ideally, it would be nice to have a unified interface across the BYOC backends, so I'm wondering if this is an issue we can fix.

masahi commented 1 year ago

There are two reasons:

masahi commented 1 year ago

Also, when I tried to run the test locally, I faced the following error

@sunggg Please use the copy of cutlass under 3rdparty (no need to install manually). If you set USE_CUTLASS, it should be automatically detected and used.

tqchen commented 1 year ago

CUTLASS is closer to a library, in which case merging consecutive regions is less useful. Ideally, we can rewrite the graph into something like

@tvm.script.ir_module
class Module:
    @R.function
    def main(x, w0, w1):
        v0 = R.call_dps_packed("cutlass_gemm_relu", x, w0, R.Tensor((1024, 1024), "float32"))
        v1 = R.call_dps_packed("cutlass_gemm_relu", v0, w1, R.Tensor((1024, 1024), "float32"))
        return v1

// c source module
void _GEMM(NDArray A, NDArray B, NDArray C) {
   ...
}

TVM_DLL_EXPORT_TYPED_FUNC(cutlass_gemm_relu, _GEMM);

Where cutlass_gemm_relu is a library function as being exposed by the generated code.
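The `call_dps_packed` convention above is destination-passing style (DPS): the callee receives a caller-allocated output buffer and writes into it, rather than allocating and returning a new one. A minimal Python sketch of the contract such a library function follows (pure Python lists stand in for NDArrays, and a naive loop stands in for the CUTLASS kernel; this is an illustration, not the actual generated code):

```python
def gemm_relu(A, B, C):
    """Destination-passing style GEMM + fused ReLU: writes relu(A @ B)
    into the caller-allocated output C instead of returning a result."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    for i in range(n):
        for j in range(m):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = max(acc, 0.0)  # fused ReLU, written in place

A = [[1.0, -2.0], [0.0, 3.0]]
B = [[1.0, 0.0], [0.0, 1.0]]   # identity, so A @ B == A
C = [[0.0, 0.0], [0.0, 0.0]]   # caller allocates the destination
gemm_relu(A, B, C)             # C now holds relu(A @ B)
```

Because the destination is an explicit argument, the compiler can plan all allocations ahead of time and chain such calls (as `main` does with `v0` feeding the second call) without the library needing to know anything about TVM's memory management.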

This is also how Relax BYOC might go one step further compared to the existing Relay BYOC. I think this is what is being produced as part of run_codegen, where the source module is attached to the final runtime.Module.

To make things modular, we also want the ability to attach a .o file to runtime.Module. So the BYOC attachment would do the following things:

masahi commented 1 year ago

@tqchen Yes, we are doing everything that you mentioned, in exactly the same way.

Any thought on the issue discussed in https://github.com/tlc-pack/relax/pull/380#discussion_r1092786444?

masahi commented 1 year ago

Before merging, I want to investigate the implications of potential API breakage introduced by CUTLASS version 3. The current implementation is based on v2.11 and aims to maximize code sharing with the Relay BYOC.

According to https://github.com/NVIDIA/cutlass/blob/master/media/docs/cutlass_3x_backwards_compatibility.md, the old API would continue to work. I'll try v3 with this PR today.

masahi commented 1 year ago

Before merging, I want to investigate implications for potential API breakage introduced by cutlass version 3

Fortunately, to be compatible with v3, we only have to enable C++17 when compiling kernels during tuning. Relay tests also continue to work (modulo known accuracy issues).

Updated the cutlass submodule hash to https://github.com/NVIDIA/cutlass/commit/add4ba622f1cdebc145d1df0e9620c3c84c00a52, the latest commit as of today.