Thank you, @masahi, for bringing CUTLASS! Before taking a deeper look, would you elaborate on what you meant here?
> But I realized that using MergeCompositeFunctions for CUTLASS is a bad idea for several reasons.

Ideally, it would be nice to have a unified interface across the BYOC backends, so I'm wondering if this is an issue we can fix.
There are two reasons:
1. By using `MergeCompositeFunctions`, we are requiring each backend to do memory management of intermediate tensors. For example, if we merge two `conv2d -> relu` subgraphs, the signature of an extern function might look like

```c++
void func(DLTensor* data,
          DLTensor* weight1,
          DLTensor* weight2,
          DLTensor* out0)
```

and the CUTLASS BYOC needs to allocate a tensor to hold the result of the first conv2d, which also becomes the input to the second conv2d. Currently the CUTLASS BYOC doesn't allocate any memory (in most cases), so we would have to generate code that does that (and since it is a C-source based codegen, it would be very ugly...). Of course, if we naively allocated every intermediate tensor, it would be highly inefficient. If we don't use `MergeCompositeFunctions`, we can rely on the memory planner on the TVM side to do the same thing.
2. If we use `MergeCompositeFunctions` for CUTLASS, all kernels are generated into one big source file. This would likely prevent parallel compilation by NVCC, so the compile time would become a lot worse. We could probably work around this problem with some hack, but again, if we don't use `MergeCompositeFunctions`, this is not an issue.
Also, when I locally tried to run the test, I faced the following error:
@sunggg Please use the copy of CUTLASS under `3rdparty` (no need to install it manually). If you set `USE_CUTLASS`, it should be automatically detected and used.
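For a quick sanity check, something like the sketch below can confirm that the codegen was compiled in. The registry name `relax.ext.cutlass` is an assumption based on the usual BYOC naming convention, not taken from this PR.

```python
import tvm

# Sketch: probe the global function registry for the CUTLASS codegen entry
# point. The name "relax.ext.cutlass" follows the usual BYOC convention and
# is an assumption here.
codegen = tvm.get_global_func("relax.ext.cutlass", allow_missing=True)
print("CUTLASS BYOC available:", codegen is not None)
```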
CUTLASS is closer to a library, in which case merging consecutive regions is less useful. Ideally we can rewrite into something like:
```python
@tvm.script.ir_module
class Module:
    @R.function
    def main(x, w0, w1):
        v0 = R.call_dps_packed("cutlass_gemm_relu", x, w0, R.Tensor((1024, 1024), "float32"))
        v1 = R.call_dps_packed("cutlass_gemm_relu", v0, w1, R.Tensor((1024, 1024), "float32"))
```

```c++
// c source module
void _GEMM(NDArray A, NDArray B, NDArray C) {
  ...
}
TVM_DLL_EXPORT_TYPED_FUNC(cutlass_gemm_relu, _GEMM);
```
where `cutlass_gemm_relu` is a library function exposed by the generated code.

This is also how Relax BYOC might go one step further compared to the existing Relay BYOC. I think this is what is being produced as part of `run_codegen`, where the source module is attached to the final `runtime.Module`.
To keep things modular, we should also include the ability to attach a `.o` file to `runtime.Module`. So the BYOC attachment would do the following things:
@tqchen Yes, we are doing everything that you mentioned, in exactly the same way.
Any thoughts on the issue discussed in https://github.com/tlc-pack/relax/pull/380#discussion_r1092786444?
Before merging, I want to investigate the implications of potential API breakage introduced by CUTLASS version 3. The current implementation is based on v2.11 and aims to maximize code sharing with the Relay BYOC.
According to https://github.com/NVIDIA/cutlass/blob/master/media/docs/cutlass_3x_backwards_compatibility.md, the old API would continue to work. I'll try v3 with this PR today.
> Before merging, I want to investigate the implications of potential API breakage introduced by CUTLASS version 3.
Fortunately, to be compatible with v3, we only have to enable C++17 when compiling kernels during tuning. Relay tests also continue to work (modulo known accuracy issues).
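For illustration, the flag can be passed through TVM's NVCC helper as in the sketch below; the actual tuning pipeline adds it internally, and the source file name here is hypothetical.

```python
from tvm.contrib import nvcc

# Sketch: compile a generated kernel with C++17, as required by CUTLASS 3.x.
# The file name "generated_cutlass_kernel.cu" is hypothetical.
with open("generated_cutlass_kernel.cu") as f:
    kernel_source = f.read()

ptx = nvcc.compile_cuda(kernel_source, target_format="ptx", options=["-std=c++17"])
```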
Updated the cutlass submodule hash to https://github.com/NVIDIA/cutlass/commit/add4ba622f1cdebc145d1df0e9620c3c84c00a52, the latest commit as of today.
A part of https://github.com/tlc-pack/relax/issues/364
I did some refactoring on the Relay BYOC side to share as many codegen utilities as possible between the Relay and Relax CUTLASS backends.
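At a high level, the flow looks roughly like the sketch below, assuming the new `FuseOpsByPattern` option described further down; the option name `annotate_codegen`, the pattern entry, and the build call are illustrative assumptions, not verbatim from this PR.

```python
import tvm
from tvm import relax

# Sketch of the offload pipeline (names are assumptions for illustration;
# `mod` is an existing relax IRModule and `conv2d_relu_pattern` a
# hypothetical dataflow pattern):
# 1. Partition matched regions into composite functions and post-process
#    them into the form the backend expects.
mod = relax.transform.FuseOpsByPattern(
    [("cutlass.conv2d_relu", conv2d_relu_pattern)],
    annotate_codegen=True,  # assumed name of the new option
)(mod)
# 2. Invoke the CUTLASS codegen on the annotated functions and attach the
#    resulting source module to the IRModule.
mod = relax.transform.RunCodegen()(mod)
# 3. Build; the final runtime.Module bundles the compiled CUTLASS kernels.
ex = relax.build(mod, target="cuda")
```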
See the test case for the whole flow. Previously I wanted to keep the BYOC story in Relax as simple as possible by always requiring `MergeCompositeFunctions` to run, even if a backend does not benefit from receiving larger subgraphs. But I realized that using `MergeCompositeFunctions` for CUTLASS is a bad idea for several reasons. Since the output of `FuseOpsByPattern` alone is not quite ready for offloading, I added a new option to `FuseOpsByPattern` that allows an extra post-processing step on the created composite functions. This step basically turns one form into the other, as sketched below.
This is a trivial transformation, but a backend expects the latter form in order to compile composite functions. We could make this step a standalone pass to be run after `FuseOpsByPattern` in case a backend doesn't want to run `MergeCompositeFunctions`, but I felt that would be overkill, so I updated `FuseOpsByPattern` instead.

cc @sunggg @psrivas2 @mbaret @gigiblender @mikepapadim