I've been working on bringing up BYOC infra in Relax, building on the work of @sunggg and the pattern matcher work from @ganler. The ultimate goal is to make `relax.vm.build(mod, "cuda")` just work without tuning and with reasonable out-of-the-box performance. It would also be the first step toward performant dynamic-shape support.

My branch is here: https://github.com/tlc-pack/relax/compare/relax...masahi:codegen-cutlass?expand=1. It currently has minimal test cases for offloading a simple subgraph to DNNL and CUTLASS. I'm going to start sending pieces from it starting today.

- `RunCodegen` pass to send all BYOC functions to the backend at once (rather than individually)
- Pass similar to `MergeComposite` in Relay
- Add pass to wrap and annotate the partitioned function for offloading (subsumed by https://github.com/tlc-pack/relax/pull/372)
- [x] Add pass to merge neighboring calls to functions compiled for the same external backend into one function (similar to `MergeCompilerRegion` in Relay, necessary for TRT)

Future possibilities (time permitting)
@sunggg @YuchenJin @tqchen @junrushao
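
For anyone unfamiliar with the merging step in the task list, here is a toy sketch of the idea in plain Python. It is not the Relax API and not the code on my branch: the `(name, backend)` tuples, the backend labels, and the `merge_neighboring_calls` helper are all hypothetical. It also assumes a straight-line sequence of calls, whereas a real pass has to work on a dataflow graph and avoid creating cycles when it merges.

```python
from itertools import groupby


def merge_neighboring_calls(calls):
    """Merge runs of adjacent calls that target the same external backend.

    `calls` is a list of (function_name, backend) tuples in execution order.
    A backend of None means the call is not offloaded. This is a hypothetical
    toy model, not a Relax data structure.
    """
    merged = []
    # groupby only groups *consecutive* items with the same key, which is
    # exactly the "neighboring calls" condition in this linear model.
    for backend, run in groupby(calls, key=lambda call: call[1]):
        names = [name for name, _ in run]
        if backend is None:
            # Calls that are not offloaded stay as individual calls.
            merged.extend((name, None) for name in names)
        else:
            # One merged function per run of same-backend neighbors.
            merged.append(("+".join(names), backend))
    return merged


calls = [
    ("conv2d_relu", "tensorrt"),
    ("add", "tensorrt"),
    ("softmax", None),
    ("matmul", "cutlass"),
]
result = merge_neighboring_calls(calls)
```

In this model the two adjacent `tensorrt` calls collapse into one function, so the backend receives a single larger subgraph instead of two small ones. That is the property that matters for a backend like TRT.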