mlc-ai/relax

[Bug] [Unity] [Metascheduler] cannot tune relax linear/matmul with M > 1 for cuda #227

Status: Open. elvin-n opened this issue 1 year ago

elvin-n commented 1 year ago

Unable to tune a linear/matmul op when the M value is greater than 1.

The error message differs from the one on the Unity branch, which is why I am submitting this bug here: the changes in mlc-ai/relax affected this use case as well, so it seems it should be fixed here too, not only in Unity.

Reproduction script:

import tvm
from tvm import meta_schedule as ms
from tvm.relay.backend import Executor
from tvm import relax
from tvm.relax.testing import nn

# -------- Func definition
class Linear(nn.Module):
    def __init__(self, in_features, out_features, dtype: str, bias=False):
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(
            (out_features, in_features), dtype=dtype, name="linear_weight"
        )
        if bias:
            self.bias = nn.Parameter((out_features,), dtype=dtype, name="linear_bias")
        else:
            self.bias = None

    def forward(self, input: relax.Expr) -> relax.Var:
        return nn.emit(relax.op.linear(input, self.weight, self.bias))

bb = relax.BlockBuilder()
seq_len = 4
with bb.function("func1"):
    model = Linear(2048, 768, "float16")
    input = nn.Placeholder((seq_len, 2048), dtype="float16", name="input")
    with bb.dataflow():
        res = model(input)
        params = [
            input,
        ] + model.parameters()
        gv = bb.emit_output((res,))
    bb.emit_func_output(gv, params)

mod = bb.get()
gv = mod.get_global_var("func1")
bb.update_func(gv, mod[gv].with_attr("func1", 1))

mod = relax.pipeline.get_pipeline()(mod)
mod = relax.transform.LiftTransformParams()(mod)

mod = tvm.tir.transform.ForceNarrowIndexToInt32()(mod)

# ------ Metascheduler starts here
database = None

strategy_name = "evolutionary"
name = f"relax_linear_{seq_len}_2048_2048_768"
work_dir = f"./{name}/"
module_equality_name = "ignore-ndarray"

target = tvm.target.Target("nvidia/geforce-rtx-2060", host="llvm")
executor = Executor("graph")
mod = mod.with_attr("executor", executor)
ndk_builder = ms.builder.LocalBuilder(timeout_sec=60)
evaluator_config = ms.runner.EvaluatorConfig(
    number=3,
    repeat=1,
    min_repeat_ms=100,
    enable_cpu_cache_flush=False,
)
ms_rpc_runner = ms.runner.LocalRunner(
    evaluator_config=evaluator_config,
    alloc_repeat=1,
)
ms.relax_integration.tune_relax(
    mod=mod,
    target=target,
    params={},
    work_dir=work_dir,
    max_trials_global=1024,
    strategy=strategy_name,
    builder=ndk_builder,
    runner=ms_rpc_runner,
    module_equality=module_equality_name,
)
junrushao commented 1 year ago

Hey, thanks for reporting! Would you mind elaborating on what the M value is? Is it possible that it's because of the mixed usage of i32 and i64?

elvin-n commented 1 year ago

"Would you mind elaborating on what the M value is?"

M is the input sequence length, for example. In the case of dense, it is the batch size.
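
For reference, a minimal sketch of the shapes involved in the reproduction script above, assuming relax.op.linear computes input @ weight.T for a (out_features, in_features) weight, which matches how linear_weight is declared there:

import numpy as np

# M = seq_len in the repro; K = in_features, N = out_features.
M, K, N = 4, 2048, 768
x = np.zeros((M, K), dtype="float16")   # the nn.Placeholder "input"
w = np.zeros((N, K), dtype="float16")   # the "linear_weight" parameter
y = x @ w.T                             # what the linear op computes (bias is None here)
assert y.shape == (M, N)                # the failure appears as soon as M > 1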

"Is it possible that it's because of the mixed usage of i32 and i64?"

Where does it happen? If you are referring to the ForceNarrowIndexToInt32 transformation, removing that transformation invocation does not affect the behaviour.
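
For illustration, a minimal sketch of how the lowered module could be inspected for mixed i32/i64 index dtypes, assuming the mod built by the reproduction script above (this snippet is not part of the original report):

import tvm

# Print every PrimFunc in the module; the shape/index dtypes
# (int32 vs int64) are visible in the printed TVMScript.
for gvar, func in mod.functions.items():
    if isinstance(func, tvm.tir.PrimFunc):
        print(gvar.name_hint)
        print(func.script())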

elvin-n commented 1 year ago

I found that tuning starts to work if I set seq_len = 32, as opposed to Unity, where tuning starts to work if I set this parameter to 16.

elvin-n commented 1 year ago

One more fact: the Metascheduler worked for M == 32 with commit c0e455773792e0f551f3ad279ea39886fcc525ce, but with the latest commit 5b8db51e5eb8a56d9e59f427c23b057119f59df0 it cannot tune for any size of M.
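
A hedged sketch of how the two commits could be compared against the same reproduction, assuming the script above is saved as repro_linear_tune.py (an assumed file name) and that TVM is rebuilt after each checkout:

import subprocess

commits = [
    "c0e455773792e0f551f3ad279ea39886fcc525ce",  # reported to tune for M == 32
    "5b8db51e5eb8a56d9e59f427c23b057119f59df0",  # reported to fail for any M
]
for commit in commits:
    subprocess.run(["git", "checkout", commit], check=True)
    # NOTE: TVM would need to be rebuilt here before running the script.
    subprocess.run(["python", "repro_linear_tune.py"], check=False)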