[TL] Adapt TL Hardware-aware Search Space with Roller

We currently provide a way for users to define a search space for a given TL kernel, but it is still hard and complex to define a precise and efficient search space that covers dynamic shapes for a given operator and backend. For example, a hand-written search space for sm80 looks like this:

    def get_configs_sm80(self):
        num_stages = 2
        configs = [
            {
                'block_M': 128,
                'block_N': 256,
                'block_K': 32,
                'threads': 128
            },
            {
                'block_M': 256,
                'block_N': 128,
                'block_K': 32,
                'threads': 128
            },
            {
                'block_M': 128,
                'block_N': 128,
                'block_K': 32,
                'threads': 128
            },
        ]
        configs = [{**c, 'num_stages': num_stages} for c in configs]
        return configs

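This pull request links the TL search space to our Roller search space, so that candidate configurations can come from Roller's hardware-aware search instead of a hand-written list.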
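As a rough illustration of what linking the two search spaces could look like, the sketch below converts Roller-style hints into the TL config dictionaries shown above. The `Hint` dataclass and its fields (`block`, `warp`, `rstep`, `pipeline_stage`) and the helper `hints_to_tl_configs` are hypothetical stand-ins for illustration, not the actual Roller API, and the sketch assumes 32 threads per warp.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


@dataclass
class Hint:
    """Hypothetical stand-in for a Roller hint (not the real Roller class)."""
    block: Tuple[int, int]   # (block_M, block_N) tile computed by one thread block
    warp: Tuple[int, int]    # (warp_M, warp_N) tile computed by one warp
    rstep: int               # reduction step, used here as block_K
    pipeline_stage: int = 2  # software pipeline depth


def hints_to_tl_configs(hints: List[Hint]) -> List[Dict[str, int]]:
    """Translate hints into TL-style config dicts, skipping invalid tilings."""
    configs = []
    for h in hints:
        block_m, block_n = h.block
        warp_m, warp_n = h.warp
        # The warp tile must evenly divide the block tile in both dimensions.
        if block_m % warp_m or block_n % warp_n:
            continue
        num_warps = (block_m // warp_m) * (block_n // warp_n)
        configs.append({
            'block_M': block_m,
            'block_N': block_n,
            'block_K': h.rstep,
            'threads': num_warps * WARP_SIZE,
            'num_stages': h.pipeline_stage,
        })
    return configs
```

Under these assumptions, a hint with a 128x256 block tile, a 64x128 warp tile, and a reduction step of 32 yields the first hand-written sm80 config above (4 warps, so 128 threads).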

However, block-level TL cannot fully utilize the schedule information. For example, TL provides only three dedicated warp scheduling policies:

class GemmWarpPolicy:
    Square = 0
    FullRow = 1
    FullCol = 2
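To make the limitation concrete, the sketch below checks whether a warp grid taken from a schedule hint corresponds to one of the three policies. The helper `closest_policy` is hypothetical and not part of TL. The reading of the policy names is also an assumption made for this example: `Square` is taken to mean equal warp counts along M and N, `FullRow` all warps along M, and `FullCol` all warps along N.

```python
from typing import Optional


class GemmWarpPolicy:
    # Reproduced from the snippet above so this example is self-contained.
    Square = 0
    FullRow = 1
    FullCol = 2


def closest_policy(warps_m: int, warps_n: int) -> Optional[int]:
    """Return the policy matching a (warps_m x warps_n) warp grid, or None.

    Assumed interpretation (not confirmed by the source):
      Square  -> same number of warps along M and N
      FullRow -> all warps laid out along M
      FullCol -> all warps laid out along N
    """
    if warps_m < 1 or warps_n < 1:
        raise ValueError("warp counts must be positive")
    if warps_m == warps_n:
        return GemmWarpPolicy.Square
    if warps_n == 1:
        return GemmWarpPolicy.FullRow
    if warps_m == 1:
        return GemmWarpPolicy.FullCol
    return None  # e.g. a 4x2 grid has no exact match among the three policies
```

Under this reading, a hint that asks for a 4x2 warp grid cannot be expressed exactly, which is the kind of schedule information that is lost at the block level.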

microsoft / BitBLAS

[TL] Adapt TL Hardware-aware Search Space with Roller #207