modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/

[BUG]: Caching issue when using autotune #1282

Closed: jon-chuang closed this issue 3 months ago

jon-chuang commented 9 months ago

Bug description

When re-running a program that uses autotune, the JIT system does not work properly, resulting in the following error:

JIT session error: Symbols not found: [ _runtime_llcl_Runtime___init__kA6A6AkA6A6A6A6AoApA, _runtime_llcl_AsyncTaskGroup_add_task_runtime_llcl_AsyncTaskGroup_builtin_coroutine_Coroutine_NonekA6A6AkA6A6A6A6AoAkA6A6AkA6A6AmAsAkA6A6AkA6A6AcBeBpA, _runtime_llcl_Runtime___exit___runtime_llcl_RuntimekA6A6AkA6A6A6A6AoAkA6A6AkA6A6ApA ]
/__w/modular/modular/Kernels/mojo/builtin/_startup.mojo:76:1: error: no viable expansions found
/__w/modular/modular/Kernels/mojo/builtin/_startup.mojo:76:1: note:   call expansion failed - no concrete specializations
/__w/modular/modular/Kernels/mojo/builtin/_startup.mojo:32:1: note:     no viable expansions found
/__w/modular/modular/Kernels/mojo/builtin/_startup.mojo:45:18: note:       call expansion failed - no concrete specializations
/home/jonch/Desktop/Programming/mlsys/mojo/examples/matmul.mojo:561:1: note:         no viable expansions found
fn main() raises:
^
/home/jonch/Desktop/Programming/mlsys/mojo/examples/matmul.mojo:577:82: note:           call expansion failed - no concrete specializations
    bench[loop_reorder_swizzled_autotune, "Loop reordered (swizzled, autotune):"](python_gflops, numpy_gflops)
                                                                                 ^
/home/jonch/Desktop/Programming/mlsys/mojo/examples/matmul.mojo:490:1: note:             no viable expansions found
fn bench[
^
/home/jonch/Desktop/Programming/mlsys/mojo/examples/matmul.mojo:502:38: note:               call expansion failed - no concrete specializations
    let secs = benchmark.run[test_fn]().mean()
                                     ^
/__w/modular/modular/Kernels/mojo/benchmark/benchmark.mojo:460:1: note:                 no viable expansions found
/__w/modular/modular/Kernels/mojo/benchmark/benchmark.mojo:498:21: note:                   call expansion failed - no concrete specializations
/__w/modular/modular/Kernels/mojo/benchmark/benchmark.mojo:506:1: note:                     no viable expansions found
/__w/modular/modular/Kernels/mojo/benchmark/benchmark.mojo:537:34: note:                       call expansion failed - no concrete specializations
/__w/modular/modular/Kernels/mojo/benchmark/benchmark.mojo:489:5: note:                         no viable expansions found
/__w/modular/modular/Kernels/mojo/benchmark/benchmark.mojo:496:38: note:                           call expansion failed - no concrete specializations
/__w/modular/modular/Kernels/mojo/time/time.mojo:204:1: note:                             no viable expansions found
/__w/modular/modular/Kernels/mojo/time/time.mojo:219:9: note:                               call expansion failed - no concrete specializations
/__w/modular/modular/Kernels/mojo/benchmark/benchmark.mojo:492:9: note:                                 no viable expansions found
/__w/modular/modular/Kernels/mojo/benchmark/benchmark.mojo:494:21: note:                                   call expansion failed - no concrete specializations
/home/jonch/Desktop/Programming/mlsys/mojo/examples/matmul.mojo:499:5: note:                                     no viable expansions found
    fn test_fn():
    ^
/home/jonch/Desktop/Programming/mlsys/mojo/examples/matmul.mojo:500:17: note:                                       call expansion failed - no concrete specializations
        _ = func(C, A, B)
                ^
/home/jonch/Desktop/Programming/mlsys/mojo/examples/matmul.mojo:431:1: note:                                         no viable expansions found
fn loop_reorder_swizzled_autotune(inout C: Matrix, A: Matrix, B: Matrix):
^
/home/jonch/Desktop/Programming/mlsys/mojo/examples/matmul.mojo:437:6: note:                                           call expansion failed - no concrete specializations
    ]()
     ^
/__w/modular/modular/Kernels/mojo/autotune/autotuning.mojo:101:1: note:                                             no viable expansions found
/__w/modular/modular/Kernels/mojo/autotune/autotuning.mojo:128:6: note:                                               Failed to materialize symbols: { (evaluateSpecializations, { _matmul_loop_reorder_swizzled_autotune_impl_matmul_Matrix_matmul_Matrix_matmul_Matrix__concretekA6A6AoAkA6A6AmAsAkA6A6AsAkA6A6ApA }) }
mojo: error: failed to run the pass manager

Getting it to work again is fairly non-deterministic.

Steps to reproduce

Run the matmul.mojo example with autotune.
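
A minimal sketch of the reproduction, assuming the stock matmul.mojo from the repo's examples directory (the exact invocation is inferred from the comments further down in this thread):

    # Assumed invocation: the first run after clearing the caches triggers autotune and crashes;
    # a second run appears to hit the JIT cache and completes.
    mojo examples/matmul.mojo
    mojo examples/matmul.mojo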

System information

- What OS did you install Mojo on? Ubuntu 22.04
- Provide version information for Mojo by pasting the output of `mojo -v` mojo 0.5.0 (6e50a738)
- Provide Modular CLI version by pasting the output of `modular -v` modular 0.2.1 (5144fffe)
Mogball commented 9 months ago

@jon-chuang do you have code that we can run to reproduce this issue?

jon-chuang commented 9 months ago

@Mogball I can't provide an exact repro right now, but here is a fairly lengthy repro that also crashes and may be related:

import benchmark
from memory import memset_zero, stack_allocation
from random import rand
from algorithm import vectorize, parallelize, vectorize_unroll
from algorithm import Static2DTileUnitFunc as Tile2DFunc
from python import Python
from tensor import Tensor
from utils.index import Index
from memory.buffer import NDBuffer
from autotune import autotune, search
from time import now

# NOTE: imports added for names used below (DTypePointer, Pointer, simdwidthof, unroll);
# the module paths assume the Mojo 0.5-era standard library.
from memory.unsafe import DTypePointer, Pointer
from sys.info import simdwidthof
from algorithm import unroll

alias M = 512
alias N = 512
alias K = 4096
alias type = DType.float32

# Mojo has SIMD vector types, we can vectorize the Matmul code as follows.
alias nelts = simdwidthof[type]()  # The SIMD vector width.

struct Matrix:
    var data: DTypePointer[type]
    var rows: Int
    var cols: Int

    # Initialize, zeroing all values
    fn __init__(inout self, rows: Int, cols: Int):
        self.data = DTypePointer[type].alloc(rows * cols)
        memset_zero(self.data, rows * cols)
        self.rows = rows
        self.cols = cols

    # Initialize taking a pointer, don't set any elements
    fn __init__(
        inout self, rows: Int, cols: Int, data: DTypePointer[DType.float32]
    ):
        self.data = data
        self.rows = rows
        self.cols = cols

    # Initialize with random values
    @staticmethod
    fn rand(rows: Int, cols: Int) -> Self:
        let data = DTypePointer[type].alloc(rows * cols)
        rand(data, rows * cols)
        return Self(rows, cols, data)

    fn __getitem__(self, y: Int, x: Int) -> Float32:
        return self.load[1](y, x)

    fn __setitem__(self, y: Int, x: Int, val: Float32):
        return self.store[1](y, x, val)

    fn load[nelts: Int](self, y: Int, x: Int) -> SIMD[DType.float32, nelts]:
        return self.data.simd_load[nelts](y * self.cols + x)

    fn store[nelts: Int](self, y: Int, x: Int, val: SIMD[DType.float32, nelts]):
        return self.data.simd_store[nelts](y * self.cols + x, val)

# Perform 2D tiling on the iteration space defined by end_x and end_y, parallelizing 
# over x and y, and iterating in an order that has better L3 cache locality
fn tile_parallel_swizzled[
    tiled_fn: Tile2DFunc, tile_x: Int, tile_y: Int
](end_x: Int, end_y: Int):
    # Note: this assumes that ends are multiples of the tiles.
    alias tile_outer = 8
    alias group_size = tile_outer * 4

    # L3 cache swizzling
    @parameter
    fn row(swizzled: Int):
        let group_id = swizzled // group_size
        let group_offset_x = (group_id * tile_outer) % (N // tile_y)
        let yo = (swizzled % group_size) // tile_outer
        let xo = group_offset_x + (swizzled % tile_outer)
        let y = tile_y * yo
        let x = tile_x * xo
        tiled_fn[tile_x, tile_y](x, y)

    parallelize[row]((end_y // tile_y * end_x // tile_x), M * 2)

# Reorder the loops, and do not fully unroll the loop over the reduction dimension
# Also utilize tile swizzling for better L3 cache locality.
fn loop_reorder_swizzled(inout C: Matrix, A: Matrix, B: Matrix):
    alias tile_k = 8
    alias tile_k_unroll = 8
    @parameter
    fn calc_tile[tile_j: Int, tile_i: Int](jo: Int, io: Int):
        # Allocate the tile of accumulators on the stack.
        var accumulators = Matrix(
            tile_i, tile_j, stack_allocation[tile_i * tile_j, DType.float32]()
        )

        for ko in range(0, A.cols, tile_k * tile_k_unroll):
            @parameter
            fn calc_tile_row[](i: Int):
                for i in range(0, tile_k):
                    @parameter
                    fn calc_tile_inner[k: Int]():
                        @parameter
                        fn calc_tile_cols[nelts: Int](j: Int):
                            accumulators.store[nelts](
                                i,
                                j,
                                accumulators.load[nelts](i, j)
                                + A[io + i, ko + k] * B.load[nelts](ko + k, jo + j),
                            )

                        vectorize_unroll[nelts, tile_j // nelts, calc_tile_cols](tile_j)

                    unroll[tile_k_unroll, calc_tile_inner]()

            for i in range(0, tile_i):
                calc_tile_row(i)

        # Copy the local tile to the output
        for i in range(tile_i):
            for j in range(tile_j):
                C[io + i, jo + j] = accumulators[i, j]

    alias tile_i = 32
    alias tile_j = nelts * 4
    tile_parallel_swizzled[calc_tile, tile_j, tile_i](C.cols, C.rows)

fn matmul_evaluator(funcs: Pointer[matmul_fn_sig_type], size: Int) -> Int:
    print("matmul_evaluator, number of candidates: ", size)

    let eval_begin: Int = now()

    # This size is picked at random, in real code we could use a real size
    # distribution here.
    # let M = 512
    # let N = 512
    # let K = 512
    print("Optimizing for size:", M, "x", N, "x", K)

    var best_idx: Int = -1
    var best_time: Int = -1

    alias eval_iterations = 10
    alias eval_samples = 10

    var C = Matrix(M, N)
    var A = Matrix(M, K)
    var B = Matrix(K, N)
    let Cptr = Pointer[Matrix].address_of(C).address
    let Aptr = Pointer[Matrix].address_of(A).address
    let Bptr = Pointer[Matrix].address_of(B).address

    # Find the function that's the fastest on the size we're optimizing for
    for f_idx in range(size):
        let func = funcs.load(f_idx)

        @always_inline
        @parameter
        fn wrapper():
            func(C, A, B)
        let cur_time = benchmark.run[wrapper](1, 100_000, 5, 10).mean["ns"]().to_int()

        if best_idx < 0:
            best_idx = f_idx
            best_time = cur_time
        if best_time > cur_time:
            best_idx = f_idx
            best_time = cur_time

    let eval_end: Int = now()
    # Prevent matrices from being destroyed before we finished benchmarking them.
    A.data.free()
    B.data.free()
    C.data.free()
    print("Time spent in matmul_evaluator, ms:", (eval_end - eval_begin) // 1000000)
    print("Best candidate idx:", best_idx)
    return best_idx

@adaptive
fn autotuned_impl(inout C: Matrix, A: Matrix, B: Matrix):
    alias tile_k = autotune(8)
    alias tile_k_unroll = autotune(8)
    @parameter
    fn calc_tile[tile_j: Int, tile_i: Int](jo: Int, io: Int):
        # Allocate the tile of accumulators on the stack.
        var accumulators = Matrix(
            tile_i, tile_j, stack_allocation[tile_i * tile_j, DType.float32]()
        )

        for ko in range(0, A.cols, tile_k * tile_k_unroll):
            @parameter
            fn calc_tile_row[](i: Int):
                for i in range(0, tile_k):
                    @parameter
                    fn calc_tile_inner[k: Int]():
                        @parameter
                        fn calc_tile_cols[nelts: Int](j: Int):
                            accumulators.store[nelts](
                                i,
                                j,
                                accumulators.load[nelts](i, j)
                                + A[io + i, ko + k] * B.load[nelts](ko + k, jo + j),
                            )

                        vectorize_unroll[nelts, tile_j // nelts, calc_tile_cols](tile_j)

                    unroll[tile_k_unroll, calc_tile_inner]()

            for i in range(0, tile_i):
                calc_tile_row(i)

        # Copy the local tile to the output
        for i in range(tile_i):
            for j in range(tile_j):
                C[io + i, jo + j] = accumulators[i, j]

    alias tile_i = autotune(32)
    alias tile_j = nelts * 4
    tile_parallel_swizzled[calc_tile, tile_j, tile_i](C.cols, C.rows)

alias matmul_fn_sig_type = fn(inout C: Matrix, A: Matrix, B: Matrix) -> None

fn autotuned(inout C: Matrix, A: Matrix, B: Matrix):
    alias best_impl: matmul_fn_sig_type
    search[
        matmul_fn_sig_type,
        VariadicList(autotuned_impl.__adaptive_set),
        matmul_evaluator -> best_impl
    ]()
    # Run the best candidate
    return best_impl(C, A, B)

@always_inline
fn bench[
    func: fn (inout Matrix, Matrix, Matrix) -> None, name: StringLiteral
]() raises:
    var A = Matrix.rand(M, K)
    var B = Matrix.rand(K, N)
    var C = Matrix(M, N)

    @always_inline
    @parameter
    fn test_fn():
        _ = func(C, A, B)

    let secs = benchmark.run[test_fn](10).mean()
    # Prevent the matrices from being freed before the benchmark run
    A.data.free()
    B.data.free()
    C.data.free()
    let gflops = ((2 * M * N * K) / secs) / 1e9

    let py = Python.import_module("builtins")
    _ = py.print(
        py.str("{:<13}{:>8.3f} GFLOPS").format(
            name, gflops
        )
    )

fn main() raises:
    bench[loop_reorder_swizzled, "Loop reordered (swizzled):"]()
    bench[autotuned, "Autotuned:"]()
jon-chuang commented 9 months ago

The interesting thing is that when I rerun my code (i.e. `mojo matmul.mojo`), it doesn't run autotune (it seems to hit the JIT cache), but the autotuned kernel actually runs without crashing, with decent performance.

jon-chuang commented 9 months ago

I was able to recreate the issue with the following:

  1. Clear the mojo and transform caches
  2. Run once (the program crashes)
  3. Run again (runs fine)
  4. Make the following code change and rerun:

     fn main() raises:
     -    bench[loop_reorder_swizzled, "Loop reordered (swizzled):"]()
     -    bench[autotuned, "Autotuned:"]()
     +    bench[autotuned, "Autotuned:"]()
     +    bench[loop_reorder_swizzled, "Loop reordered (swizzled):"]()

This worked pretty deterministically. In fact, any code change in step 4 seemed to run into the error from the OP.
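
For reference, main() after the change in step 4 is just the two bench calls swapped:

    fn main() raises:
        bench[autotuned, "Autotuned:"]()
        bench[loop_reorder_swizzled, "Loop reordered (swizzled):"]()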

Mogball commented 9 months ago

Thanks so much, @jon-chuang! We'll take a look.