Operator Fusion

Operator fusion is crucial and ties into Code Generation.

[Graph: allocations]

The above graph demonstrates the success of our current inplacing algorithm.

However, we need to take this a step further and go from Inplacing to Inlining.

fn main(...) {
    let x_offset = group_id.x * 64u;
    var dst_offset = (group_id.y * num_groups.x * 64u) + x_offset + local_index;

    //Convert 1D offset into 4D index
    let dst_index = offsetToNdIndex(dst_offset, metadata.dst_stride);

    var src_index = vec4<u32>(0u);
    src_index[metadata.perm[0]] = dst_index[0]; 
    src_index[metadata.perm[1]] = dst_index[1];
    src_index[metadata.perm[2]] = dst_index[2];
    src_index[metadata.perm[3]] = dst_index[3];

    //Convert 4D index into 1D offset
    let src_offset = ndIndexToOffset(src_index, metadata.src_offsets, metadata.src_stride);

    Y[dst_offset] = X[src_offset];
}

The above is our current permute shader. Instead of performing subsequent injective operations on the output buffer of permute, we could inline all of the injective operations like so:

fn main(...) {
    //omit
    Y[dst_offset] = cos(exp(gelu(X[src_offset])));
}

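This (contrived) example collapses the permute and all of the subsequent injective operations into a single node, so none of the intermediate results need their own output buffer. That is why this is super important.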
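Since this ties into code generation, here is a minimal sketch of how the fused store line could be generated from a chain of injective ops. Everything in it is hypothetical: `UnaryOp`, `wgsl_name` and `fuse_store` are illustrative names, not existing Ratchet APIs, and `gelu` is assumed to be a helper function defined elsewhere in the shader (`exp` and `cos` are WGSL built-ins).

```rust
/// Hypothetical set of injective (element-wise) ops; not Ratchet's actual op enum.
#[derive(Clone, Copy, Debug)]
enum UnaryOp {
    Gelu,
    Exp,
    Cos,
}

impl UnaryOp {
    /// Name of the WGSL function assumed to implement the op.
    /// `exp` and `cos` are WGSL built-ins; `gelu` is assumed to be a
    /// helper defined elsewhere in the shader source.
    fn wgsl_name(self) -> &'static str {
        match self {
            UnaryOp::Gelu => "gelu",
            UnaryOp::Exp => "exp",
            UnaryOp::Cos => "cos",
        }
    }
}

/// Wraps the `load` expression in each op, in application order
/// (the first op in `ops` ends up innermost), and emits the store line.
fn fuse_store(load: &str, ops: &[UnaryOp]) -> String {
    let expr = ops
        .iter()
        .fold(load.to_string(), |acc, op| format!("{}({})", op.wgsl_name(), acc));
    format!("Y[dst_offset] = {};", expr)
}

fn main() {
    // permute -> gelu -> exp -> cos, collapsed into one store expression.
    let ops = [UnaryOp::Gelu, UnaryOp::Exp, UnaryOp::Cos];
    println!("{}", fuse_store("X[src_offset]", &ops));
}
```

With an empty op list the function falls back to the plain permute store, so the unfused shader is just the degenerate case of the fused one.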

huggingface / ratchet

Operator Fusion #194