The above graph demonstrates the success of our current inplacing algorithm.
However, we need to take this a step further and go from Inplacing to Inlining.
fn main(...) {
    let x_offset = group_id.x * 64u;
    let dst_offset = (group_id.y * num_groups.x * 64u) + x_offset + local_index;

    // Convert the 1D destination offset into a 4D index
    let dst_index = offsetToNdIndex(dst_offset, metadata.dst_stride);

    // Route each destination coordinate to its source axis via the permutation
    var src_index = vec4<u32>(0u);
    src_index[metadata.perm[0]] = dst_index[0];
    src_index[metadata.perm[1]] = dst_index[1];
    src_index[metadata.perm[2]] = dst_index[2];
    src_index[metadata.perm[3]] = dst_index[3];

    // Convert the 4D source index into a 1D offset
    let src_offset = ndIndexToOffset(src_index, metadata.src_offsets, metadata.src_stride);

    Y[dst_offset] = X[src_offset];
}
The above is our current permute shader. Instead of running each subsequent injective operation as a separate pass over the output buffer of permute, we could inline all of the injective operations into the permute shader itself.
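To make the idea concrete, here is a minimal CPU-side sketch in Python rather than WGSL, so that it can be run and checked. It mirrors the indexing of the permute shader above and applies a chain of elementwise functions at the point where the shader writes `Y[dst_offset]`. The helper names (`contiguous_strides`, `permute_inlined`, the `ops` parameter) are hypothetical, and the sketch assumes contiguous row-major strides and zero source offsets, which the real `metadata` may not guarantee.

```python
import math


def contiguous_strides(shape):
    """Row-major strides for a shape (assumption: tensors are contiguous)."""
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return strides


def offset_to_nd_index(offset, stride):
    """Convert a 1D offset into an N-D index."""
    index = []
    for s in stride:
        index.append(offset // s)
        offset %= s
    return index


def nd_index_to_offset(index, stride):
    """Convert an N-D index into a 1D offset (source offsets assumed to be zero)."""
    return sum(i * s for i, s in zip(index, stride))


def permute_inlined(x, src_shape, perm, ops=()):
    """Permute a flat 4D tensor, applying every function in `ops` as each element is written."""
    dst_shape = [src_shape[p] for p in perm]
    src_stride = contiguous_strides(src_shape)
    dst_stride = contiguous_strides(dst_shape)
    y = [0.0] * len(x)
    for dst_offset in range(len(x)):  # one iteration per shader invocation
        dst_index = offset_to_nd_index(dst_offset, dst_stride)
        src_index = [0] * 4
        for d in range(4):
            src_index[perm[d]] = dst_index[d]
        src_offset = nd_index_to_offset(src_index, src_stride)
        value = x[src_offset]
        for op in ops:  # the inlined injective operations
            value = op(value)
        y[dst_offset] = value
    return y


x = [float(i) for i in range(24)]
shape, perm = [1, 2, 3, 4], [0, 3, 2, 1]

# One pass with cos and exp inlined into the permute...
fused = permute_inlined(x, shape, perm, ops=(math.cos, math.exp))
# ...versus a plain permute followed by a separate elementwise pass.
two_pass = [math.exp(math.cos(v)) for v in permute_inlined(x, shape, perm)]
```

Both paths produce the same values, but the fused version never writes the intermediate permuted buffer. In the shader itself, the equivalent change would presumably be confined to the final store.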
You'll want to introduce an IR that keeps track of the size of each tensor and the "type" of each operation.
You can coalesce operations with the same "type". In the example you've given, the elementwise operations cos / exp / gelu can be bundled into a single node.
This requires runtime code generation for each IR node, as you will no longer know ahead of time what your final execution environment will look like.
This is crucial, and it ties into Code Generation.
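The three suggestions above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Node` class, the kind labels, `coalesce`, and `emit_store` are all hypothetical names, not part of any existing codebase. The generated string targets the store line of the permute shader, and `gelu` is assumed to be a helper function that the generated shader would define, since it is not a WGSL built-in.

```python
from dataclasses import dataclass

# Hypothetical operation "types" for the IR.
ELEMENTWISE = "elementwise"
REDUCE = "reduce"


@dataclass
class Node:
    kind: str     # the "type" of the operation
    ops: list     # operation names, in application order
    shape: tuple  # size of the tensor the node produces


def coalesce(nodes):
    """Merge each run of adjacent elementwise nodes with equal shapes into one node."""
    fused = []
    for node in nodes:
        prev = fused[-1] if fused else None
        if (prev is not None
                and prev.kind == node.kind == ELEMENTWISE
                and prev.shape == node.shape):
            prev.ops.extend(node.ops)
        else:
            # Copy so the caller's nodes are never mutated.
            fused.append(Node(node.kind, list(node.ops), node.shape))
    return fused


def emit_store(node, src="X[src_offset]", dst="Y[dst_offset]"):
    """Generate the shader's store statement for a fused elementwise node at runtime."""
    expr = src
    for op in node.ops:
        expr = f"{op}({expr})"
    return f"{dst} = {expr};"


graph = [
    Node(ELEMENTWISE, ["cos"], (2, 3)),
    Node(ELEMENTWISE, ["exp"], (2, 3)),
    Node(ELEMENTWISE, ["gelu"], (2, 3)),
]
fused = coalesce(graph)
print(len(fused))            # 1
print(emit_store(fused[0]))  # Y[dst_offset] = gelu(exp(cos(X[src_offset])));
```

Because the fused expression is only known once the graph has been coalesced, the shader source has to be assembled at runtime, which is where code generation comes in.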
This (contrived) example would collapse everything into a single node, which is super important.