Currently, during device startup we have to initialize all CBs, even those we don't use. If instead a CB were constructed on demand, we could simplify kernel setup. Host-side CB configuration is also error-prone, and code that is split between host and device is harder to follow.
Named CBs are also easier to follow in the code. So on NCRISC we could have

CB scaler(address, tile_count); // or scaler(device_array);
scaler.push_back(tile);

while on TR0,1,2 we could have the corresponding consumer-side code. And there would be no host-side setup code at all.
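To make the named-CB idea concrete, here is a minimal host-side C++ model of what such an object might look like. Everything here is an assumption for illustration: the class name, `push_back`/`pop_front` semantics, and the tile-indexed addressing are sketches, not an existing device API.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of a device circular buffer over a flat L1 region.
// All names and semantics here are hypothetical.
class CircularBuffer {
public:
    CircularBuffer(uint32_t base_address, uint32_t tile_count)
        : base_(base_address), capacity_(tile_count) {}

    // Producer side: mark tiles as ready for the consumer.
    bool push_back(uint32_t num_tiles) {
        if (size_ + num_tiles > capacity_) return false;  // would overflow
        wr_ = (wr_ + num_tiles) % capacity_;
        size_ += num_tiles;
        return true;
    }

    // Consumer side: release tiles that have been read.
    bool pop_front(uint32_t num_tiles) {
        if (num_tiles > size_) return false;  // not enough tiles available
        rd_ = (rd_ + num_tiles) % capacity_;
        size_ -= num_tiles;
        return true;
    }

    uint32_t size() const { return size_; }

    // L1 address of the next tile to read, given a tile size in bytes.
    uint32_t read_address(uint32_t tile_bytes) const {
        return base_ + rd_ * tile_bytes;
    }

private:
    uint32_t base_;
    uint32_t capacity_;
    uint32_t rd_ = 0, wr_ = 0, size_ = 0;
};
```

The appeal is that construction and all push/pop operations are methods on one named object, so the producer and consumer code reads symmetrically on both RISCs.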
Allocation can be done dynamically via something like a ptr = thread_sync_alloc(size, thread_mask) API, which would synchronize the specified threads. The pointer would then be visible to each thread where the call is present (SIMT-style code).
Then a CB could be created on the device as CircularBuffer cb1(address); and would initialize only 4 values instead of the current 16x4 values (llk_setup_operands()). It seems plausible that by saving these cycles we could break even on perf.
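A sketch of the setup-cost argument, modeled on the host. The "16x4" figure comes from the text above; the field names (`base`, `limit`, `rd`, `wr`) and both function names are assumptions, since the real register layout written by llk_setup_operands() may differ.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical per-CB state: four 32-bit words.
struct CbRegs { uint32_t base, limit, rd, wr; };

// Today (roughly): all 16 operand CBs are initialized up front,
// whether the kernel uses them or not (~16x4 word writes).
void setup_all_operands(std::array<CbRegs, 16>& cbs,
                        uint32_t l1_base, uint32_t cb_bytes) {
    for (uint32_t i = 0; i < 16; ++i)
        cbs[i] = {l1_base + i * cb_bytes, l1_base + (i + 1) * cb_bytes, 0, 0};
}

// Proposed: construct one CB on demand, writing only its own 4 words.
CbRegs make_cb(uint32_t address, uint32_t bytes) {
    return {address, address + bytes, 0, 0};
}
```

A kernel that uses two CBs would pay for 2x4 word writes instead of 16x4, which is where the hoped-for break-even on perf comes from.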
Another possible option for improving perf is thread_sync_allocate_multiple(), which would allocate multiple addresses and do only one sync.
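The single-sync batching idea can be sketched as a host-side bump allocator where the cross-thread barrier is modeled as a counter. The names, the bump-pointer strategy, and the counter are all illustrative assumptions; the point is only that N allocations can share one synchronization.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side model of a device allocator where each allocation normally
// requires one barrier across the participating threads. The barrier is
// modeled here as a counter; all names are hypothetical.
struct ThreadSyncAllocator {
    uint32_t next;   // bump pointer into the (modeled) L1 region
    int syncs = 0;   // how many barriers we would have paid for

    explicit ThreadSyncAllocator(uint32_t base) : next(base) {}

    // One allocation = one barrier.
    uint32_t alloc(uint32_t size) {
        ++syncs;
        uint32_t p = next;
        next += size;
        return p;
    }

    // N allocations, but only a single barrier: the win described above.
    std::vector<uint32_t> alloc_multiple(const std::vector<uint32_t>& sizes) {
        ++syncs;
        std::vector<uint32_t> ptrs;
        for (uint32_t s : sizes) {
            ptrs.push_back(next);
            next += s;
        }
        return ptrs;
    }
};
```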
We could do the same for device-side arrays (new type on device).
We could remove the CB APIs from the host, which would simplify programs; this seems like a big advantage.
The idea is basically to create CBs on the device and bind them to arrays that are created at the beginning of kernels.
Disadvantages:
__syncthreads() across a few threads might be needed to synchronize allocators (can we avoid or defer this?)
The setup of the CB interface would also need to be shared/synchronized, or somehow baked into the allocation call (possibly as a flag), to reduce the number of syncs.
Naming can also be accomplished by creating aliases for CB ids on the device.
Yet another option is to support both host-preallocated arrays and device-allocated arrays and then instantiate CBs on top of those.
As a side note, I'm now wondering how the __shared__ allocator is implemented in CUDA :) They support both in-kernel static shared memory and launch-config "dynamic" shared memory allocation.
(Kind of off topic from your main point, but I like it nonetheless.) Named CBs: I like the idea of having a CB object where all the APIs we call are methods on the CB.
Things I need more clarity on:
ptr = thread_sync_alloc(...) ... is this ptr the address passed into the CB constructor?
How are we planning on dealing with shared CBs with different CB ids? Recall that matmul has two CBs pointing to the same address (intermediate and output), and only when we push into the output CB can the writer consume.
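For concreteness, here is a minimal host-side model of the aliasing scenario the question refers to: two CB "views" with distinct ids sharing one backing region, where the writer's readiness depends only on the output view. This illustrates the problem, not a proposed solution, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Shared backing storage in L1 (hypothetical).
struct Backing { uint32_t base; uint32_t bytes; };

// A CB "view": its own id and push counter, but aliasing shared storage.
struct CbView {
    const Backing* storage;  // shared region
    uint32_t id;             // distinct CB id
    uint32_t pushed = 0;     // per-view counter

    void push(uint32_t tiles) { pushed += tiles; }
};

// The writer only consumes once the *output* view has been pushed to,
// even though intermediate writes land in the same L1 region.
bool writer_can_consume(const CbView& output, uint32_t needed) {
    return output.pushed >= needed;
}
```

A device-side constructor that takes only an address would need some way to express that two ids alias the same allocation while keeping separate push/pop state.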