Currently, during device startup we have to initialize all CBs, even those we don't use. If instead a CB were constructed on demand, we could simplify kernel setup. Host-side CB configuration is also error-prone, and code that is split between host and device is harder to follow.
Named CBs are also easier to follow in the code. So on NCRISC we could have

CB scaler(address, tile_count); // or scaler(device_array);
scaler.push_back(tile);

while on TR0,1,2 we could have the corresponding consumer-side code. And there would be no host-side setup code at all.
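To make the named-CB idea concrete, here is a minimal host-side C++ model of what such an object might look like. Everything here is an assumption for illustration: the class name, `push_back`/`pop_front` semantics, and the tile-indexed addressing are sketches, not an existing device API.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of a device circular buffer over a flat L1 region.
// All names and semantics here are hypothetical.
class CircularBuffer {
public:
    CircularBuffer(uint32_t base_address, uint32_t tile_count)
        : base_(base_address), capacity_(tile_count) {}

    // Producer side: mark tiles as ready for the consumer.
    bool push_back(uint32_t num_tiles) {
        if (size_ + num_tiles > capacity_) return false;  // would overflow
        wr_ = (wr_ + num_tiles) % capacity_;
        size_ += num_tiles;
        return true;
    }

    // Consumer side: release tiles that have been read.
    bool pop_front(uint32_t num_tiles) {
        if (num_tiles > size_) return false;  // not enough tiles available
        rd_ = (rd_ + num_tiles) % capacity_;
        size_ -= num_tiles;
        return true;
    }

    uint32_t size() const { return size_; }

    // L1 address of the next tile to read, given a tile size in bytes.
    uint32_t read_address(uint32_t tile_bytes) const {
        return base_ + rd_ * tile_bytes;
    }

private:
    uint32_t base_;
    uint32_t capacity_;
    uint32_t rd_ = 0, wr_ = 0, size_ = 0;
};
```

The appeal is that construction and all push/pop operations are methods on one named object, so the producer and consumer code reads symmetrically on both RISCs.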
Allocation can be done dynamically via something like a ptr = thread_sync_alloc(size, thread_mask) API, which would synchronize the specified threads. The pointer would then be visible to each thread where the call is present (SIMT-style code).
Then a CB could be created on the device as CircularBuffer cb1(address); and would initialize only 4 values instead of the current 16x4 values (llk_setup_operands()). It seems plausible that by saving these cycles we could break even on perf.
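A sketch of the setup-cost argument, modeled on the host. The "16x4" figure comes from the text above; the field names (`base`, `limit`, `rd`, `wr`) and both function names are assumptions, since the real register layout written by llk_setup_operands() may differ.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical per-CB state: four 32-bit words.
struct CbRegs { uint32_t base, limit, rd, wr; };

// Today (roughly): all 16 operand CBs are initialized up front,
// whether the kernel uses them or not (~16x4 word writes).
void setup_all_operands(std::array<CbRegs, 16>& cbs,
                        uint32_t l1_base, uint32_t cb_bytes) {
    for (uint32_t i = 0; i < 16; ++i)
        cbs[i] = {l1_base + i * cb_bytes, l1_base + (i + 1) * cb_bytes, 0, 0};
}

// Proposed: construct one CB on demand, writing only its own 4 words.
CbRegs make_cb(uint32_t address, uint32_t bytes) {
    return {address, address + bytes, 0, 0};
}
```

A kernel that uses two CBs would pay for 2x4 word writes instead of 16x4, which is where the hoped-for break-even on perf comes from.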
Another possible option for improving perf is thread_sync_allocate_multiple(), which would allocate multiple addresses and do only one sync.
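The single-sync batching idea can be sketched as a host-side bump allocator where the cross-thread barrier is modeled as a counter. The names, the bump-pointer strategy, and the counter are all illustrative assumptions; the point is only that N allocations can share one synchronization.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side model of a device allocator where each allocation normally
// requires one barrier across the participating threads. The barrier is
// modeled here as a counter; all names are hypothetical.
struct ThreadSyncAllocator {
    uint32_t next;   // bump pointer into the (modeled) L1 region
    int syncs = 0;   // how many barriers we would have paid for

    explicit ThreadSyncAllocator(uint32_t base) : next(base) {}

    // One allocation = one barrier.
    uint32_t alloc(uint32_t size) {
        ++syncs;
        uint32_t p = next;
        next += size;
        return p;
    }

    // N allocations, but only a single barrier: the win described above.
    std::vector<uint32_t> alloc_multiple(const std::vector<uint32_t>& sizes) {
        ++syncs;
        std::vector<uint32_t> ptrs;
        for (uint32_t s : sizes) {
            ptrs.push_back(next);
            next += s;
        }
        return ptrs;
    }
};
```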
We could do the same for device-side arrays (new type on device).
We could remove the CB APIs from the host, which would simplify programs; this seems like a big advantage.
The idea is basically to create CBs on the device and bind them to arrays that are created at the beginning of kernels.
Disadvantages:
__syncthreads() across a few threads might be needed to synchronize allocators (can we avoid or defer this?)
The setup of the CB interface would also need to be shared/synchronized, or somehow baked into the allocation call (possibly as a flag), to reduce the number of syncs.
Naming can also be accomplished by creating aliases for CB ids on the device.
Yet another option is to support both host-preallocated arrays and device-allocated arrays and then instantiate CBs on top of those.
As a side note, I'm now wondering how the __shared__ allocator is implemented in CUDA :) They support both in-kernel static shared memory and launch-config "dynamic" shared memory allocation.
(Kind of off topic from your main point, but I like it nonetheless.) Named CBs: I like the idea of having a CB object where all the APIs we call are methods on the CB.
Things I need more clarity on:
ptr = thread_sync_alloc(...) ... is this ptr the address passed into the CB constructor?
How are we planning on dealing with shared CBs with different CB ids? Recall that matmul has two CBs pointing to the same address (intermediate and output), and only when we push into the output CB can the writer consume.
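For concreteness, here is a minimal host-side model of the aliasing scenario the question refers to: two CB "views" with distinct ids sharing one backing region, where the writer's readiness depends only on the output view. This illustrates the problem, not a proposed solution, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Shared backing storage in L1 (hypothetical).
struct Backing { uint32_t base; uint32_t bytes; };

// A CB "view": its own id and push counter, but aliasing shared storage.
struct CbView {
    const Backing* storage;  // shared region
    uint32_t id;             // distinct CB id
    uint32_t pushed = 0;     // per-view counter

    void push(uint32_t tiles) { pushed += tiles; }
};

// The writer only consumes once the *output* view has been pushed to,
// even though intermediate writes land in the same L1 region.
bool writer_can_consume(const CbView& output, uint32_t needed) {
    return output.pushed >= needed;
}
```

A device-side constructor that takes only an address would need some way to express that two ids alias the same allocation while keeping separate push/pop state.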