tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
431 stars 60 forks source link

Device-only, named cbs/arrays have advantages #390

Closed coffeeandcheesecake closed 8 months ago

coffeeandcheesecake commented 1 year ago

Currently during device startup we have to initialize all CBs even those that we don't use. If instead a CB was constructed on demand we could simplify kernel setup. Host-side CB configuration is also error-prone and it's harder to follow the code that is split between host and device.

Named CBs are also easier to follow in the code. So on NCRISC we could have

CB scaler(address, tile_count); // or scaler(device_array);
scaler.push_back(tile);

while on TR0,1,2 we could have

CB scaler;
tile = scaler.pop_front(); // internally inside pop_front we'd have UNPACK(())

And there would be no host-side code to setup.

Allocation can be done dynamically, via something like ptr = thread_sync_alloc(size, thread_mask) api which would synchronize specified threads. Ptr would then be visible to each thread where the code is present (SIMT style code)

Then a CB can be created as CircularBuffer cb1(address); on the device and would initialize only 4 values instead of currently 16x4 values (llk_setup_operands()). It seems plausible that by saving these cycles we could break even on the perf.

Another possible option for improving perf is thread_sync_allocate_multiple() which woudl allocate multiple addresses and only do 1 sync.

We could do the same for device-side arrays (new type on device).

We can remove CB apis from host which will simplify programs - this seems like a big advantage.

The idea is basically to create CBs on the device and bind them to arrays that are created in the beginning of kernels.

Yet another option is to support both host-preallocated arrays and device-allocated arrays and then instantiate CBs on top of those.

As a side note, I'm now wondering how __shared__ allocator is implemented in CUDA :) They do both support in-kernel static shared and launch-config "dynamic" shared memory allocation.

DrJessop commented 1 year ago

Things I like:

Things I need more clarity on:

davorchap commented 8 months ago

No plans to change CB APIs