tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

FD2 more CB optimizations #7497

Closed · pgkeller closed this 22 hours ago

pgkeller commented 3 months ago

1) Running with more CBs has a ~linear cost. It looks like CB configs are sent with the packed-write command rather than with a single write covering the low -> high water mark.
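
For concreteness, here is a minimal host-side sketch of the two strategies being compared. The helper names are hypothetical, not the tt-metal dispatch API: per-CB writes (cost grows with the number of CBs) versus one linear write covering the whole low..high CB config span.

```cpp
// Hypothetical sketch, not the tt-metal dispatch API: contrasts one write per CB
// with a single linear write spanning [lowest used CB .. highest used CB].
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iterator>
#include <vector>

constexpr uint32_t CB_CONFIG_WORDS = 4;  // assumed words per CB config entry

struct CBConfig {
    uint32_t index;                    // CB index, 0..31
    uint32_t words[CB_CONFIG_WORDS];   // addr/size/num-pages style payload (illustrative)
};

using EnqueueWrite = std::function<void(uint32_t dst_addr, const void* src, uint32_t len)>;

// One write (or packed subcmd) per CB: dispatch cost scales ~linearly with the number of CBs.
void write_cb_configs_per_cb(const std::vector<CBConfig>& cbs, uint32_t cb_config_base,
                             const EnqueueWrite& enqueue_write) {
    for (const auto& cb : cbs) {
        enqueue_write(cb_config_base + cb.index * sizeof(cb.words), cb.words, sizeof(cb.words));
    }
}

// One linear write from the lowest to the highest used CB, even if that means
// sending the (zeroed) entries of unused CBs in between.
void write_cb_configs_linear(const std::vector<CBConfig>& cbs, uint32_t cb_config_base,
                             const EnqueueWrite& enqueue_write) {
    if (cbs.empty()) return;
    uint32_t lo = UINT32_MAX, hi = 0;
    for (const auto& cb : cbs) { lo = std::min(lo, cb.index); hi = std::max(hi, cb.index); }
    std::vector<uint32_t> staging((hi - lo + 1) * CB_CONFIG_WORDS, 0);
    for (const auto& cb : cbs) {
        std::copy(std::begin(cb.words), std::end(cb.words),
                  staging.begin() + (cb.index - lo) * CB_CONFIG_WORDS);
    }
    enqueue_write(cb_config_base + lo * CB_CONFIG_WORDS * sizeof(uint32_t),
                  staging.data(), static_cast<uint32_t>(staging.size() * sizeof(uint32_t)));
}
```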

pgkeller commented 2 months ago

@tt-aho can you close this out if resolved? thanks

pgkeller commented 1 month ago

Checked, not resolved. CBs are using packed writes; they should use a linear write from low..high. Same for semaphores, actually: just one write.

davorchap commented 1 month ago

> Checked, not resolved. CBs are using packed writes; they should use a linear write from low..high. Same for semaphores, actually: just one write.

Totally agree: a single [low..high] write, even if it means sending all 32 CBs and all 4 sems.

davorchap commented 1 month ago

> Checked, not resolved. CBs are using packed writes; they should use a linear write from low..high. Same for semaphores, actually: just one write.

Totally agree: [low..high], even if it means sending all 32 CBs and all 4 sems.

This is probably on the critical path to RN50 perf, so setting to P1 to be handled as part of that effort.

tt-aho commented 1 month ago

We use a packed write for unique core range / CB config settings.

But we write from 0..high for each subcmd. I didn't add the optimization for low because in almost all cases low would be 0, since that's the first input CB.

tt-aho commented 1 month ago

I think this can be closed unless we want to either change from 0 -> lowest CB, or take a look at the semaphores (I haven't looked at how semaphores are handled yet).

davorchap commented 1 month ago

> I think this can be closed unless we want to either change from 0 -> lowest CB, or take a look at the semaphores (I haven't looked at how semaphores are handled yet).

0 is fine. We'd probably want the same scheme for semaphores.

Do we know why the first half of the RN50 convs are slow?

pgkeller commented 1 month ago

0..max is fine. When I run w/ just 1 core range, I get a linear slowdown w/ the number of CBs. Are you saying it is one packed entry for all the used CBs, or one packed entry per CB? I think you are saying the former, but perf looks like the latter (when I get back to this I can print to be sure).

tt-aho commented 1 month ago

> 0..max is fine. When I run w/ just 1 core range, I get a linear slowdown w/ the number of CBs. Are you saying it is one packed entry for all the used CBs, or one packed entry per CB? I think you are saying the former, but perf looks like the latter (when I get back to this I can print to be sure).

Yes, it should be set up to be the former (1 packed cmd for all CBs on the core range).
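
For readers following along, here is an illustrative picture of "the former". The structs and field names are hypothetical, not the actual cq_dispatch command layout: one packed-write command carries one subcommand per unique core range, and each subcommand's payload is that range's full 0..high CB config block.

```cpp
// Illustrative layout only (hypothetical names, not the real cq_dispatch format):
// the command count does not grow with the number of CBs, only with the number of
// unique core-range CB configurations.
#include <cstdint>

struct PackedWriteSubCmd {
    uint32_t noc_xy;        // destination (core range) encoding for this subcommand
    // the 0..high CB config block for this core range follows in the payload section
};

struct PackedWriteCmdHeader {
    uint32_t dst_addr;      // common L1 destination: the CB config base address
    uint16_t num_subcmds;   // one per unique core range, NOT one per CB
    uint16_t payload_bytes; // size of each 0..high CB config block
    // num_subcmds PackedWriteSubCmd entries follow, then the payload blocks
};
```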

pgkeller commented 1 month ago

There is a linear cost in brisc.cc initializing the CBs; maybe I'm seeing that. I'll measure that too when I get back to this (though that was about 20 cycles per CB, which isn't much).
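
(For scale, a back-of-envelope check under the numbers quoted above, assuming all 32 CBs get initialized: 32 × ~20 ≈ 640 cycles of CB init overhead, consistent with "isn't much".)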

tt-aho commented 1 month ago

> I think this can be closed unless we want to either change from 0 -> lowest CB, or take a look at the semaphores (I haven't looked at how semaphores are handled yet).
>
> 0 is fine. We'd probably want the same scheme for semaphores.
>
> Do we know why the first half of the RN50 convs are slow?

I confirmed with tracy that it's slow due to sending kernel binaries (my machine crashed and I lost the tracy report; I will regenerate and attach an image). For example, for the first conv we are doing 30+ write linears for kernel binaries (I haven't looked at why there are so many / whether we can reduce them). The first few writes from cq_dispatch are fast, but we start getting throttled by the prefetcher, so the later write linears are mostly waiting for data from the prefetcher.

pgkeller commented 1 month ago

> I think this can be closed unless we want to either change from 0 -> lowest CB, or take a look at the semaphores (I haven't looked at how semaphores are handled yet).
>
> 0 is fine. We'd probably want the same scheme for semaphores. Do we know why the first half of the RN50 convs are slow?
>
> I confirmed with tracy that it's slow due to sending kernel binaries (my machine crashed and I lost the tracy report; I will regenerate and attach an image). For example, for the first conv we are doing 30+ write linears for kernel binaries (I haven't looked at why there are so many / whether we can reduce them). The first few writes from cq_dispatch are fast, but we start getting throttled by the prefetcher, so the later write linears are mostly waiting for data from the prefetcher.

We don't pack binaries, so this sounds like 6 kernel groups, each w/ B+N+T. I was thinking we could either wait on binary packing until the ring buffer, or implement binary packing + a new packed_write command now and ditch it with the ring buffer... but if this is 6 kernel groups, we might want that packed-write command even with the ring buffer (it would reduce the 6 to 1). Actually, this would be a packed-read flowing into a packed-write, which would hide the DRAM read latency between kernel groups. GS DRAM latency is particularly bad (relative to WH). Hmm.
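
A minimal sketch of the packed-read-into-packed-write idea, assuming a simple double-buffering scheme. All function names here are hypothetical placeholders, not the tt-metal prefetch/dispatch API, and the write is assumed to block until its source buffer can be reused.

```cpp
// Double-buffer kernel-group binaries so the DRAM read of group i+1 overlaps the
// write-out of group i, hiding DRAM read latency between kernel groups.
#include <cstdint>
#include <functional>
#include <vector>

struct KernelGroupBinary {
    uint64_t dram_addr;
    uint32_t size_bytes;
};

void stream_kernel_binaries(
    const std::vector<KernelGroupBinary>& groups,
    const std::function<void(uint64_t src, void* dst, uint32_t len)>& issue_dram_read,
    const std::function<void()>& wait_read_done,
    const std::function<void(const void* src, uint32_t len)>& mcast_write_blocking) {
    if (groups.empty()) return;
    std::vector<uint8_t> buf[2];

    // Prime the pipeline: start fetching the first group's binary.
    buf[0].resize(groups[0].size_bytes);
    issue_dram_read(groups[0].dram_addr, buf[0].data(), groups[0].size_bytes);

    for (size_t i = 0; i < groups.size(); ++i) {
        wait_read_done();  // binary i is now resident in buf[i & 1]
        size_t next = i + 1;
        if (next < groups.size()) {
            // Overlap: kick off the DRAM read for group i+1 into the other buffer
            // before shipping group i.
            buf[next & 1].resize(groups[next].size_bytes);
            issue_dram_read(groups[next].dram_addr, buf[next & 1].data(), groups[next].size_bytes);
        }
        mcast_write_blocking(buf[i & 1].data(), groups[i].size_bytes);  // ship binary i
    }
}
```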

tt-aho commented 1 month ago

Should we split the kernel bin discussion into a separate issue?

For CBs we can also look into densely packing the used CBs. A lot of ops that follow the standard input/output CB flow waste a lot of CB initialization, since input CBs start at 0 but output CBs start at 16, so even a single input + output CB pair results in 17 initializations.
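
A hypothetical sketch of what dense CB-index packing could look like on the host side (not the actual API change, which is tracked separately below): remap the CBs an op actually uses onto contiguous indices so a 0..highest-used write only covers live entries.

```cpp
// Hypothetical illustration of dense CB-index packing, not the tt-metal API.
#include <cstdint>
#include <unordered_map>
#include <vector>

std::unordered_map<uint32_t, uint32_t> densify_cb_indices(const std::vector<uint32_t>& used) {
    std::unordered_map<uint32_t, uint32_t> remap;  // original CB index -> dense index
    uint32_t next = 0;
    for (uint32_t idx : used) {
        if (remap.emplace(idx, next).second) ++next;
    }
    return remap;
}

// Example from the comment above: input CB 0 + output CB 16 written as 0..high is
// 17 config entries; after remapping {0 -> 0, 16 -> 1} the same op sends only 2.
```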

pgkeller commented 1 month ago

Filed #8857 for packing kernel binaries

#7493 already tracks the API changes for packing CB indices

pgkeller commented 22 hours ago

I think this is stale at this point, so I'm closing it; we'll re-open if future perf investigations show an issue.