tenstorrent / tt-umd

User-Mode Driver for Tenstorrent hardware
Apache License 2.0

Move TLB setup to UMD from compiler runtime #44

Open pjanevskiTT opened 2 months ago

pjanevskiTT commented 2 months ago

Problem

Currently we don't set up the TLBs inside UMD; we only provide the API so that compiler runtimes can do it. It looks like we could move this logic into UMD, since we already have a TLB layout and mapping that all compiler runtimes (Buda and Metal) follow, which comes mostly from the syseng guide. There doesn't seem to be much room to map TLBs differently: the layout we are using works fine and there is no clearly more optimal mapping (for example, the 4GB TLBs on BH can't really be used for anything else, and the 2MB TLBs are also fixed at certain positions).

It also leads to code duplication, like the tlb_config files in Buda and Metal. If someone wants to write simple tests using only UMD and some runtime code, they also need to set up the TLBs manually outside of UMD, which can be tedious.

Proposed solution

Basically, it would be nice to move the TLB setup inside UMD, similar to the tlb_config.cpp files in the compilers. We can still do a few things to keep flexibility.

For example, we could have something like this:


// Functions inside some class like tt_SiliconDevice
void setup_tlbs();        // program the default static TLB layout
void setup_hugepages();   // set up hugepage mappings for host access

void setup_device_communication() {
    setup_tlbs();
    setup_hugepages();
}

This way the runtime can call setup_device_communication and UMD will set up the TLBs (and everything else), or the runtime can keep doing it through the already existing API, the way tlb_config.cpp does today, if it wants something different.
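
To make the two paths concrete, here is a rough usage sketch. Everything except setup_device_communication is a placeholder name, not the exact UMD API:

#include <vector>

// Option A: the runtime asks UMD to do the standard setup in one call.
void init_via_umd(tt_SiliconDevice& device) {
    device.setup_device_communication();   // UMD programs the default TLB layout and hugepages
}

// Option B: the runtime keeps full control and programs TLBs itself through the
// already existing low-level API, the way tlb_config.cpp does today.
void init_manually(tt_SiliconDevice& device, const std::vector<tt_xy_pair>& worker_cores) {
    for (const auto& core : worker_cores) {
        device.configure_tlb_for_core(core, /*address=*/0);   // placeholder call
    }
}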

The proposed API is not 100% exact, since we probably first need a more high-level plan of how the UMD code is going to be organized, but I think the point is clear.

joelsmithTT commented 2 months ago

Thanks for raising this and proposing a solution. Moving static TLB assignment into UMD could indeed reduce code duplication and ease the burden of writing standalone tests. However, I'd like to explore some additional considerations that might impact this approach.

  1. How does static assignment address the challenge of chips having more addressable memory than what can be covered by the windows? It seems dynamic assignment is necessary.
  2. How can we account for scenarios where multiple applications need to access the device simultaneously? For instance, gathering telemetry across a chip topology while an ML workload is active.
  3. Given that some of the tools we've shipped do not use UMD (and thus lack knowledge of both the ML application's static allocation scheme and UMD's dynamic TLB synchronization mechanism), how can we ensure cooperation between such tools and user workloads?

Considering these points, I'd like to propose an alternative approach that might address both the original concerns and these additional considerations:

  1. Have KMD administer TLB windows as a resource, allowing user processes to request windows and map them into their address space.
  2. Provide a UMD mechanism for ML applications to establish long-lived TLB window mappings for performance-critical MMIO.
  3. For non-performance critical MMIO, have UMD offer an interface for IO to arbitrary NOC endpoints, dynamically reassigning windows as needed.

I've written about 2 here.
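
For reference, a very rough sketch of what (2) and (3) could look like from the caller's side. Only the TlbWindow name comes from the doc; every method and parameter here is invented for illustration:

#include <cstddef>
#include <cstdint>

class TlbWindow {
public:
    // Acquire a window (handed out by KMD) and point it at a NOC endpoint.
    TlbWindow(int chip, int noc_x, int noc_y, uint64_t base_addr, size_t size);
    ~TlbWindow();   // drop the mapping and release the window

    void write32(uint64_t offset, uint32_t value);   // MMIO through the mapped window
    uint32_t read32(uint64_t offset);
};

// (2) Performance-critical MMIO: the ML runtime holds a long-lived window, e.g.
//     TlbWindow l1_window(chip, x, y, /*base_addr=*/0, /*size=*/2 * 1024 * 1024);
//     l1_window.write32(0x0, 0xdeadbeef);
//
// (3) Non-performance-critical IO: a UMD-level call that picks and reprograms a
//     window internally on every access (placeholder name):
//     umd_io_write(chip, x, y, address, data, size);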

pjanevskiTT commented 1 month ago

Hey, thanks for the detailed response. As a general comment, the attached doc looks great; it's clear you put a lot of thought into it, and the proposed design makes a lot of sense. Here are a few minor comments, and then maybe we can start finalizing the design and implementing it.

> How does static assignment address the challenge of chips having more addressable memory than what can be covered by the windows? It seems dynamic assignment is necessary.

Dynamic assignment would still be done through the already existing API. When I said static, I was thinking more about pre-mapping the TLBs that are usually used for a specific purpose (Tensix L1, DRAM, etc.).
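
Roughly, the split would look something like this (illustrative pseudocode only; tensix_cores, dram_cores and the map_tlb_to_core helper are assumptions, not real UMD members):

void tt_SiliconDevice::setup_tlbs() {
    // "Static" part: pre-map the windows every runtime ends up mapping anyway,
    // e.g. a 2MB window per Tensix core's L1 and windows for the DRAM channels.
    for (const auto& core : tensix_cores) {
        map_tlb_to_core(core, /*address=*/0);       // placeholder helper
    }
    for (const auto& dram_core : dram_cores) {
        map_tlb_to_core(dram_core, /*address=*/0);  // placeholder helper
    }
    // Anything not covered here still goes through the existing
    // dynamic-assignment API on demand.
}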

> Provide a UMD mechanism for ML applications to establish long-lived TLB window mappings for performance-critical MMIO.
>
> For non-performance critical MMIO, have UMD offer an interface for IO to arbitrary NOC endpoints, dynamically reassigning windows as needed.

This is something I wanted to point out as well: we should prepare the TLBs in advance, but do it through UMD. I see you have proposed the same with the TlbWindow class, so we can focus on that design later on.

> Given that some of the tools we've shipped do not use UMD (and thus lack knowledge of both the ML application's static allocation scheme and UMD's dynamic TLB synchronization mechanism), how can we ensure cooperation between such tools and user workloads?

I didn't know this, thanks for pointing it out. One question about this: will we need to manage resources at single-TlbWindow granularity? For example, if the workload wants to use a TLB only for some time and then free it up, we need a way to return just that one TLB as a resource back to KMD, if I understand correctly. Am I seeing this right? This shouldn't be a problem for us.
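
If window-level ownership ends up being the model, one way to express "borrow a TLB for a while and then give it back" would be RAII on the UMD side. A minimal sketch, assuming some per-window allocate/free calls into KMD (the names below are invented, not real ioctls):

int kmd_allocate_tlb_window(int device_fd);            // ask KMD for one free window
void kmd_free_tlb_window(int device_fd, int handle);   // hand that single window back

class ScopedTlbWindow {
public:
    explicit ScopedTlbWindow(int device_fd)
        : fd_(device_fd), handle_(kmd_allocate_tlb_window(device_fd)) {}
    ~ScopedTlbWindow() { kmd_free_tlb_window(fd_, handle_); }   // returns exactly one window

    ScopedTlbWindow(const ScopedTlbWindow&) = delete;
    ScopedTlbWindow& operator=(const ScopedTlbWindow&) = delete;

private:
    int fd_;
    int handle_;
};

// The workload borrows a window for a bounded amount of time:
// {
//     ScopedTlbWindow window(device_fd);
//     // ... use the window ...
// }   // destructor returns the TLB window to KMD here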

Regarding the doc, there's not much more to add; we will probably sync at some point to work out the details. The proposed design looks nice.