tenstorrent / tt-umd

User-Mode Driver for Tenstorrent hardware
Apache License 2.0

Global mutex issues #72

Open broskoTT opened 5 days ago

broskoTT commented 5 days ago

After I talked with @tt-vjovanovic, he pointed out several problems with the global mutexes we use:

The task at hand is to refactor the usage of these mutexes to address these problems. However, it is still not clear what the right path is:

joelsmithTT commented 5 days ago

The presence of these mutexes reflects a system design flaw. Consider the case of the ARC_MSG mutexes: like the other mutexes in UMD, they are implemented with shared memory so that they work in a multiprocess context. If two separate UMD-based applications attempt to message a device's ARC firmware simultaneously, the mutex will serialize the accesses. (The same is true for a single UMD-based application running with multiple threads.)
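To make the serialization described above concrete, here is a minimal sketch of a cross-process lock in Python. It is not UMD's implementation: UMD uses shared-memory mutexes, whereas this sketch swaps in an advisory `flock` file lock, which behaves similarly in that only processes that choose to take the lock are serialized. The lock path and the names `arc_msg_lock` and `send_arc_message` are hypothetical, and the sketch assumes Linux.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock location; every cooperating process must agree on it.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "demo_arc_msg.lock")


@contextmanager
def arc_msg_lock(path=LOCK_PATH, blocking=True):
    """Hold an exclusive advisory lock on `path` for the with-block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # In non-blocking mode this raises BlockingIOError if the lock is held.
        fcntl.flock(fd, flags)
        yield
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)


def send_arc_message(log, sender, path=LOCK_PATH, blocking=True):
    """Stand-in for an ARC message: record the sender while holding the lock."""
    with arc_msg_lock(path, blocking):
        log.append(sender)
```

Any two callers of `send_arc_message` that use the same path are serialized, whether they are threads or separate processes. A caller that talks to the device without going through `arc_msg_lock` is not affected by the lock at all.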

The design is flawed because UMD is not the only software that can interact with hardware. Applications based on the Luwen library can message the ARC firmware without participating in UMD's mutex scheme. The same is true of Python-based tooling used internally.
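The flaw can be illustrated with a toy model. Everything in this sketch is hypothetical: `FakeArcMailbox`, `LockingClient`, and `DirectClient` are made-up names, and a `threading.Lock` stands in for the shared-memory mutex. It only shows that a mutex protects a resource from the clients that agree to take it.

```python
import threading


class FakeArcMailbox:
    """Toy mailbox that can only handle one in-flight message."""

    def __init__(self):
        self.in_flight = None
        self.collisions = 0

    def begin(self, sender):
        if self.in_flight is not None:
            self.collisions += 1  # a second message clobbered the first
        self.in_flight = sender

    def end(self):
        self.in_flight = None


class LockingClient:
    """Participates in the mutex scheme (the role UMD plays)."""

    def __init__(self, mailbox, mutex):
        self.mailbox, self.mutex = mailbox, mutex

    def begin(self):
        # Returns False instead of blocking if another participant holds it.
        if not self.mutex.acquire(blocking=False):
            return False
        self.mailbox.begin("locking")
        return True

    def end(self):
        self.mailbox.end()
        self.mutex.release()


class DirectClient:
    """Talks to the mailbox without taking the mutex (e.g. other tooling)."""

    def __init__(self, mailbox):
        self.mailbox = mailbox

    def begin(self):
        self.mailbox.begin("direct")
        return True

    def end(self):
        self.mailbox.end()
```

If one `LockingClient` has a message in flight, a second `LockingClient` is refused, but a `DirectClient` still gets through and the mailbox records a collision.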

There are a variety of bad techniques to try to solve this:

There is a good solution:

broskoTT commented 5 days ago

Thanks for additional info.

You identified two approaches to solving this issue:

TTDRosen commented 4 days ago

TL;DR for the below: I think that KMD, not UMD, should be where the resource management for the chips lives.

I agree that we should have a single point of entry to simplify the device communication story. It looks like you are proposing that we take the GPU route and turn UMD into a global, dynamically linked library. This could work, but it's not clear to me how you would gather and maintain a global view. From my understanding, your two options for getting and maintaining a global view are to put it in the driver (in which case, why are we pretending that UMD is a requirement?) or to use the filesystem (at which point the original issues arise again, and Docker containers become harder to set up). Furthermore, Linux requires that there be a single KMD per PCI device but places no such restriction on UMD. Therefore KMD, with its global view of our device's PCI resources, should be the one to arbitrate and hand those out.
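As a rough illustration of what "arbitrate and hand those out" could mean, here is a toy model of a single arbiter that owns every resource and grants exclusive use on request. It is only a sketch of the idea, not KMD's interface: `DeviceArbiter`, its methods, and the resource names are all hypothetical, and a real kernel driver would expose this through its own system-call interface.

```python
class DeviceArbiter:
    """Toy model of a single owner that hands out device resources."""

    def __init__(self, resources):
        # Map each resource name to its current holder (None means free).
        self._owners = {name: None for name in resources}

    def acquire(self, resource, client):
        """Grant `resource` to `client` if it is free; return True on success."""
        if self._owners[resource] is None:
            self._owners[resource] = client
            return True
        return False

    def release(self, resource, client):
        """Give a resource back; only the current holder may do so."""
        if self._owners[resource] != client:
            raise PermissionError(f"{client} does not hold {resource}")
        self._owners[resource] = None

    def release_all(self, client):
        """Reclaim everything a client held, e.g. when that client goes away."""
        for name, owner in self._owners.items():
            if owner == client:
                self._owners[name] = None
```

Because every client has to ask the same arbiter, there is no way to opt out of the scheme, which is the property the shared-memory mutexes lack.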

In addition, UMD is not only trying to do PCI resource management; it is also supposed to support simulations and whatever other interfaces customers want to implement. We should not fall into the trap of making it something for everyone.

UMD wants to be a high-level "just read/write" interface. We should not tie a global install to an interface like that. I think it's far simpler to just install UMD as a dependency. We already see the trouble we have with firmware flashing.