Rethink tt_SiliconDevice construction

tenstorrent / tt-umd

User-Mode Driver for Tenstorrent hardware

Apache License 2.0

9 stars 5 forks source link

Rethink tt_SiliconDevice construction #118

Open joelsmithTT opened 1 month ago

joelsmithTT commented 1 month ago

tt_SiliconDevice construction is clumsy:

TLB windows should be an implementation detail to the greatest extent possible. Do not require the caller to specify the relationship between TLB index and strings used to name them.
Hugepages, number of host memory channels should be an implementation detail. What UMD users really care about is memory that is device-accessible.
Flags controlling whether to do allocations and cleanup at construction indicate a broken design (see https://github.com/tenstorrent/tt-umd/issues/72)
tt_SiliconDevice isn't necessarily usable after it is constructed: depending on what the caller is trying to do, there are other methods that need to be invoked.

Part of the problem is that tt_SiliconDevice is not a device abstraction, but has grown into a god object. At the device level, a SOC descriptor makes sense, but a topology descriptor does not. So some of these issues will go away as the design is stratified to be more hierarchical.

broskoTT commented 4 weeks ago

During the experimental cleanups I did in the code, I found that it looks much easier to do tt_SiliconDevice -> tt::umd::Cluster thansition, and then lower the stuff which is chip specific to Chip. So "At the device level, a SOC descriptor makes sense, but a topology descriptor does not" -> topology descriptor should be left in this class, and soc will be moved to new class. This is just a technical comment.

A comment on the constructor: The constructor of tt_SiliconDevice should have 0 mandatory arguments. The constructor of future tt::umd::Chip should have one argument, device_id. Soc and topology descriptors should be fetched from umd, not given to.

broskoTT commented 4 weeks ago

tt_SiliconDevice isn't necessarily usable after it is constructed: depending on what the caller is trying to do, there are other methods that need to be invoked.

It is okay that something else has to be done as well. For example start/stop the device, and maybe some custom configuration prior to starting the device (maybe some custom tlb maps, this is up for discussion anyways, or setting some address parameters), which could even be modified in the same object after stopping/starting the device again. But I do agree that if there is a set of functions that are expected to be called each time after constructor, that should be part of constructor.

joelsmithTT commented 4 weeks ago

I found that it looks much easier to do tt_SiliconDevice -> tt::umd::Cluster thansition, and then lower the stuff which is chip specific to Chip

This is a good insight.

But I do agree that if there is a set of functions that are expected to be called each time after constructor, that should be part of constructor.

My view is that after the object is constructed, the user should be able to call its methods and expect them to work. If the object needs special treatment in between construction and being able to use it normally, then there is opportunity to change the design... a goal is to make objects difficult to use incorrectly.

broskoTT commented 4 weeks ago

If the object needs special treatment in between construction and being able to use it normally

I thought non-mandatory configuration, like something out of standard config. User can use the object normally and everything will work, but if they need something custom (like custom tlb maps), they should set them after construction.

pjanevskiTT commented 3 weeks ago

PR #131 closed it, that wasn't the intention, reopening it to keep track of all things that are not solved yet