In putting this together, it's become obvious that a lot of what is being "templated" would be repeated in each and every slice. Would it be possible to create a separate API server object to hold the "template" objects for a given driver that can then be referenced by its resource slices? Possibly even leveraging a ConfigMap to do it instead of defining a new type.
> In putting this together, it's become obvious that a lot of what is being "templated" would be repeated in each and every slice. Would it be possible to create a separate API server object to hold the "template" objects for a given driver that can then be referenced by its resource slices? Possibly even leveraging a ConfigMap to do it instead of defining a new type.
Certainly it's possible, the question is whether it is worth the complexity, since then you have another independent object that can change or be missing, etc. This would help for any of the options 2, 4+ actually.
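To make the quoted suggestion a bit more concrete, here is a very rough sketch of the ConfigMap variant; every name, key, and field below is a hypothetical illustration rather than an existing or proposed API:

```yaml
# Hypothetical sketch only: per-driver "template" data published once and
# referenced by that driver's ResourceSlices, instead of being repeated in each slice.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-driver-device-templates   # illustrative name
  namespace: gpu-driver               # illustrative namespace
data:
  a100-sxm4-40gb: |
    commonAttributes:
      productName: NVIDIA A100-SXM4-40GB
      architecture: ampere
    partitionShapes:                  # one entry per shape, not one per partition
      - name: 1g.5gb
        memory: 5Gi
      - name: 2g.10gb
        memory: 10Gi
```

Each slice would then carry only a small reference (e.g. the ConfigMap name plus a key) instead of repeating the template data itself.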
One thing we may want to think about is which factors drive scale and which are likely to grow fastest over time:

- `Ps` (partition shapes per device) - if we think GPU sizes / memory blocks are going to increase dramatically, this number will increase
- `Ppd` (partitions per device) - similarly, this will increase even more as memory blocks increase
- `Dpn` (devices per node) - I expect this will stay around 8-16 for quite some time, WDYT?
- `N` (nodes in the cluster) - varies per cluster, but we should think O(10,000) at least, if not 10x that in the long run based on historical trends
- `Spn` (slices per node) - depends on the particular slice size and specific slice design choices

We can characterize each suggestion then based on which of these scaling factors are relevant. Everything scales with `N` (holding the others fixed), but suggestions like the one quoted above can reduce the scale constant for some of the options. Nonetheless, for now let's think per node:

- `O(Ppd * Dpn)`
- `O(Spn)` for the factored out common attributes, plus `O(Ppd * Dpn)` for the rest
- `O(Spn * Ppd)` for the device shape, plus `O(Dpn)` for the rest
- `O(Spn * Ps)` (I suspect that `Ps = O(log Ppd)`, so this is an improvement, to basically `O(Spn * log Ppd)`), plus `O(Dpn)` for the rest
- `O(Spn * Ps)` since you list each partition shape, plus `O(Ppd * Dpn)` since you explicitly list each device partition

The suggestion above would change these (on a per node basis):

- `O(Spn)` for the common attributes, but the scaling factor would change
- `O(Spn)` for the device shape, since it would just be a constant reference
- `O(Spn)` for the device shape
- `O(Spn)` for the templates

Setting that aside, going back to the options without that suggestion, it would be possible to merge options 4 and 6 (option "10"...no, better stick with 7), such that: 1) we capture each partition shape once like in option 6; 2) we implicitly generate partitions like in option 4. If we did that, we would have:

- `O(Spn * Ps)` for the shapes/templates
- `O(Dpn)` for the rest

which seems like the best we can do while keeping the repeated items in the slice (rough numbers sketched just below).
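To put rough numbers on that per-node comparison, take some purely illustrative assumptions (not measurements from any real driver): `Dpn = 8`, `Ppd = 28`, `Ps = 7`, `Spn = 4`. Listing every partition of every device explicitly versus capturing each shape once and generating partitions implicitly then comes out to roughly:

```math
P_{pd} \cdot D_{pn} = 28 \cdot 8 = 224
\qquad \text{vs.} \qquad
S_{pn} \cdot P_s + D_{pn} = 4 \cdot 7 + 8 = 36
```

entries per node; either way, the cluster-wide footprint is that figure multiplied by `N`.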
Thinking more, I really do think that the things that will likely increase the most in the next 3-5 years are `N`, `Ps`, and `Ppd`.

This means that factoring out things that are duplicated per slice is a good idea, as the number of slices will increase with `N`. Not only that, but if the "front matter" - the duplicated things like shapes/templates - increases in size, we leave less and less space for the actual devices. This causes an increase in slices per node!

In other words, let's try to prevent growth being a multiplicative factor of `N` with either `Ps` or `Ppd`.
This makes me think our best bet is going to be a separate object for the "front matter", which grows with `O(Ps * Ppd)` but NOT with `N`, plus per-node slices that only reference it (so their size no longer depends on `Ps` and `Ppd`). Thus, the total for this becomes `O(N * Spn * Dpn)`. Since we expect `Dpn` to be relatively fixed, and since we moved all the "growth" out of the slice, `Spn` will also be fixed, so this is effectively `O(N)`, which is really the best we can do.

I hadn't put the numbers together, but your conclusion at the end is where my head was when suggesting this. There will still need to be some per-slice "template" data (e.g. the `pcie-root`
attributes from my example), but it would be info that is relevant just to the devices in the slice, so it actually lives in the appropriate place.
I picture one "front matter" object per GPU type which defines everything that is non-node-specific. And then each device in a resource slice has fields that point to a specific "front matter" object and then pull bits and pieces from it as appropriate.
Simple devices can still be just a named list of attributes, but if you want anything more sophisticated you have to start using this more complex structure.
> I picture one "front matter" object per GPU type which defines everything that is non-node-specific. And then each device in a resource slice has fields that point to a specific "front matter" object and then pull bits and pieces from it as appropriate.
Yes, that's what I am thinking too. Basically, push the stuff that is invariant across nodes into a separate object, and then refer to it. Those "front matter" pieces are probably constant for a given combination of hardware, firmware, and driver versions.
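A minimal sketch of how that could look from the device's side, assuming a hypothetical `GpuFrontMatter` kind and `frontMatterRef` field (the ResourceSlice shape is simplified as well; none of this is an agreed API):

```yaml
# Hypothetical sketch only -- kind, group, and field names are illustrative.
apiVersion: resource.k8s.io/v1alpha2   # whatever API group/version this would land in
kind: GpuFrontMatter                   # hypothetical: one object per GPU type
metadata:
  name: nvidia-a100-sxm4-40gb
spec:
  attributes:                          # everything that is not node-specific
    productName: NVIDIA A100-SXM4-40GB
    architecture: ampere
  partitionShapes:
    - name: 1g.5gb
      memory: 5Gi
---
# Simplified, illustrative ResourceSlice: each device just points at the shared object.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceSlice
metadata:
  name: node-a-gpus
spec:
  nodeName: node-a
  devices:
    - name: gpu-0
      frontMatterRef:                  # hypothetical field referencing the front matter
        name: nvidia-a100-sxm4-40gb
      attributes:                      # only the node-specific bits stay in the slice
        uuid: "<node-local-uuid>"
```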
FYI I added this as "Option 6" as well as "Option 7" here: https://github.com/kubernetes-sigs/wg-device-management/issues/20#issuecomment-2168189769
In relation to what came up in the call tonight ...
Instead of having a single centralized object with all of the "front matter", we could have one "front matter" object per node that all of the slices for that node refer to. It would likely have redundant information compared to most other nodes, but then we at least keep the front matter separate from the resource slices that consume it (and if a driver does want to go through the headache of centralizing it, they still can).
I haven't yet written this up properly (or added any code for it), but I wanted to push something out there with my thoughts around how to support partitioning in a more compact way.
Below is the (incomplete) YAML for what one A100 GPU with MIG disabled, one A100 with MIG enabled, and one H100 GPU (regardless of MIG mode) would look like. I am currently only showing the full GPUs and the `1g.*gb` devices (because I wrote this by hand), but you can imagine how it would be expanded with the rest.

Most of it is self-explanatory, except for one thing -- what the new `sharedCapacityInstances` field on a device implies. It is a way to define a "boundary" for any shared capacity referenced in a device template, meaning that all devices that provide the same mappings for a given `sharedCapacityInstance` will pull from the same `SharedCapacity`.

I will add more details soon (as well as a full prototype), but I wanted to get this out for initial comments before then.
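As a rough illustration of that boundary (a hypothetical fragment, not the actual YAML referenced in this comment; apart from `sharedCapacityInstances` and the shared-capacity/template concepts described above, every name and value is an assumption):

```yaml
# Hypothetical fragment only -- not the YAML referenced in this comment.
sharedCapacities:
  - name: gpu-0-memory              # one memory pool per physical GPU (assumed layout)
    capacity: 40Gi
  - name: gpu-1-memory
    capacity: 40Gi
deviceTemplates:
  - name: mig-1g.5gb
    consumesCapacity:
      memory: 5Gi                   # drawn from whichever shared capacity "memory" maps to
devices:
  - name: gpu-0-mig-1g.5gb-0
    template: mig-1g.5gb
    sharedCapacityInstances:
      memory: gpu-0-memory          # same mapping as the next device -> same pool
  - name: gpu-0-mig-1g.5gb-1
    template: mig-1g.5gb
    sharedCapacityInstances:
      memory: gpu-0-memory
  - name: gpu-1-mig-1g.5gb-0
    template: mig-1g.5gb
    sharedCapacityInstances:
      memory: gpu-1-memory          # different mapping -> pulls from gpu-1's pool
```

Here the first two devices provide the same mapping for `memory`, so they pull from the same `SharedCapacity`; the third maps to a different instance and therefore has its own budget.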