google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0
15.46k stars 1.27k forks

Relax NVIDIA ABI version check for version ranges with no changes #10628

Open EtiennePerot opened 1 month ago

EtiennePerot commented 1 month ago

Description

Currently, nvproxy's ABI version tree describes the ABI of each individual version number. This means that users need to have exactly the right driver version in order to use runsc. This hampers usability of nvproxy; see #10605 and #10624 for recent examples.

I propose that nvproxy's logic be relaxed for version ranges with no ABI differences. In other words, if the ABI has not changed from version 1.20.30 to version 1.40.50, then when running on a host with driver version 1.25.10, nvproxy should detect that this version falls within the range and automatically use its definition for ABI version 1.40.50.
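A minimal sketch of that lookup, to make the proposal concrete. Everything here is hypothetical and is not nvproxy's actual code: the `driverVersion` and `abiRange` types, the `parseVersion` and `resolveABI` helpers, and the assumption that driver versions are three dot-separated integers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// driverVersion is a hypothetical three-part driver version (not nvproxy's type).
type driverVersion struct{ major, minor, patch int }

// parseVersion parses a "major.minor.patch" string.
func parseVersion(s string) (driverVersion, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return driverVersion{}, fmt.Errorf("invalid version %q", s)
	}
	var n [3]int
	for i, p := range parts {
		v, err := strconv.Atoi(p)
		if err != nil {
			return driverVersion{}, fmt.Errorf("invalid version %q: %w", s, err)
		}
		n[i] = v
	}
	return driverVersion{n[0], n[1], n[2]}, nil
}

// compare returns -1, 0, or 1 if v is lower than, equal to, or higher than o.
func (v driverVersion) compare(o driverVersion) int {
	a := [3]int{v.major, v.minor, v.patch}
	b := [3]int{o.major, o.minor, o.patch}
	for i := range a {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// abiRange is an inclusive version range assumed to have been verified
// as sharing a single ABI.
type abiRange struct {
	first, last driverVersion
}

// resolveABI returns the version whose ABI definition should be used for
// the host driver, or false if the host version is in no verified range.
func resolveABI(ranges []abiRange, host driverVersion) (driverVersion, bool) {
	for _, r := range ranges {
		if host.compare(r.first) >= 0 && host.compare(r.last) <= 0 {
			return r.last, true
		}
	}
	return driverVersion{}, false
}

func main() {
	ranges := []abiRange{{driverVersion{1, 20, 30}, driverVersion{1, 40, 50}}}
	host, err := parseVersion("1.25.10")
	if err != nil {
		panic(err)
	}
	if v, ok := resolveABI(ranges, host); ok {
		fmt.Printf("use ABI definition for %d.%d.%d\n", v.major, v.minor, v.patch)
	} else {
		fmt.Println("unsupported driver version")
	}
}
```

A host version outside every verified range is still rejected, which keeps today's behavior for anything that has not been checked.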

This of course assumes that every version in the middle of this range indeed doesn't have any ABI changes. Therefore, in order to support this feature, the first task is to retroactively verify that this is the case between existing supported nvproxy ABI versions. For example, if there was any ABI change that was later reverted in the middle of the range, the nvproxy ABI version tree needs to have this range split to reflect the reality. An NVIDIA driver diffing tool should help here.
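The range-splitting step could be sketched as follows, assuming a diffing tool can emit one ABI fingerprint (for example, a hash of the relevant struct definitions) per driver release. The `release` and `verifiedRange` types, the `coalesce` function, and the fingerprint values are all hypothetical.

```go
package main

import "fmt"

// release pairs a driver version with a hypothetical fingerprint of its
// ABI-relevant definitions, as a driver diffing tool might produce.
type release struct {
	version     string
	fingerprint string
}

// verifiedRange is a run of consecutive releases with the same fingerprint.
type verifiedRange struct {
	first, last string
	fingerprint string
}

// coalesce merges consecutive releases that share a fingerprint. The input
// must be sorted by version and must include every release in the span;
// a skipped release could hide a change that was later reverted.
func coalesce(releases []release) []verifiedRange {
	var out []verifiedRange
	for _, r := range releases {
		if n := len(out); n > 0 && out[n-1].fingerprint == r.fingerprint {
			out[n-1].last = r.version // extend the current run
			continue
		}
		out = append(out, verifiedRange{r.version, r.version, r.fingerprint})
	}
	return out
}

func main() {
	// The ABI changes at 1.30.00 and is reverted at 1.35.20.
	releases := []release{
		{"1.20.30", "abi-a"},
		{"1.25.10", "abi-a"},
		{"1.30.00", "abi-b"},
		{"1.35.20", "abi-a"},
		{"1.40.50", "abi-a"},
	}
	for _, r := range coalesce(releases) {
		fmt.Printf("%s..%s %s\n", r.first, r.last, r.fingerprint)
	}
}
```

Because only adjacent releases are merged, a change that is later reverted produces three ranges rather than one, even though the endpoints 1.20.30 and 1.40.50 have identical fingerprints.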

Is this feature related to a specific bug?

N/A

Do you have a specific solution in mind?

See above.

EtiennePerot commented 1 month ago

/cc @AC-Dap @ayushr2

ayushr2 commented 1 month ago

That makes sense. Maybe we need to reorganize how the version->ABI data is stored into something like a segment tree, which is better suited to representing ranges.
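As a rough illustration of range-oriented storage, here is a sketch that uses a sorted slice of non-overlapping ranges with binary search rather than a full segment tree; for a small, rarely changing table this may be enough. The `version`, `segment`, and `abiTable` names and the ABI labels are hypothetical and do not reflect nvproxy's data structures.

```go
package main

import (
	"fmt"
	"sort"
)

// version is a hypothetical {major, minor, patch} driver version.
type version [3]int

// less reports whether v orders strictly before o.
func (v version) less(o version) bool {
	for i := range v {
		if v[i] != o[i] {
			return v[i] < o[i]
		}
	}
	return false
}

// segment is an inclusive range of versions sharing one ABI definition.
type segment struct {
	first, last version
	abi         string // hypothetical label for the ABI definition
}

// abiTable holds segments sorted by first version, with no overlaps.
type abiTable []segment

// lookup finds the segment containing v in O(log n).
func (t abiTable) lookup(v version) (string, bool) {
	// Smallest index whose segment ends at or after v.
	i := sort.Search(len(t), func(i int) bool { return !t[i].last.less(v) })
	if i < len(t) && !v.less(t[i].first) {
		return t[i].abi, true
	}
	return "", false // v falls in a gap or outside the table
}

func main() {
	table := abiTable{
		{version{1, 20, 30}, version{1, 40, 50}, "abi-1.40.50"},
		{version{1, 41, 0}, version{1, 41, 0}, "abi-1.41.0"},
	}
	abi, ok := table.lookup(version{1, 25, 10})
	fmt.Println(abi, ok)
}
```

Gaps between segments stay unsupported, so unverified versions are not silently accepted.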

> This of course assumes that every version in the middle of this range indeed doesn't have any ABI changes.

Yes, this is a big assumption because roll-backs and roll-forwards are possible within ranges. The driver diffing tool is necessary here.

nixprime commented 1 month ago

> This of course assumes that every version in the middle of this range indeed doesn't have any ABI changes. Therefore, in order to support this feature, the first task is to retroactively verify that this is the case between existing supported nvproxy ABI versions.

IIUC this isn't a one-off cost: we'd also need to verify this property for every future version, extending the driver versions supported by nvproxy to "every version within some min/max bounds", which seems like potentially quite a lot of dev burden.

EtiennePerot commented 1 month ago

> we'd also need to verify this property for every future version, extending the driver versions supported by nvproxy to "every version within some min/max bounds", which seems like potentially quite a lot of dev burden.

True, but I don't think this is a bad thing to do even without this feature. If, between versions A and B, an NVIDIA struct is changed and then changed back, chances are that the meaning of the fields in that struct has also changed between A and B, even though the fields are back to being identical. At least, that probability is much higher than in the case where the struct didn't change at all within the version range. So it'd be a good time to look at that struct again.