gpuopenanalytics / pynvml

Provide Python access to the NVML library for GPU diagnostics
BSD 3-Clause "New" or "Revised" License
212 stars 30 forks source link

Relocate `smi` module #56

Closed rjzamora closed 2 months ago

rjzamora commented 2 months ago

This PR relocates pynvml.smi to a new pynvml_utils module so that pynvml can become a "meta package" for nvidia-ml-py in a future release.

pentschev commented 2 months ago

Do we want to immediately move pynvml.smi to pynvml_utils (breaking change) or initially make it an alias with a warning and make the change in the future? If we want to get a metapackage built soon, perhaps we'll need to make the change immediately, so I'm just raising the question to make sure its impact has been considered.

rjzamora commented 2 months ago

Do we want to immediately move pynvml.smi to pynvml_utils (breaking change) or initially make it an alias with a warning and make the change in the future? If we want to get a metapackage built soon, perhaps we'll need to make the change immediately, so I'm just raising the question to make sure its impact has been considered.

This is definitely a key question, so thanks for asking it explicitly :)

I was thinking that we could cut an 11.5.1 release with the pynvml.smi deprecation warning, and then wait a few weeks to cut a 12.0.0 release with the "breaking" change. This would allow us to make pynvml a "meta package" in time for the rapids-24.10 release cycle.

However, since the 12.555+ bindings are probably needed for backward compatibility, I suppose we could also update the nvml bindings before cutting a (12.5.0) release with the pynvml.smi deprecation warning. This way we wont be "stuck" on 11.4 for the rapids-24.08 release cycle, and libraries using pynvml.smi will (hopefully) see a useful warning.

Either way, it feels a bit "off" to remove pynvml.smi without some kind of deprecation cycle (even if it is extremely accelerated).

rjzamora commented 2 months ago

cc @jacobtomlinson @jakirkham @quasiben

Curious if you guys think this PR takes us in a reasonable direction? The idea is to:

My intention is to cut a release after this is merged so that users will see the pynvml.smi deprecation. We could also include one last update to the nvml bindings before cutting that release, but I would like to shift to a "meta package" soon. Once pynvml is a meta package, it should become much easier for libraries to migrate from pynvml to nvidia-ml-py.

Some remaining questions:

  1. Does this general meta-package plan seem okay?
  2. Should we update the nvml bindings before the deprecation release, or keep them at 11.5?
  3. This plan introduces a new nvml_utils module - Am I handling versioning correctly? (versioning/packaging is not my wheelhouse)
pentschev commented 2 months ago

I was thinking that we could cut an 11.5.1 release with the pynvml.smi deprecation warning, and then wait a few weeks to cut a 12.0.0 release with the "breaking" change. This would allow us to make pynvml a "meta package" in time for the rapids-24.10 release cycle.

That makes sense to me.

However, since the 12.555+ bindings are probably needed for backward compatibility

What I don't know is whether 12.555+ is compatible with CUDA 11.x versions, do you know if it is? We still support CUDA 11.2 and up, thus we must use an NVML version that is compatible with that to a minimum.

I suppose we could also update the nvml bindings before cutting a (12.5.0) release with the pynvml.smi deprecation warning. This way we wont be "stuck" on 11.4 for the rapids-24.08 release cycle, and libraries using pynvml.smi will (hopefully) see a useful warning.

How much work is that? I don't know if it's absolutely necessary to update the bindings or if it would be enough to just (when we're ready) jump to the meta package. Overall, I think we're fine with pynvml=11.4 for RAPIDS 24.08, there haven't been any known problems with it that I'm aware of.

jacobtomlinson commented 2 months ago

This seems like a good idea to me. Just one question about the naming, this PR proposes nvml_utils. Should that be pynvml_utils or even nvutil which would allow us to persue that path in the future without moving things again?

rjzamora commented 2 months ago

What I don't know is whether 12.555+ is compatible with CUDA 11.x versions, do you know if it is? We still support CUDA 11.2 and up, thus we must use an NVML version that is compatible with that to a minimum.

I'm honestly not sure. I think @wence- said the latest nvidia-ml-py release was backward compatible?

How much work is that? I don't know if it's absolutely necessary to update the bindings or if it would be enough to just (when we're ready) jump to the meta package. Overall, I think we're fine with pynvml=11.4 for RAPIDS 24.08, there haven't been any known problems with it that I'm aware of.

It wouldn't be much work at all (just need to copy a file from nvidia-ml-py into pynvml), but I agree that we shouldn't do it if there is a possibility that rapids would still need to pin to 11.4 anyway. The only motivation I have to update the bindings is to remove the 11.4 pin in RAPIDS so pynvml.smi users will actually see the deprecation warning.

this PR proposes nvml_utils. Should that be pynvml_utils or even nvutil which would allow us to persue that path in the future without moving things again?

Good question :)

@pentschev - What do you think?

wence- commented 2 months ago

I'm honestly not sure. I think @wence- said the latest nvidia-ml-py release was backward compatible?

From my reading of the code, it looks like it is, but I don't have a cuda-11.2 driver to check with.

pentschev commented 2 months ago

@rjzamora sorry, just saw the questions about naming now. We discussed in our weekly sync, it seems like, at least for now, it makes more sense to call it pynvml_utils.smi as there's no clear path for the NVUtil work to move here. But I'll defer the final decision to you, I'm fine either way.

rjzamora commented 2 months ago

cc @jakirkham (in case you have any concerns with this direction)

jakirkham commented 1 month ago

Thanks all for working on this! 🙏

Let me try to summarize my understanding and you can correct me if I missed anything:

  1. This moves pynvml.smi to pynvml_utils with the former giving a deprecation warning
  2. A new PyNVML release was made (11.5.3) with this change
  3. Ship this in RAPIDS 24.08 (giving us one release with the deprecation)
  4. Cleanup deprecated usage in RAPIDS (guessing as part of 24.10?)
  5. Update PyNVML to depend on nvidia-ml-py
  6. Make a new PyNVML release (currently RAPIDS uses 12.5.1 so this would be ideal)

Will pause here for feedback

rjzamora commented 1 month ago

Thanks for summarizing @jakirkham !

  1. Ship this in RAPIDS 24.08 (giving us one release with the deprecation)

This was my original plan. However, it is my understanding that dask-cuda will need to jump directly from pynvml<11.5 to pynvml-12.* (once we release it). This is because there is a backward-compatibility problem with the 11.5 bindings (and is still pinned to an earlier release).

I'm very-much open to ideas :)

currently RAPIDS uses 12.5.1 so this would be ideal

Anything using dask-cuda is currently pinned to pynvml<11.5 :/

jakirkham commented 1 month ago

Thanks Rick! 🙏

Yeah think we should just plan on upgrading all of RAPIDS in lock step to the new PyNVML. Was planning to bring this up when we start focusing on RAPIDS 24.10

So a PyNVML 12 release early in the RAPIDS 24.10 development cycle would be very helpful