rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

Conda doctor plugin to check driver/CUDA/cudatoolkit matches #311

Open jacobtomlinson opened 8 months ago

jacobtomlinson commented 8 months ago

Last year the conda doctor command was released, which allows you to run checks on your conda environment.

The default checks that come with conda go through all of your installed packages and verify the files against each package manifest to ensure nothing on the local filesystem has been corrupted.

However, it looks like there is a way to write plugins for conda doctor to run your own arbitrary checks. It would be neat if we wrote a plugin to add some of our own checks to ensure the conda environment is set up correctly for NVIDIA/RAPIDS.

The first check we write could compare the installed NVIDIA driver version against the CUDA version and cudatoolkit version and ensure that everything is compatible. It's quite common for users to pull a container image that doesn't match the driver on the host system and run into hard-to-debug errors as a result.
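As a rough sketch of what that first check might look like, the snippet below compares a driver version string against a minimum driver per CUDA major version. The function names and the lookup table are hypothetical, and the minimum versions listed are illustrative Linux values that should be verified against NVIDIA's CUDA compatibility documentation before relying on them.

```python
import subprocess

# ASSUMPTION: illustrative minimum Linux driver versions per CUDA major
# version. Verify against NVIDIA's CUDA compatibility docs before use.
MIN_DRIVER_FOR_CUDA_MAJOR = {
    11: (450, 80, 2),
    12: (525, 60, 13),
}


def parse_version(text):
    """Turn a dotted version such as '535.104.05' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports_cuda(driver_version, cuda_version):
    """Return True or False, or None if the CUDA major version is not in the table."""
    cuda_major = parse_version(cuda_version)[0]
    minimum = MIN_DRIVER_FOR_CUDA_MAJOR.get(cuda_major)
    if minimum is None:
        return None
    return parse_version(driver_version) >= minimum


def detect_driver_version():
    """Ask nvidia-smi for the driver version; return None if it is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None
```

Keeping the comparison separate from the nvidia-smi call means the compatibility logic can be tested on machines without a GPU, and a missing driver can be reported as its own failure rather than as a version mismatch.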

It would be awesome if we could run conda doctor in our container to verify everything is ok.

We need to dig more into how to package conda doctor plugins, but I assume it uses an entry point or similar to register custom checks, so I expect we will need to install the plugin as a Python package.
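For reference, a minimal sketch of how such a plugin might register itself, assuming a conda release recent enough to provide the conda_health_checks plugin hook and the CondaHealthCheck type. The check name "rapids-doctor" and both helper functions are hypothetical, and the check body is only a placeholder. The import is guarded so the module still loads where conda is not importable.

```python
try:
    from conda import plugins  # only importable where conda itself is installed
except ImportError:
    plugins = None


def rapids_check_message(driver_ok):
    """Hypothetical helper: format the line the health check prints."""
    if driver_ok:
        return "NVIDIA driver: OK"
    return "NVIDIA driver: too old for this CUDA version"


def rapids_health_check(prefix, verbose):
    """Hypothetical check body; conda passes the environment prefix and a verbose flag."""
    driver_ok = True  # placeholder: a real check would compare driver and CUDA versions
    print(rapids_check_message(driver_ok))


if plugins is not None:

    @plugins.hookimpl
    def conda_health_checks():
        # Register the check so that conda doctor runs it alongside the defaults.
        yield plugins.CondaHealthCheck(
            name="rapids-doctor",  # hypothetical name
            action=rapids_health_check,
        )
```

If this is right, the module would be exposed through an entry point in the conda group of the package metadata, which matches the expectation that the plugin has to be installed as a Python package.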

A few steps to get started: