microsoft / planetary-computer-containers

Container definitions for the Planetary Computer
MIT License

Unable to start CUDA Context #40

Open vwmaus opened 2 years ago

vwmaus commented 2 years ago

I started a clean GPU PyTorch instance and tried to run the tutorial landcover.ipynb, but it failed at:

cluster = LocalCUDACluster(threads_per_worker=4)
2022-06-14 11:31:42,398 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Unable to start CUDA Context
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/pynvml/nvml.py", line 782, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/srv/conda/envs/notebook/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/srv/conda/envs/notebook/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_cuda/initialize.py", line 41, in _create_cuda_context
    ctx = has_cuda_context()
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 120, in has_cuda_context
    running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/pynvml/nvml.py", line 2191, in nvmlDeviceGetComputeRunningProcesses_v2
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/pynvml/nvml.py", line 785, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

I was not sure where to report this issue.
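
For reference, the failing step can be reproduced outside the notebook with a few lines. This is only a sketch of the cluster-creation step, assuming dask-cuda and distributed as shipped in the gpu-pytorch environment, not the exact notebook code:

from dask_cuda import LocalCUDACluster
from distributed import Client

# Starting the cluster is where the traceback above is printed to stderr.
cluster = LocalCUDACluster(threads_per_worker=4)
client = Client(cluster)
print(client)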

TomAugspurger commented 2 years ago

Thanks for reporting, taking a look now.

TomAugspurger commented 2 years ago

A couple of things:

  1. The LocalCUDACluster does get created successfully. The text you posted is a warning printed to stderr by one of distributed, dask-cuda, or pynvml.
  2. I think the GPU is still being used. nvidia-smi doesn't show any processes using the GPU, but I do see the GPU memory increasing, and we're explicitly using things like torch.device("cuda"), which should fail if the GPU weren't being used (I think). A quick check is sketched below.
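
A minimal check that the GPU is actually usable from the notebook despite the NVML warning (a sketch, assuming torch from the gpu-pytorch environment):

import torch

print(torch.cuda.is_available())           # should print True
x = torch.ones(1024, 1024, device="cuda")  # allocate a tensor on the GPU
print(x.sum().item())                      # 1048576.0 if the kernel ran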

So most likely there's an issue with our environment at https://github.com/microsoft/planetary-computer-containers/tree/main/gpu-pytorch. We're pulling dask-cuda from PyPI currently, but pynvml from conda-forge (https://github.com/microsoft/planetary-computer-containers/blob/535748775b971f2fe406c99a1714dd3d663f1db8/gpu-pytorch/conda-linux-64.lock#L250). I'm not sure exactly what is going wrong.
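
If it helps narrow this down, one thing to compare is the pynvml package version against the driver-side NVML library it loads. A sketch using standard pynvml calls; a mismatch here would explain the NVMLError_FunctionNotFound above:

from importlib.metadata import version
import pynvml

print("pynvml package:", version("pynvml"))
pynvml.nvmlInit()
# driver version string (returned as bytes in some pynvml releases)
print("driver reported by NVML:", pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()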

vwmaus commented 2 years ago

Thanks @TomAugspurger, it seems to be a version incompatibility. rapidsai/dask-cuda#909 could be of help; they also report AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2
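
That undefined-symbol error suggests the host's libnvidia-ml.so.1 predates the _v2 entry point that the installed pynvml calls. A small diagnostic sketch (plain ctypes, mirroring what pynvml does internally) to confirm which entry points the library actually exposes:

import ctypes

nvml = ctypes.CDLL("libnvidia-ml.so.1")
for sym in ("nvmlDeviceGetComputeRunningProcesses",
            "nvmlDeviceGetComputeRunningProcesses_v2"):
    # ctypes raises AttributeError for missing symbols, so hasattr works here
    print(sym, "present" if hasattr(nvml, sym) else "missing")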