vwmaus opened this issue 2 years ago
Thanks for reporting, taking a look now.
A couple of things:
- LocalCUDACluster does get created successfully. The text you posted is a warning printed to stderr by one of distributed, dask-cuda, or pynvml.
- nvidia-smi doesn't show any processes using the GPU, but I do see the GPU memory increasing, and we're explicitly using things like torch.device("cuda"), which should fail if the GPU weren't being used (I think).

So most likely there's an issue with our environment at https://github.com/microsoft/planetary-computer-containers/tree/main/gpu-pytorch. We're pulling dask-cuda from PyPI currently, but pynvml from conda-forge (https://github.com/microsoft/planetary-computer-containers/blob/535748775b971f2fe406c99a1714dd3d663f1db8/gpu-pytorch/conda-linux-64.lock#L250). I'm not sure exactly what is going wrong.
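For what it's worth, here is a minimal sketch of the kind of sanity check described above (assuming torch, dask-cuda, and distributed are all importable in that image; none of this comes from the notebook itself): start the cluster, then force an allocation on the GPU so a missing or broken CUDA setup fails loudly rather than silently.

```python
# Sketch only: verify the GPU is really usable even though pynvml prints a warning.
import torch
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()  # may still emit the pynvml warning on stderr
    client = Client(cluster)

    # Allocating a tensor on the GPU raises immediately if CUDA isn't available.
    x = torch.ones((1024, 1024), device=torch.device("cuda"))
    print("CUDA available:", torch.cuda.is_available())
    print("GPU memory allocated (bytes):", torch.cuda.memory_allocated())

    client.close()
    cluster.close()
```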
Thanks @TomAugspurger, it seems to be a version incompatibility. This issue, rapidsai/dask-cuda/issues/909, could be of help. They also report AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2
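A hedged sketch (assuming the pynvml package is installed and at least one GPU is visible) for reproducing the mismatch reported there: newer pynvml releases call the nvmlDeviceGetComputeRunningProcesses_v2 entry point, which older NVIDIA drivers don't export, so listing compute processes raises instead of returning.

```python
# Diagnostic sketch: print driver and pynvml versions, then try the call that
# triggers the undefined-symbol error on mismatched driver/pynvml combinations.
import importlib.metadata
import pynvml

pynvml.nvmlInit()
print("NVIDIA driver:", pynvml.nvmlSystemGetDriverVersion())
print("pynvml package:", importlib.metadata.version("pynvml"))

handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    # On a too-old driver this lookup fails with the error quoted above.
    print("compute processes:", pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
except Exception as exc:
    print("pynvml/driver mismatch:", exc)
finally:
    pynvml.nvmlShutdown()
```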
I started a clean GPU PyTorch instance where I tried to run the tutorial landcover.ipynb, but it failed in:

I was not sure where to report this issue.