singularityhub / singularityhub.github.io

Container tools for scientific computing! Docs at https://singularityhub.github.io/singularityhub-docs
https://singularityhub.github.io

Anyone have cuda 10 working in a singularity container? #183

Closed by infermachine 5 years ago

infermachine commented 5 years ago

I have a few NVIDIA 2080 Ti GPUs on a node in our cluster, and I want to install CUDA 10 in a Singularity container so I can work with them (primarily with PyTorch). When I use this Singularity base image:

From: nvidia/cuda:10.0-devel-ubuntu18.04

...the container seems to build fine, but when I try to singularity run/exec/shell the image I get this error:

/.singularity.d/actions/run: 7: /.singularity.d/env/10-docker.sh: cannot open 385: No such file

I also tried building from the bare-bones Ubuntu image and installing CUDA manually, i.e.:

Bootstrap: docker
From: ubuntu
%post
apt-get update && apt-get -y install python3.7 git wget graphviz python3-venv python3.7-venv
apt-get install -y build-essential dkms
apt-get install -y freeglut3 freeglut3-dev libxi-dev libxmu-dev
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.168-1_amd64.deb
dpkg -i cuda-repo-ubuntu1804_10.1.168-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt-get update
apt-get install -y cuda

But once I'm inside the running container, nvidia-smi fails with the following error:

NVRM: API mismatch: the client has the version 430.26, but
NVRM: this kernel module has the version 410.93

I think the node has the NVIDIA kernel module 410.93 installed and this is interfering somehow? Does anyone have any ideas on how to fix this?
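A small sketch of how the two versions in that message can be pulled apart and compared. The function name `nvrm_versions` is made up for illustration. One possible reading, not confirmed anywhere in this thread: the `cuda` metapackage installs NVIDIA's driver packages along with the toolkit, so the image may carry 430.26 userspace libraries while the host kernel module is 410.93.

```shell
# Pull the version numbers out of an "NVRM: API mismatch" message on stdin.
# Prints the client (userspace) version first, then the kernel module version.
# nvrm_versions is a hypothetical helper name, not part of any NVIDIA tool.
nvrm_versions() {
    grep -oE 'version [0-9]+(\.[0-9]+)+' | awk '{print $2}'
}

# The error message quoted above, used here as sample input.
msg='NVRM: API mismatch: the client has the version 430.26, but
NVRM: this kernel module has the version 410.93'

client=$(printf '%s\n' "$msg" | nvrm_versions | sed -n '1p')
kernel=$(printf '%s\n' "$msg" | nvrm_versions | sed -n '2p')

if [ "$client" != "$kernel" ]; then
    echo "userspace driver $client does not match host kernel module $kernel"
fi
```

The "client" here is the userspace side (the libraries nvidia-smi loads), and the kernel module always comes from the host, since a container shares the host kernel.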

Thanks! DB

vsoch commented 5 years ago

Are you trying to build on Singularity Hub, or just on your host?

vsoch commented 5 years ago

To me this looks like an issue with the Singularity software itself, so you'll want to report it at https://github.com/sylabs/singularity/issues. My .02 - you'll need to install the exact same libraries in the container that you have on the host, and don't forget to use the --nv flag! Other than that, you should open the issue on the board linked. Good luck!
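A rough sketch of what checking the host driver and then using the --nv flag could look like. The image name `cuda10.sif` and the helper `host_driver_version` are placeholders invented for this example. It assumes the host exposes /proc/driver/nvidia/version, which the NVIDIA kernel module normally does once it is loaded.

```shell
# Print the driver version recorded by the loaded NVIDIA kernel module.
# The optional path argument exists only so the function can be tried on a
# saved copy of the file; by default it reads /proc/driver/nvidia/version.
# host_driver_version is a hypothetical helper name.
host_driver_version() {
    version_file="${1:-/proc/driver/nvidia/version}"
    if [ -r "$version_file" ]; then
        grep -m1 'NVRM version' "$version_file" \
            | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' \
            | head -n1
    else
        echo "unknown"
    fi
}

echo "host kernel module: $(host_driver_version)"

# IMAGE is a hypothetical file name; point it at your own image.
IMAGE="${IMAGE:-cuda10.sif}"
if command -v singularity >/dev/null 2>&1 && [ -f "$IMAGE" ]; then
    # --nv asks Singularity to expose the host's NVIDIA devices and libraries
    # inside the container, so nvidia-smi talks to the host driver.
    singularity exec --nv "$IMAGE" nvidia-smi
fi
```

If the version printed for the host differs from what the container's own NVIDIA libraries expect, that is the same kind of mismatch the NVRM message reports.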

infermachine commented 5 years ago

OK, yeah, I'll post it over there instead. I posted here because this is the only place I could find build file examples with CUDA 10 (e.g. https://singularity-hub.org/containers/6713)... but most of those give the "cannot open 385: No such file" error.

I think you're right that I may need to update the NVIDIA driver on the host kernel or something, because I can't figure out how to install 410.93 on the client.

Thanks!

vsoch commented 5 years ago

Sure, glad to help, good luck!