iot-salzburg / gpu-jupyter

GPU-Jupyter: Leverage the flexibility of Jupyterlab through the power of your NVIDIA GPU to run your code from Tensorflow and Pytorch in collaborative notebooks on the GPU.
Apache License 2.0

Struggling to switch users and maintain full cuda support #108

Closed njacobson-nci closed 1 year ago

njacobson-nci commented 1 year ago

I'm trying to run this stack for a few different users and want to be able to switch the username of the notebook user when I stand up the image.

When I do this, the spawned terminals/notebooks under the switched user aren't correctly sourcing the jovyan bashrc and running bitsandbytes fails to find libcudart.so.

The command I'm using (the most basic version, to eliminate any variables in my custom install and deployment):

docker run --gpus all -it -p 8848:8888 --user root -e NB_USER="njacobson" -e CHOWN_HOME=yes -w "/home/njacobson" cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04_python-only

I attach to the container as root and run:

mamba install cudatoolkit -y
python -m pip install bitsandbytes

Running python -m bitsandbytes then fails to initialize and can't find libcudart.so.

If I source /home/jovyan/.bashrc first, python -m bitsandbytes works.

Running a jupyter notebook via jupyterlab and importing bitsandbytes also fails. As does a terminal spawned via jupyterlab unless I source the jovyan .bashrc.

I've tried copying the jovyan .bashrc into my home and chowning it. This fixes new terminals, but new notebooks still won't properly import bitsandbytes.

nvidia-smi, nvcc, and torch.cuda.is_available() all work in notebooks.
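To narrow down a symptom like this, it can help to check which directories actually contain libcudart. The following is a minimal sketch, not part of the original report: find_libcudart is a hypothetical helper, and the two extra candidate directories are assumptions based on paths mentioned elsewhere in this thread.

```python
import glob
import os
from typing import List


def find_libcudart(search_path: str) -> List[str]:
    """Return every libcudart.so* file found in the directories of search_path."""
    matches: List[str] = []
    for directory in search_path.split(os.pathsep):
        if not directory:  # skip empty entries such as a trailing ':'
            continue
        pattern = os.path.join(directory, "libcudart.so*")
        matches.extend(sorted(glob.glob(pattern)))
    return matches


if __name__ == "__main__":
    # Candidate directories are assumptions; adjust them to the image in use.
    candidates = os.pathsep.join(
        [
            os.environ.get("LD_LIBRARY_PATH", ""),
            "/usr/local/cuda/lib64",
            "/opt/conda/lib",
        ]
    )
    for path in find_libcudart(candidates):
        print(path)
```

Running it once in a working terminal and once in a failing notebook shows whether the two environments differ in what they can see.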

Not sure if this belongs here or with docker-stacks, but figured I'd start here.

Thanks!

benz0li commented 1 year ago

@njacobson-nci The problem is that LD_LIBRARY_PATH is not preserved, which is essential for CUDA images.

And the default LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 must be set/extended to LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64 beforehand.
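As an illustration of the extension described above, here is a small sketch that appends directories to a colon-separated search path. extend_library_path is a hypothetical helper, not something from the image or from start.sh. It also avoids the leading ':' that a plain $LD_LIBRARY_PATH:... produces when the variable is empty, which the dynamic linker treats as the current directory.

```python
from typing import Iterable


def extend_library_path(current: str, extra_dirs: Iterable[str]) -> str:
    """Append extra_dirs to a colon-separated path.

    Empty entries are dropped, and an extra directory is skipped if it is
    already present.
    """
    entries = [entry for entry in current.split(":") if entry]
    for directory in extra_dirs:
        if directory and directory not in entries:
            entries.append(directory)
    return ":".join(entries)


# Default value and CUDA directories as quoted in the comment above.
default = "/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
extended = extend_library_path(
    default, ["/usr/local/cuda/lib", "/usr/local/cuda/lib64"]
)
print(extended)
```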

FYI @mathbunnyru

mathbunnyru commented 1 year ago

@benz0li as far as I understand, in docker-stacks images we do not rely on LD_LIBRARY_PATH. And --preserve-env doesn't preserve this environment variable. I'm not sure we should explicitly preserve it.

https://www.sudo.ws/docs/man/sudoers.man/

> The dynamic linker on most operating systems will remove variables that can control dynamic linking from the environment of set-user-ID executables, including sudo.

benz0li commented 1 year ago

> I'm not sure we should explicitly preserve it.

@mathbunnyru For the jupyter/docker-stacks you are not supposed to.

benz0li commented 1 year ago

But if someone builds the jupyter/docker-stacks on top of nvidia/cuda images, LD_LIBRARY_PATH must be preserved – i.e. start.sh modified accordingly.
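The underlying point is that a child process only sees the variables its parent actually hands over. The sketch below illustrates just that, using Python's subprocess module. It does not reproduce sudo or start.sh, and child_sees plus the example value are hypothetical.

```python
import os
import subprocess
import sys
from typing import Dict


def child_sees(variable: str, env: Dict[str, str]) -> str:
    """Start a fresh Python process with exactly `env` and report `variable` as the child sees it."""
    code = f"import os; print(os.environ.get({variable!r}, '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    name = "LD_LIBRARY_PATH"
    value = "/usr/local/cuda/lib64"  # illustrative value only
    # Environment with the variable dropped, as happens when it is not preserved.
    dropped = {k: v for k, v in os.environ.items() if k != name}
    # Environment with the variable passed on explicitly.
    passed = dict(dropped, **{name: value})
    print("dropped:", child_sees(name, dropped))
    print("passed: ", child_sees(name, passed))
```

Whatever launches the notebook server for the switched user has to pass the variable on explicitly in the same way, otherwise kernels start without it.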

mathbunnyru commented 1 year ago

Thanks @benz0li. It makes sense to me 👍

njacobson-nci commented 1 year ago

Setting LD_LIBRARY_PATH in the jupyter notebook does resolve the issue in notebooks.

It doesn't appear to be required in a terminal after sourcing the jovyan bashrc, and it's not set in that terminal session either.

Changing the start.sh to not check for an existing /home/{$NB_USER} does allow the script to copy over the jovyan directory correctly and then new terminals are properly set up, but it doesn't fix notebooks. I'll try preserving LD_LIBRARY_PATH and see how that does.

The notebook workaround:

import os
# Must be set before bitsandbytes is imported.
os.environ["LD_LIBRARY_PATH"] = "/opt/conda/lib/:/usr/local/cuda/lib64/lib"
import bitsandbytes

njacobson-nci commented 1 year ago

@benz0li Applying the fix you provided does copy LD_LIBRARY_PATH over to the environment of the Jupyter notebooks, but libcudart.so is still not found.

'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64',

This path doesn't exist in this image or any of the other ones I've used recently. It is the default LD_LIBRARY_PATH on the 11.6.2-cudnn-runtime base image, but those folders don't exist on that image either.
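A quick way to confirm an observation like this is to list each LD_LIBRARY_PATH entry together with whether it exists on disk. This is a minimal sketch; audit_library_path is a hypothetical helper name.

```python
import os
from typing import Dict


def audit_library_path(value: str) -> Dict[str, bool]:
    """Map each non-empty entry of a colon-separated path to whether it is an existing directory."""
    return {entry: os.path.isdir(entry) for entry in value.split(":") if entry}


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for entry, exists in audit_library_path(current).items():
        status = "ok     " if exists else "missing"
        print(f"{status}  {entry}")
```

Entries reported as missing are harmless to the linker, but they also cannot be where libcudart.so is found.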

benz0li commented 1 year ago

> This path doesn't exist in this image or any of the other ones I've used recently. It is the default LD_LIBRARY_PATH on the 11.6.2-cudnn-runtime base image, but those folders don't exist on that image either.

That is correct. See also https://github.com/jupyter/docker-stacks/issues/1792#issuecomment-1487927643.
ℹ️ These paths /usr/local/nvidia/lib:/usr/local/nvidia/lib64 are kept for legacy reasons.

> @benz0li Applying the fix you provided does copy LD_LIBRARY_PATH over to the environment of the Jupyter notebooks, but libcudart.so is still not found.
>
> 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64',

You need to set/extend the path to LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64 beforehand.

njacobson-nci commented 1 year ago

Updating start.sh to set LD_LIBRARY_PATH as you called out here still has issues, but that might be a bitsandbytes thing:

'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64',

Appending /opt/conda/lib/ does let notebooks find libcudart.so.

This is what I added to start.sh to fix it for now:

LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64:/opt/conda/lib/" \
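After a change like this, one way to verify the result from inside a fresh notebook kernel is to ask the dynamic linker to load the library by name. This is a hedged sketch: can_load is a hypothetical helper, and the soname libcudart.so.11.0 is an assumption for CUDA 11.x. It tests the loader's search path as it was when the kernel started, which is not necessarily the same lookup bitsandbytes performs.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic linker can load library_name in this process."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Assumed soname for CUDA 11.x; adjust it to the installed toolkit version.
    print("libcudart loadable:", can_load("libcudart.so.11.0"))
```

On glibc-based systems the loader reads LD_LIBRARY_PATH when the process starts, so this check reflects the environment the kernel was launched with rather than later changes to os.environ.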

It still seems like there is a deeper issue with switching users in this manner, and I wonder if there are further bugs that will be experienced when applying this fix.

In a notebook, if I run "!ll", the alias is not found despite being defined in the jovyan/njacobson .bashrc. Is that expected?

benz0li commented 1 year ago

@njacobson-nci I can't help you any further as my images only use Python – and don't have Conda / Mamba installed.

njacobson-nci commented 1 year ago

Understood, I appreciate the help very much!

benz0li commented 1 year ago

P.S.: You can always install Conda / Mamba on user level in my images.