Closed njacobson-nci closed 1 year ago
@njacobson-nci The problem is that LD_LIBRARY_PATH is not preserved, which is essential for CUDA images.
And the default LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 must be set/extended to LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64 ^1 beforehand.
FYI @mathbunnyru
@benz0li As far as I understand, in docker-stacks images we do not rely on LD_LIBRARY_PATH.
And --preserve-env doesn't preserve this environment variable.
I'm not sure we should explicitly preserve it.
https://www.sudo.ws/docs/man/sudoers.man/
The dynamic linker on most operating systems will remove variables that can control dynamic linking from the environment of set-user-ID executables, including sudo.
@mathbunnyru For the jupyter/docker-stacks images you are not supposed to.
But if someone builds the jupyter/docker-stacks on top of nvidia/cuda images, LD_LIBRARY_PATH must be preserved, i.e. start.sh modified accordingly.
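For illustration only, here is a minimal Python sketch of the idea behind such a modification: passing the variable explicitly as a VAR=value argument on the sudo command line, which the sudo manual describes as subject to the security policy. The function name build_sudo_command, the user name, and the "jupyter lab" command are hypothetical; the actual invocation in start.sh may differ.

```python
def build_sudo_command(user, command, env):
    """Build a sudo argv that passes selected variables as VAR=value arguments."""
    argv = ["sudo", "--preserve-env", "--set-home", "--user", user]
    # Variables given as VAR=value on the sudo command line are handed to the
    # command explicitly (subject to the sudoers policy), instead of relying
    # on the inherited environment, where LD_LIBRARY_PATH may be dropped.
    for name, value in env.items():
        argv.append(f"{name}={value}")
    argv.extend(command)
    return argv


if __name__ == "__main__":
    # Hypothetical user and command, for illustration only.
    print(build_sudo_command(
        "njacobson",
        ["jupyter", "lab"],
        {"LD_LIBRARY_PATH": "/usr/local/nvidia/lib:/usr/local/nvidia/lib64"},
    ))
```

The sketch only assembles the argument list; whether sudo accepts the variable depends on the sudoers configuration inside the image.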
Thanks @benz0li. It makes sense to me 👍
Setting LD_LIBRARY_PATH in the jupyter notebook does resolve the issue in notebooks.
It doesn't appear to be required in a terminal after sourcing the jovyan bashrc, and it's not set in that terminal session either.
Changing start.sh to not check for an existing /home/{$NB_USER} does allow the script to copy over the jovyan directory correctly, and new terminals are then properly set up, but it doesn't fix notebooks. I'll try preserving LD_LIBRARY_PATH and see how that does.
import os
import bitsandbytes
@benz0li Applying the fix you provided does copy LD_LIBRARY_PATH over to the environment of the Jupyter notebooks, but libcudart.so is still not found.
'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64',
This path doesn't exist in this image or any of the other ones I've used recently. It is the default LD_LIBRARY_PATH on the 11.6.2-cudnn-runtime base image, but those folders don't exist on that image either.
That is correct. See also https://github.com/jupyter/docker-stacks/issues/1792#issuecomment-1487927643.
ℹ️ These paths /usr/local/nvidia/lib:/usr/local/nvidia/lib64 are kept for legacy reasons.
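To check which LD_LIBRARY_PATH entries actually exist in a given image and which of them contain libcudart.so, something like the following could be run in a notebook or terminal. This is a rough diagnostic sketch; the function name inspect_library_path is made up for this example, and empty path entries are simply skipped.

```python
import glob
import os


def inspect_library_path(value, pattern="libcudart.so*"):
    """Return (directory, exists, matches) for each entry of a colon-separated path."""
    report = []
    for entry in value.split(":"):
        if not entry:
            continue  # empty entries are skipped in this sketch
        exists = os.path.isdir(entry)
        # Only look for the library in directories that are really there.
        matches = sorted(glob.glob(os.path.join(entry, pattern))) if exists else []
        report.append((entry, exists, matches))
    return report


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for directory, exists, matches in inspect_library_path(current):
        print(directory, "exists" if exists else "missing", matches)
```

An entry reported as "missing" is a directory that is on the path but not present in the image, which is the situation described above for /usr/local/nvidia/lib and /usr/local/nvidia/lib64.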
You need to set/extend the path to LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64 ^1 beforehand.
Updating start.sh to set LD_LIBRARY_PATH as you called out here still has issues, but that might be a bitsandbytes thing:
'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64',
Appending /opt/conda/lib/ does let notebooks find libcudart.so.
This is what I added to start.sh to fix it now:
LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64:/opt/conda/lib/" \
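The same path extension can be sketched in Python, for example to set the variable inside a notebook before importing bitsandbytes, which was reported earlier in this thread to resolve the issue there. The helper name extend_library_path is hypothetical, and whether changing the variable inside an already running kernel helps other libraries is not something this thread confirms.

```python
import os


def extend_library_path(current, extra_dirs):
    """Append directories to a colon-separated path, skipping empties and duplicates."""
    entries = [entry for entry in current.split(":") if entry]
    for directory in extra_dirs:
        if directory not in entries:
            entries.append(directory)
    return ":".join(entries)


if __name__ == "__main__":
    # Directories taken from the start.sh line quoted above.
    extra = ["/usr/local/cuda/lib", "/usr/local/cuda/lib64", "/opt/conda/lib/"]
    os.environ["LD_LIBRARY_PATH"] = extend_library_path(
        os.environ.get("LD_LIBRARY_PATH", ""), extra
    )
    print(os.environ["LD_LIBRARY_PATH"])
```

Unlike a plain "$LD_LIBRARY_PATH:..." expansion, this avoids a leading colon when the variable is unset; an empty entry is interpreted by the dynamic linker as the current working directory.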
It still seems like there is a deeper issue with switching users in this manner, and I wonder if there are further bugs that will be experienced when applying this fix.
In a notebook, if I run " ! ll " this alias is not found despite being defined in the jovyan/njacobson .bashrc, is that expected?
Understood, I appreciate the help very much!
P.S.: You can always install Conda / Mamba on user level in my images.
I'm trying to run this stack for a few different users and want to be able to switch the username of the notebook user when I stand up the image.
When I do this, the spawned terminals/notebooks under the switched user aren't correctly sourcing the jovyan bashrc and running bitsandbytes fails to find libcudart.so.
The command I'm using (most basic version, to eliminate any variables in my custom install and deployment):
docker run --gpus all -it -p 8848:8888 --user root -e NB_USER="njacobson" -e CHOWN_HOME=yes -w "/home/njacobson" cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04_python-only
I attach to the container as root and run:
mamba install cudatoolkit -y
python -m pip install bitsandbytes
python -m bitsandbytes
This fails to init and can't find libcudart.so.
If I source /home/jovyan/.bashrc, python -m bitsandbytes works.
Running a jupyter notebook via jupyterlab and importing bitsandbytes also fails. As does a terminal spawned via jupyterlab unless I source the jovyan .bashrc.
I've tried copying the jovyan .bashrc into my home and chowning it. This fixes new terminals, but new notebooks still won't properly import bitsandbytes.
nvidia-smi, nvcc, and torch.cuda.is_available() all work in notebooks.
Not sure if this belongs here or with docker-stacks, but figured I'd start here.
Thanks!