Core dump when CUDA_VISIBLE_DEVICES set to 0.

jrounds commented 1 year ago

Hi,

The host runs Red Hat and has two NVIDIA GPUs.

I built an environment like this, and everything works (CPU):

conda create -y --name py38_drjit_test python==3.8
conda activate py38_drjit_test
python3 -m pip install --upgrade pip
python3 -m pip install drjit
python3 -c "import drjit; print(drjit.__version__)"  #outputs 0.4.3

Add in tensorflow/cuda according to published instructions:

# Installing tensorflow based on latest instructions
conda install -y -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.13.0 
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$CUDNN_PATH/lib:$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

Still good; this all works as expected:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # two devices
python3 -c "import drjit; print(drjit.__version__)" #outputs 0.4.3

Then I started experimenting with CUDA_VISIBLE_DEVICES (I actually started with the last case, but I am showing the ones that work first).

This works (2 GPUs)

export CUDA_VISIBLE_DEVICES=0,1
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # two devices
python3 -c "import drjit; print(drjit.__version__)" #outputs 0.4.3

This works (Last GPU)

export CUDA_VISIBLE_DEVICES=1
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # one device
python3 -c "import drjit; print(drjit.__version__)"

This core dumps (repeatable in any pattern of these combinations)

export CUDA_VISIBLE_DEVICES=0
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # one device
python3 -c "import drjit; print(drjit.__version__)"

Actual output

(py38_drjit_test) [host]$ export CUDA_VISIBLE_DEVICES=0
(py38_drjit_test) [host]$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # one device

2023-08-30 11:00:34.838109: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-08-30 11:00:34.880618: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-30 11:00:35.631433: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
(py38_drjit_test) [host]$ python3 -c "import drjit; print(drjit.__version__)"
Segmentation fault (core dumped)

So setting CUDA_VISIBLE_DEVICES to the first of the two GPUs results in a core dump on import of drjit, but setting it to the last one does not have that effect?
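
To make the comparison between device masks repeatable, the import can be run in a fresh child process per setting, so that a crash in one case does not take down the check itself. This is only a minimal sketch and was not part of the original test; the helper names `probe_import` and `describe` are made up for illustration. It relies on the POSIX behaviour of `subprocess`, where a negative return code -N means the child was terminated by signal N.

```python
import os
import signal
import subprocess
import sys


def probe_import(module, visible_devices, timeout=120):
    """Import `module` in a child interpreter with CUDA_VISIBLE_DEVICES set.

    Returns the child's return code. On POSIX a negative value -N means the
    child was terminated by signal N (e.g. -11 for a segmentation fault).
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = visible_devices
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return result.returncode


def describe(returncode):
    """Turn a return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exit code {returncode}"


if __name__ == "__main__":
    # Illustrative masks matching the three cases described in this report.
    for mask in ("0,1", "1", "0"):
        status = describe(probe_import("drjit", mask))
        print(f"CUDA_VISIBLE_DEVICES={mask}: {status}")
```

Running each case in its own process also rules out state carried over from a previous import in the same shell session.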

Any advice on what to consider to work through this?

merlinND commented 1 year ago

Hello @jrounds,

What are the models of the two GPUs, is there anything different between them? What is your NVIDIA driver version?
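
One way to collect that information in a single step is to query `nvidia-smi` for the index, model, driver version, and UUID of each GPU. The snippet below is a rough sketch under the assumption that `nvidia-smi` is on the PATH; the helper names `parse_gpu_csv` and `list_gpus` are hypothetical, and the parser assumes GPU names contain no commas.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ("index", "name", "driver_version", "uuid")


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumes no field (e.g. the GPU name) contains a comma itself.
        values = [value.strip() for value in line.split(",")]
        gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def list_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={','.join(QUERY_FIELDS)}",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    for gpu in list_gpus():
        print(gpu)
```

Note that the index reported by `nvidia-smi` follows PCI bus order, which is not guaranteed to match the order CUDA uses for CUDA_VISIBLE_DEVICES unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is also set.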

I routinely select GPU 0 or 1 using CUDA_VISIBLE_DEVICES and haven't encountered this crash.

Just checking, does the presence of Tensorflow in the Conda env influence the crash? E.g. if you just create a Python virtualenv and pip install drjit, can you reproduce the crash?
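
For reference, a clean environment of that kind could be set up roughly as follows. This is only a sketch using the standard-library `venv` module; the helper names and the example directory are placeholders, and `build_clean_env` needs network access for the `pip install` step.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(env_dir):
    """Return the path of the interpreter inside a venv rooted at env_dir."""
    if sys.platform == "win32":
        return Path(env_dir) / "Scripts" / "python.exe"
    return Path(env_dir) / "bin" / "python"


def build_clean_env(env_dir, packages=("drjit",)):
    """Create a fresh venv and pip-install only the given packages into it."""
    venv.create(env_dir, with_pip=True, clear=True)
    python = venv_python(env_dir)
    subprocess.run([str(python), "-m", "pip", "install", *packages], check=True)
    return python


def import_returncode(python, module="drjit"):
    """Import `module` with the given interpreter and return its exit status."""
    return subprocess.run([str(python), "-c", f"import {module}"]).returncode
```

With CUDA_VISIBLE_DEVICES=0 exported in the shell, calling `import_returncode(build_clean_env("/tmp/drjit_only"))` would show whether the crash still occurs without TensorFlow and the Conda CUDA libraries in the environment.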

jrounds commented 1 year ago

I think we started speculating that there may be something on GPU 0 that matters, but I didn't investigate that. A core dump is a less than ideal message, but it may be user error. I don't see what it could be, though. Certainly nvidia-smi is not giving a clue.

GPUs: 2x Quadro RTX 6000
Driver: 535.104.05

I am going to close this because I have an effective workaround and I am not convinced it isn't our machine.

merlinND commented 1 year ago

If you feel like this is a problem with DrJit, the error message would likely be much more informative using a debug build (https://drjit.readthedocs.io/en/latest/firststeps-cpp.html#first-steps-in-c).
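
Short of a debug build, Python's built-in `faulthandler` module can at least show which Python-level line was executing when the process crashed. The sketch below is an illustration, not part of Dr.Jit's tooling; the helper name `run_with_faulthandler` is hypothetical. It only reports the Python traceback, so the native stack still requires a debug build or a debugger.

```python
import subprocess
import sys


def run_with_faulthandler(code, timeout=120):
    """Run `code` in a child interpreter started with -X faulthandler.

    Returns (returncode, stderr). If the child dies from a fatal signal such
    as SIGSEGV, stderr contains the Python traceback dumped by faulthandler.
    """
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    returncode, stderr = run_with_faulthandler("import drjit")
    print("return code:", returncode)
    print(stderr)
```

The same effect is available without a helper by passing `-X faulthandler` to `python3` directly or by setting the `PYTHONFAULTHANDLER` environment variable.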

jrounds commented 1 year ago

We ended up fixing it. Actually not sure what we did. Forgot to ask. Involved a reboot. It was on our side.

Thanks for your time.

merlinND commented 1 year ago

Glad to hear it! Thank you for reporting back.

mitsuba-renderer / drjit

Core dump when CUDA_VISIBLE_DEVICES set to 0. #185