Closed by jrounds 1 year ago
Hello @jrounds,
What are the models of the two GPUs, and is there anything different between them? What is your NVIDIA driver version?
I routinely select GPU 0 or 1 using CUDA_VISIBLE_DEVICES and haven't encountered this crash.
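For reference, here is a minimal Python sketch of selecting a GPU this way from inside a script rather than from the shell. The helper name select_gpus is hypothetical; the only real mechanism used is the CUDA_VISIBLE_DEVICES environment variable, which has to be set before anything in the process initializes CUDA.

```python
import os


def select_gpus(*indices):
    """Restrict CUDA to the given physical device indices (hypothetical helper).

    Must run before any library in this process initializes CUDA.
    """
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose only physical GPU 1; inside the process it is then enumerated as device 0.
select_gpus(1)

# Import GPU libraries only after the variable is set, for example:
# import drjit as dr
```

Setting the variable in the shell before launching Python has the same effect and avoids any dependence on import order.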
Just checking: does the presence of TensorFlow in the Conda env influence the crash?
E.g. if you just create a Python virtualenv and pip install drjit, can you reproduce the crash?
I think we started speculating that something on GPU 0 may matter, but I didn't investigate that. A core dump is a less-than-ideal error message, but it may be user error; I don't see what it could be, though. nvidia-smi certainly isn't giving a clue.
GPUs: 2x Quadro RTX 6000
Driver: 535.104.05
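For anyone collecting the same information, here is a rough sketch that gathers GPU model and driver version in one step using nvidia-smi's --query-gpu option. The function names are hypothetical, and the sample string is only an illustration built from the values reported above, not captured output.

```python
import subprocess

# nvidia-smi query for per-GPU index, model name, and driver version.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,driver_version",
    "--format=csv,noheader",
]


def parse_gpu_csv(text):
    """Parse 'index, name, driver' lines into a list of dicts."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumes the model name itself contains no comma.
        index, name, driver = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "driver": driver})
    return gpus


def query_gpus():
    """Run nvidia-smi and return the parsed GPU list (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Illustrative input only, assembled from the values reported in this thread.
sample = "0, Quadro RTX 6000, 535.104.05\n1, Quadro RTX 6000, 535.104.05\n"
gpus = parse_gpu_csv(sample)
```

Comparing the entries makes it easy to spot a mismatch between the two cards, which is what the question about differences was getting at.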
I am going to close this because I have an effective workaround and I am not convinced it isn't our machine.
If you feel like this is a problem with DrJit, the error message would likely be much more informative using a debug build (https://drjit.readthedocs.io/en/latest/firststeps-cpp.html#first-steps-in-c).
We ended up fixing it. Actually not sure what we did. Forgot to ask. Involved a reboot. It was on our side.
Thanks for your time.
Glad to hear it! Thank you for reporting back.
Hi,
The host runs Red Hat with two NVIDIA GPUs.
I built an environment like this, and everything works (CPU)
Add in TensorFlow/CUDA according to the published instructions:
Still good; this all works as expected
Started experimenting with CUDA_VISIBLE_DEVICES (I actually started with the last case, but I'm showing the ones that work first)
This works (2 GPUs)
This works (Last GPU)
This core dumps (repeatable in any pattern of these combinations)
Actual output
So setting CUDA_VISIBLE_DEVICES to the first GPU of 2 results in a core dump on import of drjit, but setting it to the last one does not?
Any advice on what to consider to work through this?
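One way to narrow this down is to run the import in a fresh interpreter for each CUDA_VISIBLE_DEVICES value and record how the child process ended. The sketch below does that. The names probe and describe are hypothetical, the default snippet "import drjit" assumes drjit is installed in the chosen interpreter, and the python argument can point at a clean virtualenv's interpreter to rule out the Conda/TensorFlow environment.

```python
import os
import signal
import subprocess
import sys


def describe(returncode):
    """Turn a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exit code {returncode}"


def probe(devices, code="import drjit", python=sys.executable):
    """Run `code` in a fresh interpreter once per CUDA_VISIBLE_DEVICES value."""
    results = {}
    for value in devices:
        # Copy the current environment and override only the device mask.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=value)
        proc = subprocess.run(
            [python, "-c", code], env=env, capture_output=True, text=True
        )
        results[value] = describe(proc.returncode)
    return results
```

Calling probe(["0,1", "1", "0"]) would cover the three cases described above. A status such as "killed by SIGSEGV" or "killed by SIGABRT" for a single value confirms that the crash depends on the device mask rather than on the rest of the script.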