Hi @gibacic
Since we initially discovered the two issues you referenced, I have run some additional experiments and now firmly believe that we have a severe issue with the new Nvidia drivers.
For now, the recommendation is to hold off on updating the Nvidia GPU driver to v535.
I have not had the time to fully dig into this yet. It might be something quite small, or there might be a lot more to uncover; time will tell. The good news is that this regression is definitely due to the driver update. I believe there are two possibilities:
Thank you for your detailed reproducer and logs. Surprisingly, I cannot replicate them. Both the test_ad_integrators.py tests and the tutorial work perfectly fine on two different machines (both on v535). However, some other tests in test_renders.py do fail with similar error messages.
In order to get the full logs with pytest, you must also pass the -s flag. Could you please provide us with the logs for those failing tests? Maybe we'll find the common denominator.
For now, the recommendation is to hold off on updating the Nvidia GPU driver to v535. The good news is that this regression is definitely due to the driver update.
That's encouraging. I'll try rolling back the drivers to get it working again. I was really enjoying Mitsuba and look forward to getting back into it.
In order to get the full logs with pytest, you must also pass the -s flag. Could you please provide us with the logs for those failing tests? Maybe we'll find the common denominator.
Here are the resulting outputs from the tests in the original post with -s and -sv.
pytest-s.log
pytest-sv.log
pytest-s-integrators-ad.log
pytest-sv-integrators-ad.log
However, some other tests in test_renders.py do fail with similar error messages.
I also get the same error with integrators/tests/test_ad_integrators.py and render/tests/test_ad.py.
pytest-s-renders.log
pytest-s-render-ad.log
But render/tests/test_bsdf.py gives a segmentation fault and illegal (?) characters in the output:
pytest-s-render-bsdf.log
$ pytest -s ../src/render/tests/test_bsdf.py
Running the full test suite. To skip slow tests, please run 'pytest -m "not slow"'
======================================================= test session starts ========================================================
platform linux -- Python 3.11.3, pytest-7.4.0, pluggy-1.0.0
rootdir: /home/goran/mitsuba3
configfile: pyproject.toml
plugins: xdist-3.3.1, anyio-3.7.1
collected 4 items
../src/render/tests/test_bsdf.py ....
========================================================= warnings summary =========================================================
src/render/tests/test_bsdf.py::test01_ctx_construct
/home/goran/mitsuba3/src/render/tests/test_bsdf.py:10: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of -1 to uint32 will fail in the future.
For the old behavior, usually:
np.array(value).astype(dtype)
will give the desired result (the cast overflows).
assert ctx.component == np.uint32(-1)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================================== 4 passed, 1 warning in 0.15s ===================================================
jit_shutdown(): detected variable leaks:
- variable r80 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="", dep=[0, 0, 0, 0])
- variable r82 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="333?rS", dep=[0, 0, 0, 0])
- variable r81 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="��?`):
Number of auxiliary rays to trace when performing the coQ", dep=[0, 0, 0, 0])
Segmentation fault (core dumped)
I'll try some other tests after integrators to see which ones fail and report back. Hopefully a pattern will emerge.
Surprisingly, I cannot replicate them. Both the test_ad_integrators.py tests and the tutorial work perfectly fine on two different machines (both on v535).
Unfortunately, I've only got this one old machine to work with right now.
This has (finally) been identified and fixed in https://github.com/mitsuba-renderer/drjit-core/commit/883e26eb9134d7b8bf4953514c8284365800af3c and propagated to Mitsuba 3's master branch.
Thank you for the detailed logs!
Thanks for continuing to look into this issue. Sorry about not posting more findings; I was on vacation and just got back to work yesterday. Unfortunately, I am still experiencing the same issues despite the updates to Dr.Jit and Mitsuba (fresh build from source). It appears to do the same thing when installed via pip.
System information:
OS: Arch Linux x86_64
Kernel: 6.4.12-arch1-1
CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
GPU: NVIDIA GeForce GTX 1070 Ti
Python: 3.11.3 (main, Jun 5 2023, 09:32:32) [GCC 13.1.1 20230429]
NVidia driver: 535.104.05
CUDA: 12.2.91
LLVM: 15.0.7
Dr.Jit: 0.4.2
Mitsuba: 3.3.0
Is custom build? True
Compiled with: Clang 15.0.7
Variants:
scalar_rgb
scalar_spectral
cuda_ad_rgb
llvm_ad_rgb
Here is the output of the test for the integrators that failed before: pytest-sv-integrators-ad.log. Here's the relevant tail of the above log for convenience:
COMPILE ERROR: failed to create pipeline
Info: Pipeline statistics
module(s) : 2
entry function(s) : 21
trace call(s) : 2
continuation callable call(s) : 0
direct callable call(s) : 5
basic block(s) in entry functions : 401
instruction(s) in entry functions : 11324
non-entry function(s) : 0
basic block(s) in non-entry functions: 0
instruction(s) in non-entry functions: 0
debug information : no
Critical Dr.Jit compiler failure: jit_optix_check(): API error 7251 (OPTIX_ERROR_PIPELINE_LINK_ERROR): "Pipeline link error" in /home/goran/mitsuba3/ext/drjit/ext/drjit-core/src/optix_core.cpp:392.
Fatal Python error: Aborted
Current thread 0x0000147c1128c740 (most recent call first):
File "/home/goran/mitsuba3/src/integrators/tests/test_ad_integrators.py", line 802 in test02_rendering_forward
File "/usr/lib/python3.11/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/usr/lib/python3.11/site-packages/_pytest/python.py", line 1788 in runtest
File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 262 in <lambda>
File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 341 in from_call
File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 222 in call_and_report
File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 133 in runtestprotocol
File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 324 in _main
File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 270 in wrap_session
File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/usr/lib/python3.11/site-packages/_pytest/config/__init__.py", line 166 in main
File "/usr/lib/python3.11/site-packages/_pytest/config/__init__.py", line 189 in console_main
File "/usr/bin/pytest", line 8 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 13)
Aborted (core dumped)
I also still get weird illegal characters in the output and a jit_shutdown(): detected variable leaks error when running the test_bsdf.py tests:
$ pytest -sv ../src/render/tests/test_bsdf.py
Running the full test suite. To skip slow tests, please run 'pytest -m "not slow"'
================================================= test session starts =================================================
platform linux -- Python 3.11.3, pytest-7.4.0, pluggy-1.2.0 -- /usr/bin/python
cachedir: .pytest_cache
rootdir: /home/goran/mitsuba3
configfile: pyproject.toml
plugins: xdist-3.3.1, anyio-3.7.1
collected 4 items
../src/render/tests/test_bsdf.py::test01_ctx_construct PASSED
../src/render/tests/test_bsdf.py::test02_bs_construct PASSED
../src/render/tests/test_bsdf.py::test03_bsdf_attributes[llvm_ad_rgb] PASSED
../src/render/tests/test_bsdf.py::test03_bsdf_attributes[cuda_ad_rgb] PASSED
================================================== warnings summary ===================================================
src/render/tests/test_bsdf.py::test01_ctx_construct
/home/goran/mitsuba3/src/render/tests/test_bsdf.py:10: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of -1 to uint32 will fail in the future.
For the old behavior, usually:
np.array(value).astype(dtype)
will give the desired result (the cast overflows).
assert ctx.component == np.uint32(-1)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================ 4 passed, 1 warning in 0.14s =============================================
jit_shutdown(): detected variable leaks:
- variable r80 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="", dep=[0, 0, 0, 0])
- variable r82 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="333? setp.ge.u32 %p0, %r0, %r2;
@%p0 bra done;
mov.u32 %r3, %�", dep=[0, 0, 0, 0])
- variable r81 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="��?}H��b�4@Y�b�LHY�b�4H\�b�lJY������H�� ", dep=[0, 0, 0, 0])
Segmentation fault (core dumped)
Let me know if running some more specific tests would help pin down this issue, or if it is unrelated to the one this issue was originally opened for.
Thanks again :)
Could you try with the latest PyPI packages?
python -c "import drjit; print(drjit.__version__)" should print 0.4.3, and mitsuba is now at 3.4.0.
Also, could you clear your cache folders ~/.drjit and ~/.nv?
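For reference, here is a small Python sketch that performs both the version check and the cache cleanup (it assumes the default cache locations mentioned above):

import os
import shutil

import drjit
import mitsuba

# Confirm that the latest PyPI releases are the ones being imported.
print(drjit.__version__)    # expected: 0.4.3
print(mitsuba.__version__)  # expected: 3.4.0

# Remove the Dr.Jit and Nvidia kernel caches (default locations assumed).
for cache in ('~/.drjit', '~/.nv'):
    shutil.rmtree(os.path.expanduser(cache), ignore_errors=True)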
This might be a similar error to the one we fixed. Unfortunately, I can't reproduce a single one of your errors or weird warnings on any of the machines that are available to me :slightly_frowning_face:
OK, I just reinstalled the new version via pip and cleared the caches, but I still get the same errors with the new release from PyPI. Here's what it spit out: caustics_optimization.log
I don't know how to run the tests with the PyPI package; is it possible?
In the meantime, I'll build mitsuba 3.4.0 from source and see if that gives any more info. EDIT: the build fails; it keeps complaining that #include <cstdint> is missing from different files. I'll wait to hear back from you before doing anything else.
Later today I'll try installing it on Windows on this machine to see if it's a hardware problem on my end. Considering the age of this system, it may be the root cause.
I don't know how to run the tests with the PyPI package; is it possible?
You should be able to just run pytest: pytest -sv /path/to/mitsuba/src/render/tests/test_bsdf.py. Just make sure you haven't used build/setpath.sh beforehand, so that it is indeed the PyPI package that is used.
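For example, a quick way to confirm which package Python actually picks up (illustrative only):

import mitsuba as mi

print(mi.__file__)     # should point into site-packages, not into your local build/ directory
print(mi.__version__)  # should match the PyPI release, e.g. 3.4.0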
it keeps complaining that #include <cstdint> is missing from different files
GCC 13 has introduced some breaking changes (several standard library headers no longer transitively include <cstdint>). Use a different compiler if you can.
OK I've run the tests with the updated package from PyPI and it still fails at the same test. Here's the truncated output:
src/integrators/tests/test_ad_integrators.py::test02_rendering_forward[cuda_ad_rgb-path-DiffuseAlbedoConfig] jit_optix_compile(): optixPipelineCreate() failed. Please see the PTX assembly listing and error message below:
[...]
Critical Dr.Jit compiler failure: jit_optix_check(): API error 7251 (OPTIX_ERROR_PIPELINE_LINK_ERROR): "Pipeline link error" in /project/ext/drjit-core/src/optix_core.cpp:392.
Which is the same as above.
I'll try it on Windows and report back.
EDIT: Same thing on Windows. I now see no reason to believe this isn't hardware related, i.e., OptiX isn't supported on this card (GTX 1070 Ti) anymore. Thanks again for all the help; I guess I'll just have to find a newer machine to work with.
I've got my hands on an RTX 2080 Ti and I can confirm that none of the above problems can be reproduced on newer hardware. I've also seen elsewhere online that Blender and other renderers that use OptiX have unreliable performance on GTX cards. Maybe a caveat should be added that only RTX cards and newer are supported?
I can confirm the same issue for another old GPU (Tesla P100), even though I am using the most recent mitsuba and drjit versions. A driver downgrade to <v535 solved the issue for me as well 😊
Oh, sorry I missed the last two replies here.
I have a GTX 970, which seemed to run just fine on the latest release with v535. I can give it a go again; maybe I made a mistake.
Re-opening as another case was found in #967, again with an "older" GPU.
Quick-note: I never said it explicitly, but there shouldn't be any feature in Mitsuba that requires newer GPUs currently. At the very least it might require a "recent" driver.
Can confirm that rolling back to GeForce 531.79 enabled it to work again. That is the last r525 driver I believe. It seems all the r535 drivers through 546.XX do not work with my 1080.
Confirming what @tstigen wrote, except that I wasn't able to roll back and am on driver 550.54.
When I tried mitsuba-tutorial/inverse_rendering/shape_optimization.ipynb, I got the error:
Critical Dr.Jit compiler failure: jit_cuda_compile(): compilation failed. Please see the PTX assembly listing and error message below:
.version 6.0
.target sm_50
.address_size 64
More here. (The error message above is actually from this file, which is just a copy of the code blocks from shape_optimization.ipynb.)
Then, when I tried to run caustics_optimization.ipynb, I got the same error about OptiX.
OS: Ubuntu 22.04.4 LTS
Kernel: 6.5.0-27-generic
CPU: Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz
Python: 3.10.12
Mitsuba: 3.5.0
drjit: 0.4.4
nvidia-smi:
Thu Apr 18 22:40:01 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1080 Ti Off | 00000000:01:00.0 On | N/A |
| 0% 60C P2 53W / 280W | 2433MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
...
nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
llvm: 14.0.0
I didn't attempt to build mitsuba from source. I simply installed it via pip install mitsuba and couldn't get it to run.
@ookBook Are you using a double precision variant, by any chance? The code seems to be using double-precision reductions that are not supported on your GPU.
I've only used what's in the tutorials, so per the documentation it seems like this is single precision?
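For what it's worth, here is the quick check I would use to see which variants are available and which one is active (just a sketch using the standard mitsuba calls):

import mitsuba as mi

print(mi.variants())           # all variants shipped with this build
mi.set_variant('cuda_ad_rgb')  # the single-precision variant used by the tutorials
print(mi.variant())            # the variant that is currently active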
Hi @ookBook
Indeed, the tutorials don't run in double precision, but in the PTX that you've sent there are some double-precision instructions. My best guess is that you're running into some case which uses an invalidated pointer, at which point anything can happen.
Can you try compiling the project yourself? It has a few new fixes. Make sure to pip uninstall mitsuba drjit too.
This is deviating from the original topic. There has been a significant refactor to these parts of the codebase anyway. I'll close this issue and open a new one if I find anything wrong with my GTX 970 on the upcoming nanobind version of Mitsuba.
Thanks, @njroussel! I'll try it out
Again, I'm running into the same problem:
Critical Dr.Jit compiler failure: jit_optix_check(): API error 7251 (OPTIX_ERROR_PIPELINE_LINK_ERROR)
as @gibacic reported, except that my GPU is an RTX 3060 and I tried with driver versions 535.171 and 525.147. It seems that downgrading the driver to 525 doesn't work for me. Meanwhile, I sometimes run into another runtime error:
CUDA error: an illegal memory access was encountered
What I'm doing:
I'm trying to add Mitsuba ray tracing to a PyTorch pipeline with @dr.wrap(); the first few epochs go well, but then the error occurs when backpropagation is started by the line loss.backward().
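Roughly, the setup looks like the following sketch (illustrative only; it assumes Dr.Jit 0.4.x, where the decorator is spelled dr.wrap_ad(source='torch', target='drjit') and is renamed to dr.wrap in newer releases):

import torch
import drjit as dr
import mitsuba as mi

mi.set_variant('cuda_ad_rgb')

# In Dr.Jit 0.4.x the decorator is dr.wrap_ad; newer releases call it dr.wrap.
@dr.wrap_ad(source='torch', target='drjit')
def dr_step(x):
    # x arrives here as a Dr.Jit tensor; in the real pipeline this is where
    # the scene parameter update and the mi.render() call happen.
    return x * x

x = torch.full((3,), 2.0, device='cuda', requires_grad=True)
y = dr_step(x)      # comes back as a torch tensor with autograd hooked up
loss = y.sum()
loss.backward()     # this is the point where the crash occurs in my pipeline
print(x.grad)       # expected: 2 * x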
Some more information:
pip
@psylvan Please open a new issue. Your problem seems unrelated to the original post: you're running into issues with the old drivers too.
Summary
Mitsuba fails when trying to use CUDA variants in unit tests and tutorials, with no obvious reason why. This is similar to the previously closed Mitsuba issue 803 and its corresponding Dr.Jit issue 165. The issue manifests in two ways: (1) in the unit tests while running pytest, and (2) when trying to run any Jupyter notebook tutorial that uses CUDA variants.
System configuration
Description
Mitsuba worked properly on this machine about two weeks ago, but has been broken since I updated my GPU drivers. According to the previous Dr.Jit issue 165, it can be fixed by compiling from source using the newest master branch of Dr.Jit. In issue 803, @njroussel said the latest commits for Dr.Jit were in the Mitsuba master branch, but the issue poster said they had to pull master from the Dr.Jit repository themselves. The issue presents exactly the same way whether I compile the vanilla master branch or replace the Dr.Jit component with a fresh pull. That is why my build is listed as "custom" in the system information report above.
Steps to reproduce
1. Fatal error while running pytest
From freshly compiled source, running pytest -v --full-trace dies and gives the following message (relevant tail only):
2. Jupyter notebook tutorials using CUDA variants
In any Jupyter notebook using a CUDA variant, the kernel will die after saying that jit_optix_compile() failed, then dump a bunch of info. For example, if I try to run caustics_optimization.ipynb, I get (relevant head only):
The rest of what is dumped into the console is attached separately due to its extreme length: caustics_optimization.log
Thanks for the help in advance, g