mitsuba-renderer / mitsuba3

Mitsuba 3: A Retargetable Forward and Inverse Renderer
https://www.mitsuba-renderer.org/

CUDA compilation fails with driver v535 #826

Closed gibacic closed 5 months ago

gibacic commented 1 year ago

Summary

Mitsuba fails when trying to use CUDA variants in unit tests and tutorials, with no obvious cause. This is similar to the previously closed Mitsuba issue 803 and its corresponding Dr.Jit issue 165. The issue manifests in two ways: (1) in the unit tests while running pytest and (2) when running any Jupyter notebook tutorial that uses a CUDA variant.

System configuration

System information:
  OS: Arch Linux x86_64
  Kernel: 6.4.5-arch1-1
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  GPU: NVIDIA GeForce GTX 1070 Ti
  Python: 3.11.3 (main, Jun  5 2023, 09:32:32) [GCC 13.1.1 20230429]
  NVidia driver: 535.86.05
  CUDA: 12.2.91
  LLVM: 15.0.7

  Dr.Jit: 0.4.2
  Mitsuba: 3.3.0
     Is custom build? True
     Compiled with: Clang 15.0.7
     Variants:
        scalar_rgb
        scalar_spectral
        cuda_ad_rgb
        llvm_ad_rgb

Description

Mitsuba worked properly on this machine about two weeks ago but has been broken since I updated my GPU drivers. According to the previous Dr.Jit issue 165, it can be fixed by compiling from source using the newest master branch of Dr.Jit. In issue 803, @njroussel said the latest Dr.Jit commits were already in the Mitsuba master branch, but the issue poster said they had to pull the Dr.Jit master branch themselves. The issue presents exactly the same way whether I compile the vanilla master branch or replace the Dr.Jit component with a fresh pull. That is why my build is listed as "custom" in the system information report above.

Steps to reproduce

1. Fatal error while running pytest

From freshly compiled source, running pytest -v --full-trace dies and gives the following message (relevant tail only):

[...]
integrators/tests/test_ad_integrators.py::test01_rendering_primal[llvm_ad_rgb-direct_reparam-TranslateCameraConfig] PASSED    [ 30%]
integrators/tests/test_ad_integrators.py::test02_rendering_forward[cuda_ad_rgb-path-DiffuseAlbedoConfig] Fatal Python error: Aborted

Current thread 0x000014ec077d8740 (most recent call first):
  File "/home/goran/mitsuba3/build/python/mitsuba/python/ad/integrators/common.py", line 1050 in render_forward
  File "/home/goran/mitsuba3/src/integrators/tests/test_ad_integrators.py", line 802 in test02_rendering_forward
  File "/usr/lib/python3.11/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/lib/python3.11/site-packages/_pytest/python.py", line 1788 in runtest
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 341 in from_call
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 324 in _main
  File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/lib/python3.11/site-packages/_pytest/config/__init__.py", line 166 in main
  File "/usr/lib/python3.11/site-packages/_pytest/config/__init__.py", line 189 in console_main
  File "/usr/bin/pytest", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 13)
Aborted (core dumped)

2. Jupyter notebook tutorials using CUDA variants

In any Jupyter notebook using a CUDA variant, the kernel dies after reporting that jit_optix_compile() failed and then dumping a large amount of output. For example, if I try to run caustics_optimization.ipynb, I get (relevant head only):

jit_optix_compile(): optixPipelineCreate() failed. Please see the PTX assembly listing and error message below:

.version 6.0
.target sm_50
.address_size 64

.const .align 8 .b8 params[1728];

.entry __raygen__54e86f123609eb4195c8cf51a9ebc68a() {
    .reg.b8   %b <1624>; .reg.b16 %w<1624>; .reg.b32 %r<1624>;
    .reg.b64  %rd<1624>; .reg.f32 %f<1624>; .reg.f64 %d<1624>;
    .reg.pred %p <1624>;

    call (%r0), _optix_get_launch_index_x, ();
    ld.const.u32 %r1, [params + 4];
    add.u32 %r0, %r0, %r1;
[...]

The rest of what is dumped into the console is attached separately due to its extreme length. caustics_optimization.log

Thanks for the help in advance, g

njroussel commented 1 year ago

Hi @gibacic

Since we initially discovered the two issues you referenced, I have run some additional experiments and now firmly believe that we have a severe issue with the new Nvidia drivers.

For now the recommendation is to hold back on any Nvidia GPU driver update to v535.

I have not had the time to fully dig into this yet. It might be something quite small, or there might be a lot more to uncover; time will tell. The good news is that this regression is definitely due to the driver update. I believe there are two possibilities:

Thank you for your detailed reproducer and logs. Surprisingly, I cannot replicate them. Both the test_ad_integrators.py tests and the tutorial work perfectly fine on two different machines (both on v535). However, some other tests in test_renders.py do fail with similar error messages. In order to get the full logs with pytest, you must also pass the -s flag. Could you please provide us with the logs for those failing tests? Maybe we'll find the common denominator.

gibacic commented 1 year ago

For now the recommendation is to hold back on any Nvidia GPU driver update to v535. The good news is that this regression is definitely due to the driver update.

That's encouraging. I'll try rolling back the drivers to get it working again. I was really enjoying Mitsuba and look forward to getting back into it.

In order to get the full logs with pytest, you must also pass the -s flag. Could you please provide us with the logs for those failing tests? Maybe we'll find the common denominator.

Here are the resulting outputs from the tests in the original post with -s and -sv: pytest-s.log pytest-sv.log pytest-s-integrators-ad.log pytest-sv-integrators-ad.log

However, some other tests in test_renders.py do fail with similar error messages.

I also get the same error with integrators/tests/test_ad_integrators.py and render/tests/test_ad.py. pytest-s-renders.log pytest-s-render-ad.log

But render/tests/test_bsdf.py gives a segmentation fault and illegal (?) characters in the output: pytest-s-render-bsdf.log

    $ pytest -s ../src/render/tests/test_bsdf.py 
Running the full test suite. To skip slow tests, please run 'pytest -m "not slow"' 
======================================================= test session starts ========================================================
platform linux -- Python 3.11.3, pytest-7.4.0, pluggy-1.0.0
rootdir: /home/goran/mitsuba3
configfile: pyproject.toml
plugins: xdist-3.3.1, anyio-3.7.1
collected 4 items                                                                                                                  

../src/render/tests/test_bsdf.py ....

========================================================= warnings summary =========================================================
src/render/tests/test_bsdf.py::test01_ctx_construct
  /home/goran/mitsuba3/src/render/tests/test_bsdf.py:10: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays.  The conversion of -1 to uint32 will fail in the future.
  For the old behavior, usually:
      np.array(value).astype(dtype)
  will give the desired result (the cast overflows).
    assert ctx.component == np.uint32(-1)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================================== 4 passed, 1 warning in 0.15s ===================================================
jit_shutdown(): detected variable leaks:
 - variable r80 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="", dep=[0, 0, 0, 0])
 - variable r82 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="333?rS", dep=[0, 0, 0, 0])
 - variable r81 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="��?`):
        Number of auxiliary rays to trace when performing the coQ", dep=[0, 0, 0, 0])
Segmentation fault (core dumped)

I'll try some other tests after integrators to see which ones fail and report back. Hopefully a pattern will emerge.

Surprisingly, I cannot replicate them. Both the test_ad_integrators.py tests and the tutorial work perfectly fine on two different machines (both on v535).

Unfortunately, I've only got this one old machine to work with right now.

njroussel commented 1 year ago

This has (finally) been identified and fixed in https://github.com/mitsuba-renderer/drjit-core/commit/883e26eb9134d7b8bf4953514c8284365800af3c and been propagated to Mitsuba 3's master branch.

Thank you for the detailed logs!

gibacic commented 1 year ago

Thanks for continuing to look into this issue. Sorry about not posting more findings; I was on vacation and just got back to work yesterday. Unfortunately, I am still experiencing the same issues despite the updates to Dr.Jit and Mitsuba (fresh build from source). It appears to do the same thing when installed via pip.

System information:
  OS: Arch Linux x86_64
  Kernel: 6.4.12-arch1-1
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  GPU: NVIDIA GeForce GTX 1070 Ti
  Python: 3.11.3 (main, Jun  5 2023, 09:32:32) [GCC 13.1.1 20230429]
  NVidia driver: 535.104.05
  CUDA: 12.2.91
  LLVM: 15.0.7

  Dr.Jit: 0.4.2
  Mitsuba: 3.3.0
     Is custom build? True
     Compiled with: Clang 15.0.7
     Variants:
        scalar_rgb
        scalar_spectral
        cuda_ad_rgb
        llvm_ad_rgb

Here is the output of the test for the integrators that failed before: pytest-sv-integrators-ad.log. Here's the relevant tail of the above log for convenience:

COMPILE ERROR: failed to create pipeline
Info: Pipeline statistics
        module(s)                            :     2
        entry function(s)                    :    21
        trace call(s)                        :     2
        continuation callable call(s)        :     0
        direct callable call(s)              :     5
        basic block(s) in entry functions    :   401
        instruction(s) in entry functions    : 11324
        non-entry function(s)                :     0
        basic block(s) in non-entry functions:     0
        instruction(s) in non-entry functions:     0
        debug information                    :    no

Critical Dr.Jit compiler failure: jit_optix_check(): API error 7251 (OPTIX_ERROR_PIPELINE_LINK_ERROR): "Pipeline link error" in /home/goran/mitsuba3/ext/drjit/ext/drjit-core/src/optix_core.cpp:392.
Fatal Python error: Aborted

Current thread 0x0000147c1128c740 (most recent call first):
  File "/home/goran/mitsuba3/src/integrators/tests/test_ad_integrators.py", line 802 in test02_rendering_forward
  File "/usr/lib/python3.11/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/lib/python3.11/site-packages/_pytest/python.py", line 1788 in runtest
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 341 in from_call
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/usr/lib/python3.11/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 324 in _main
  File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/usr/lib/python3.11/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/usr/lib/python3.11/site-packages/pluggy/_callers.py", line 80 in _multicall
  File "/usr/lib/python3.11/site-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/usr/lib/python3.11/site-packages/pluggy/_hooks.py", line 433 in __call__
  File "/usr/lib/python3.11/site-packages/_pytest/config/__init__.py", line 166 in main
  File "/usr/lib/python3.11/site-packages/_pytest/config/__init__.py", line 189 in console_main
  File "/usr/bin/pytest", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 13)
Aborted (core dumped)

I also still get weird illegal characters in the output and a jit_shutdown(): detected variable leaks error when running the test_bsdf.py tests:

    $ pytest -sv ../src/render/tests/test_bsdf.py 
Running the full test suite. To skip slow tests, please run 'pytest -m "not slow"' 
================================================= test session starts =================================================
platform linux -- Python 3.11.3, pytest-7.4.0, pluggy-1.2.0 -- /usr/bin/python
cachedir: .pytest_cache
rootdir: /home/goran/mitsuba3
configfile: pyproject.toml
plugins: xdist-3.3.1, anyio-3.7.1
collected 4 items                                                                                                     

../src/render/tests/test_bsdf.py::test01_ctx_construct PASSED
../src/render/tests/test_bsdf.py::test02_bs_construct PASSED
../src/render/tests/test_bsdf.py::test03_bsdf_attributes[llvm_ad_rgb] PASSED
../src/render/tests/test_bsdf.py::test03_bsdf_attributes[cuda_ad_rgb] PASSED

================================================== warnings summary ===================================================
src/render/tests/test_bsdf.py::test01_ctx_construct
  /home/goran/mitsuba3/src/render/tests/test_bsdf.py:10: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays.  The conversion of -1 to uint32 will fail in the future.
  For the old behavior, usually:
      np.array(value).astype(dtype)
  will give the desired result (the cast overflows).
    assert ctx.component == np.uint32(-1)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================ 4 passed, 1 warning in 0.14s =============================================
jit_shutdown(): detected variable leaks:
 - variable r80 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="", dep=[0, 0, 0, 0])
 - variable r82 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="333?   setp.ge.u32 %p0, %r0, %r2;
    @%p0 bra done;

    mov.u32 %r3, %�", dep=[0, 0, 0, 0])
 - variable r81 is still being referenced! (ref=1, ref_se=0, type=float32, size=1, stmt="��?}H��b�4@Y�b�LHY�b�4H\�b�lJY������H�� ", dep=[0, 0, 0, 0])
Segmentation fault (core dumped)

Let me know if running some more specific tests would help pin down this issue, or if it is unrelated to the one that originally opened this issue.

Thanks again :)

njroussel commented 1 year ago

Could you try with the latest PyPI packages? python -c "import drjit; print(drjit.__version__)" should print 0.4.3 and mitsuba is now at 3.4.0. Also, could you clear your cache folders ~/.drjit and ~/.nv?

This might be a similar error to the one we fixed. Unfortunately, I can't reproduce a single error or your weird warnings on any of the machines that are available to me :slightly_frowning_face:
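For readers following along, the version check and cache cleanup suggested above can be sketched in Python. This is a minimal sketch and not part of Dr.Jit or Mitsuba: the helper names `installed_version` and `clear_jit_caches` are hypothetical, and the only details taken from this thread are the package names and the cache folders ~/.drjit and ~/.nv. Versions are read from package metadata rather than by importing the modules, so the check still works if the import itself crashes.

```python
import shutil
from importlib import metadata
from pathlib import Path


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def clear_jit_caches(home, names=(".drjit", ".nv")):
    """Delete the cache folders named in this thread; return the ones removed."""
    removed = []
    for name in names:
        cache_dir = Path(home) / name
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed.append(cache_dir)
    return removed
```

Calling `installed_version("drjit")` and `installed_version("mitsuba")` shows what pip actually installed, and `clear_jit_caches(Path.home())` removes both folders if they exist. The caches are regenerated on later runs, so the first kernel compilation afterwards will likely be slower.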

gibacic commented 1 year ago

OK just reinstalled the new version via pip and cleared the caches, but still get the same errors with the new release from PyPI. Here's what it spit out: caustics_optimization.log

I don't know how to run the tests with the PyPI package, is it possible?

In the meantime, I'll build mitsuba 3.4.0 from source and see if that gives any more info. EDIT: the build fails, it keeps complaining that #include <cstdint> is missing from different files. I'll wait to hear back from you before doing anything else.

Later today I'll try installing it on Windows on this machine to see if it's a hardware problem on my end. Considering the age of this system, it may be the root cause.

njroussel commented 1 year ago

I don't know how to run the tests with the PyPI package, is it possible?

You should be able to just run pytest: pytest -sv /path/to/mitsuba/src/render/tests/test_bsdf.py. Just make sure you haven't used build/setpath.sh beforehand, so that it is indeed the PyPI package that is used.
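One way to confirm which copy of Mitsuba Python would pick up is to ask the import system where the module lives, without importing it. The sketch below is an illustration only: `module_origin` and `looks_like_source_build` are hypothetical helpers, and the `/build/python/` marker is an assumption based on the source-build paths that appear in the tracebacks earlier in this thread.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def looks_like_source_build(origin, build_marker="/build/python/"):
    """Heuristic (assumption): source builds in this thread live under build/python."""
    return origin is not None and build_marker in origin.replace("\\", "/")
```

If `looks_like_source_build(module_origin("mitsuba"))` is true, the tests are probably still running against the source build rather than the PyPI package.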

it keeps complaining that #include <cstdint> is missing from different files

GCC 13 has introduced some breaking changes. Use a different compiler if you can.

gibacic commented 1 year ago

OK I've run the tests with the updated package from PyPI and it still fails at the same test. Here's the truncated output:

src/integrators/tests/test_ad_integrators.py::test02_rendering_forward[cuda_ad_rgb-path-DiffuseAlbedoConfig] jit_optix_compile(): optixPipelineCreate() failed. Please see the PTX assembly listing and error message below:
[...]
Critical Dr.Jit compiler failure: jit_optix_check(): API error 7251 (OPTIX_ERROR_PIPELINE_LINK_ERROR): "Pipeline link error" in /project/ext/drjit-core/src/optix_core.cpp:392.

Which is the same as above.

I'll try it on Windows and report back.

EDIT: Same thing on Windows. I now see no reason to believe this isn't hardware related, i.e., OptiX isn't supported on this card (GTX 1070 Ti) anymore. Thanks again for all the help; I guess I'll just have to find a newer machine to work with.

gibacic commented 1 year ago

I've got my hands on an RTX 2080 Ti and I can confirm that none of the above problems can be reproduced on newer hardware. I've also seen online elsewhere that Blender and other renderers that use OptiX have unreliable performance on GTX cards. Maybe a caveat should be added that only RTX cards and newer are supported?

akdd11 commented 1 year ago

I can confirm the same issue for another old GPU (Tesla P100), even though I am using the most recent mitsuba and drjit versions. A driver downgrade to <v535 solved the issue for me as well 😊

njroussel commented 1 year ago

Oh, sorry I missed the last two replies here.

I have a GTX970 which seemed to run just fine on the latest release with v535. I can give it a go again, maybe I made a mistake.

njroussel commented 1 year ago

Reopening as another case was found in #967. Again with an "older" GPU.

Quick note: I never said it explicitly, but there shouldn't be any feature in Mitsuba that requires newer GPUs currently. At the very least it might require a "recent" driver.

tstigen commented 10 months ago

Can confirm that rolling back to GeForce 531.79 enabled it to work again. That is the last r525 driver I believe. It seems all the r535 drivers through 546.XX do not work with my 1080.
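For anyone comparing their setup against the reports in this thread, the GPU name and driver version can be read from nvidia-smi and checked against the 535 threshold mentioned above. This is a rough sketch: the helper names are hypothetical, and the threshold reflects only what posters here observed on older cards, not an official compatibility statement.

```python
import subprocess


def parse_gpu_report(csv_text):
    """Turn 'name, driver_version' CSV lines into (name, driver_major) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside a GPU name cannot break parsing.
        name, driver = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(driver.split(".")[0])))
    return gpus


def query_gpus():
    """Ask nvidia-smi for GPU name and driver version (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_report(result.stdout)


def driver_at_or_after(driver_major, threshold=535):
    """True if the driver major version is in the range reported as problematic here."""
    return driver_major >= threshold
```

On a machine with an NVIDIA driver installed, `query_gpus()` returns one pair per GPU; `driver_at_or_after` then only says whether the driver falls in the range that posters here had trouble with.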

ookBook commented 7 months ago

Confirming what @tstigen wrote, except I wasn't able to roll back and am on driver 550.54

When I tried mitsuba-tutorial/inverse_rendering/shape_optimization.ipynb, I got the error:

Critical Dr.Jit compiler failure: jit_cuda_compile(): compilation failed. Please see the PTX assembly listing and error message below:

.version 6.0
.target sm_50
.address_size 64

(more here. The error message is actually from this file, which is just a copy of the code blocks from shape_optimization.ipynb)

Then when I tried to run caustics_optimization.ipynb, I got the same error about OptiX.

OS: Ubuntu 22.04.4 LTS
Kernel: 6.5.0-27-generic
CPU: Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz
Python: 3.10.12
Mitsuba: 3.5.0
drjit: 0.4.4

nvidia-smi:

Thu Apr 18 22:40:01 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:01:00.0  On |                  N/A |
|  0%   60C    P2             53W /  280W |    2433MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
...

nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

llvm: 14.0.0

I didn't attempt to build mitsuba from source. I simply installed it via pip install mitsuba and couldn't get it to run.

wjakob commented 7 months ago

@ookBook Are you using a double precision variant, by any chance? The code seems to be using double-precision reductions that are not supported on your GPU.

ookBook commented 7 months ago

I've only used what's in the tutorials, so

  1. cuda_ad_rgb
  2. llvm_ad_rgb

Per the documentation, it seems like this is single precision?
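As a quick sanity check on the question above: Mitsuba variant names encode precision in the name, with double-precision variants carrying a `double` suffix (for example `scalar_rgb_double`). The helper below is a hypothetical illustration of that naming convention, not a Mitsuba API.

```python
def uses_double_precision(variant):
    """True if a Mitsuba variant name carries the 'double' component."""
    return "double" in variant.split("_")
```

Applied to the two variants listed above, this returns False for both, which matches the reading that the tutorials run in single precision.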

njroussel commented 5 months ago

Hi @ookBook

Indeed, the tutorials don't run in double precision, but the PTX that you've sent contains some double-precision instructions. My best guess is that you're running into some case that uses an invalidated pointer, at which point anything can happen. Can you try compiling the project yourself? It has a few new fixes. Make sure to pip uninstall mitsuba drjit too.
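To see where double-precision instructions show up in a dumped PTX listing, a small scan for the `.f64` type suffix is usually enough. The sketch below is an assumption-laden illustration (the function name is hypothetical): it skips `.reg` declaration lines because the listings in this thread declare an f64 register bank in every kernel, whether or not it is used.

```python
import re

# Matches a PTX type suffix such as ".f64" attached to an instruction.
_F64 = re.compile(r"\.f64\b")


def double_precision_lines(ptx):
    """Return (line_number, text) for non-declaration lines that mention .f64."""
    hits = []
    for number, line in enumerate(ptx.splitlines(), start=1):
        text = line.strip()
        # Skip blanks, comments, and register declarations.
        if not text or text.startswith((".reg", "//")):
            continue
        if _F64.search(text):
            hits.append((number, text))
    return hits
```

Running this over the full PTX from a failing log gives the line numbers to look at first. An empty result would suggest the kernel stays in single precision.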

This is deviating from the original topic. There has been a significant refactor to these parts of the codebase anyway. I'll close this issue and open a new one if I find anything wrong with my GTX 970 on the upcoming nanobind-version of Mitsuba.

ookBook commented 5 months ago

Thanks, @njroussel! I'll try it out

psylvan commented 5 months ago

I am running into the same problem again:

Critical Dr.Jit compiler failure: jit_optix_check(): API error 7251 (OPTIX_ERROR_PIPELINE_LINK_ERROR)

as @gibacic reported, except that my GPU is an RTX 3060 and I tried driver versions 535.171 and 525.147. Downgrading the driver to 525 doesn't seem to work for me. Meanwhile, I sometimes run into another runtime error:

CUDA error: an illegal memory access was encountered

What I'm doing:

I'm trying to add Mitsuba ray tracing to a PyTorch pipeline with @dr.wrap(). The first few epochs go well, but then the error occurs when backpropagation starts at the line loss.backward().

Some more information:

njroussel commented 5 months ago

@psylvan Please open a new issue. Your problem seems unrelated to the original post: you're running into issues with the old drivers too.