mitsuba-renderer / drjit

Dr.Jit — A Just-In-Time-Compiler for Differentiable Rendering
BSD 3-Clause "New" or "Revised" License

Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS) #125

Open maxfrei750 opened 1 year ago

maxfrei750 commented 1 year ago

I'm currently running into an issue, where I receive the following error:

Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /src/mitsuba3/ext/drjit/ext/drjit-core/src/util.cpp:203.
Aborted (core dumped)

As context: I'm using pytorch in conjunction with mitsuba in a GAN style approach. Of course, I'm happy to provide more context, if deemed helpful.

Unfortunately, my code is rather large, and a small reproduction script will not be easy to come up with. So I'm hoping that we can maybe sort it out using e.g. debugging tools.

I tried to take a stab at it with cuda-memcheck, but that didn't yield any results. I also tried compiling in debug mode, but unfortunately, I ran into a different issue (#124) there.

Is there anything else I can do, before #124 is resolved? And do you think that the issues could be related to me using clang 11, python 3.10 and cuda 11.7.1?

Thanks for taking the time.

njroussel commented 1 year ago

Hi @maxfrei750

We've had similar errors in the past which would only show up in some very specific setups. It's still likely that this is indeed an issue with some Dr.Jit internals.

This should (hopefully) not be related to your specific cuda/python/clang versions. It's still worth up/down-grading them and testing your code, just in case. IIRC cuda-gdb is also the recommended tool for OptiX debugging. Please report back with any progress, or smaller reproducers :)

maxfrei750 commented 1 year ago

Thanks for the advice. I tried debugging using cuda-gdb but it wouldn't attach to the python process. I then started the script from the regular gdb, but the results seem quite generic to me:

(gdb) run train.py --config-name real_sem_adversarial
Starting program: /usr/bin/python train.py --config-name real_sem_adversarial
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7f4e19061640 (LWP 3354)]
[New Thread 0x7f4e04cfd640 (LWP 3355)]
[New Thread 0x7f4e044fc640 (LWP 3356)]
[New Thread 0x7f4e03717640 (LWP 3357)]
[Thread 0x7f4e19061640 (LWP 3354) exited]
[New Thread 0x7f4e19061640 (LWP 3358)]
[New Thread 0x7f4dd1dff640 (LWP 3359)]
[New Thread 0x7f4dcf5fe640 (LWP 3360)]
[New Thread 0x7f4dccdfd640 (LWP 3361)]
[New Thread 0x7f4dca5fc640 (LWP 3362)]
[New Thread 0x7f4dc7dfb640 (LWP 3363)]
[New Thread 0x7f4dc55fa640 (LWP 3364)]
[New Thread 0x7f4dc2df9640 (LWP 3365)]
[New Thread 0x7f4dc05f8640 (LWP 3366)]
[New Thread 0x7f4dbddf7640 (LWP 3367)]
[New Thread 0x7f4db95f6640 (LWP 3368)]
[New Thread 0x7f4db6df5640 (LWP 3369)]
[New Thread 0x7f4db45f4640 (LWP 3370)]
[New Thread 0x7f4db1df3640 (LWP 3371)]
[New Thread 0x7f4daf5f2640 (LWP 3372)]
[New Thread 0x7f4dacdf1640 (LWP 3373)]
[New Thread 0x7f4dac5f0640 (LWP 3374)]
[New Thread 0x7f4da7def640 (LWP 3375)]
[New Thread 0x7f4da55ee640 (LWP 3376)]
[New Thread 0x7f4da2ded640 (LWP 3377)]
[New Thread 0x7f4da05ec640 (LWP 3378)]
[New Thread 0x7f4d9ddeb640 (LWP 3379)]
[New Thread 0x7f4d9b5ea640 (LWP 3380)]
[New Thread 0x7f4d1dfbc640 (LWP 3402)]
[New Thread 0x7f4d1d7bb640 (LWP 3403)]
[New Thread 0x7f4d18fba640 (LWP 3404)]
[New Thread 0x7f4d167b9640 (LWP 3405)]
[New Thread 0x7f4d13fb8640 (LWP 3406)]
[New Thread 0x7f4d117b7640 (LWP 3407)]
[New Thread 0x7f4d0efb6640 (LWP 3408)]
[New Thread 0x7f4d0c7b5640 (LWP 3409)]
[New Thread 0x7f4d09fb4640 (LWP 3410)]
[New Thread 0x7f4d097b3640 (LWP 3411)]
[New Thread 0x7f4d04fb2640 (LWP 3412)]
[New Thread 0x7f4d047b1640 (LWP 3413)]
[New Thread 0x7f4cfffb0640 (LWP 3414)]
[New Thread 0x7f4cff7af640 (LWP 3415)]
[New Thread 0x7f4cfafae640 (LWP 3416)]
[New Thread 0x7f4cfa7ad640 (LWP 3417)]
[New Thread 0x7f4cf5fac640 (LWP 3418)]
[New Thread 0x7f4cf37ab640 (LWP 3419)]
[New Thread 0x7f4cf0faa640 (LWP 3420)]
[New Thread 0x7f4cee7a9640 (LWP 3421)]
[New Thread 0x7f4cedfa8640 (LWP 3422)]
[New Thread 0x7f4ce97a7640 (LWP 3423)]
[New Thread 0x7f4ce6fa6640 (LWP 3424)]
[New Thread 0x7f4cdcba5640 (LWP 3425)]
[Detaching after vfork from child process 3426]
[Detaching after vfork from child process 3427]
[Detaching after vfork from child process 3428]
[Detaching after vfork from child process 3429]
[Detaching after vfork from child process 3430]
[Detaching after vfork from child process 3431]
[New Thread 0x7f4cdb033640 (LWP 3466)]
[Thread 0x7f4cdb033640 (LWP 3466) exited]
[New Thread 0x7f4cdb033640 (LWP 3467)]
[Detaching after vfork from child process 3476]
[New Thread 0x7f4cda832640 (LWP 3480)]
[New Thread 0x7f4cda031640 (LWP 3490)]
[New Thread 0x7f4cd9830640 (LWP 3491)]
[New Thread 0x7f4cd89f2640 (LWP 3492)]
[New Thread 0x7f4cb7bff640 (LWP 3501)]
  0%|                                                                                                                                        | 0/20000 [00:00<?, ?it/s][New Thread 0x7f4cb69e6640 (LWP 3516)]
[New Thread 0x7f4c9ffff640 (LWP 3517)]
[New Thread 0x7f4c9f7fe640 (LWP 3518)]
[New Thread 0x7f4c9effd640 (LWP 3519)]
[New Thread 0x7f4c9e7fc640 (LWP 3520)]
[New Thread 0x7f4c9dffb640 (LWP 3521)]
[New Thread 0x7f4c9d7fa640 (LWP 3522)]
[New Thread 0x7f4c99561640 (LWP 3523)]
[New Thread 0x7f4c98d60640 (LWP 3524)]
[New Thread 0x7f4c82bff640 (LWP 3525)]
[New Thread 0x7f4c33fff640 (LWP 3526)]
[New Thread 0x7f4c337fe640 (LWP 3527)]
[New Thread 0x7f4c32ffd640 (LWP 3528)]
[New Thread 0x7f4c327fc640 (LWP 3529)]
[New Thread 0x7f4c31ffb640 (LWP 3530)]
[New Thread 0x7f4c317fa640 (LWP 3544)]
[New Thread 0x7f4c30ff9640 (LWP 3557)]
  0%|                                                                                                                            | 18/20000 [00:57<17:25:26,  3.14s/it][New Thread 0x7f4b85fff640 (LWP 3998)]
  1%|▋                                                                                                                          | 115/20000 [06:13<17:55:03,  3.24s/it]

Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /src/mitsuba3/ext/drjit/ext/drjit-core/src/util.cpp:203.

Thread 1 "python" received signal SIGABRT, Aborted.
0x00007f4e1a416a7c in pthread_kill () from /usr/lib/x86_64-linux-gnu/libc.so.6

maxfrei750 commented 1 year ago

I narrowed down the error a little further, and it occurs inside the step() function of the Adam optimizer. Since (apart from some BSDF parameters) I'm also optimizing a texture, I now set mask_updates=True (as recommended here). So far, the crash no longer occurs, at least not at an early iteration as before. Maybe it just postponed the problem, maybe it solved it. I'm well aware that this is practically unresolvable without the code at hand, so I'll close this issue for now. Of course, feel free to share your thoughts on this, if the new observations helped to narrow the problem down. Thanks!
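For readers unfamiliar with the option, here is a simplified pure-Python sketch of what masking updates means in an Adam step: entries whose gradient is exactly zero in the current iteration are skipped entirely, so neither the parameter nor its moment estimates are read or written. This is not Mitsuba's implementation; the function name `adam_step` and the flat-list layout are assumptions made for illustration only.

```python
import math


def adam_step(params, grads, m, v, t, lr=0.01, beta1=0.9, beta2=0.999,
              eps=1e-8, mask_updates=False):
    """One Adam step over flat Python lists, updated in place.

    Illustrative only (not Mitsuba's code). With mask_updates=True,
    entries whose gradient is exactly zero are left untouched,
    including their first and second moment estimates.
    """
    for i, g in enumerate(grads):
        if mask_updates and g == 0.0:
            continue  # no read/write for this entry in this iteration
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

Without masking, an entry with zero gradient still moves because of its accumulated momentum, which is why a large texture with mostly zero gradients sees far fewer writes once masking is on.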

njroussel commented 1 year ago

Thank you for the update. This is interesting, but also makes sense - somewhere there's an illegal read/write and mask_updates has just reduced the number of reads/writes.

Could you point out the integrator/BSDF/texture plugins you are using (if they are not custom)? At least that might help me narrow down similar issues in the future.

maxfrei750 commented 1 year ago

I already figured that I should write more about my setup:

I use an envmap and two principled BSDFs, all of which are optimized using Adam. Integrator is PRB. The optimization criterion depends on a neural network, so wrap_ad is involved as well. In between iterations, geometry is replaced. Just for the record: I'd be happy to grant you access to the code, but I feel that reviewing/debugging it would be more work than you could justify for the issue.

Unfortunately, the problem still persists, so I'm carrying out further experiments. However, reopening this issue didn't seem warranted to me, since I don't see how it would help the community, without a reproducer. If you think that it's worth a shot, then I'd be happy to try debugging it myself, but that would require some guidance regarding the required tools and the best strategy, since I'm not familiar with C/CUDA. So I feel that it is probably too much to ask for. :-)

merlinND commented 1 year ago

Hi @maxfrei750,

One idea that comes to mind: if you're integrating PyTorch and Mitsuba, you might need to add explicit synchronization points when crossing over from one to the other. Something like:

import drjit as dr

# ... some PyTorch computation
dr.sync_thread()

results = ...  # placeholder for the Mitsuba computations
dr.eval(results)
dr.sync_thread()

results_pytorch = ...  # placeholder: go back to PyTorch world if needed

I believe there are some built-in protections in DrJit, but this particular use-case is one of the few where explicit synchronization is needed.

Edit: used to be needed, see Nicolas' response below.

maxfrei750 commented 1 year ago

@merlinND Thanks for the hint! I tried to add a dr.sync_thread() before and after the Mitsuba generator update (not yet before and after the PyTorch discriminator update). Unfortunately, it still crashed, but now with a new error:

Critical Dr.Jit compiler failure: cuda_check(): API error 0716 (CUDA_ERROR_MISALIGNED_ADDRESS): "misaligned address" in /src/mitsuba3/ext/drjit/ext/drjit-core/src/util.cpp:203.
Aborted (core dumped)

Is that just a different symptom of the same problem, or did we make some progress there?

Debugging the problem is especially difficult, because it takes a few hours to reach the bug and there is no Python error stack. Is there a way to get one?
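One standard-library option for the missing Python stack is `faulthandler`, which dumps the Python traceback when the process receives a fatal signal such as the SIGABRT seen above. The sketch below writes to a log file; the path and the variable names are assumptions, and whether the traceback points at the real culprit here is not something this thread confirms.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; any writable path works.
CRASH_LOG_PATH = os.path.join(tempfile.gettempdir(), "drjit_crash_traceback.log")

# Keep the file object alive for the lifetime of the process:
# faulthandler writes to its file descriptor when a fatal signal arrives.
crash_log = open(CRASH_LOG_PATH, "w")

# Dump the Python traceback of all threads on SIGSEGV, SIGFPE,
# SIGABRT, SIGBUS and SIGILL.
faulthandler.enable(file=crash_log, all_threads=True)

# ... the training loop would follow here ...
```

The same handler can be enabled without code changes via `python -X faulthandler train.py` or by setting `PYTHONFAULTHANDLER=1`, in which case the traceback goes to stderr. Because GPU kernels are launched asynchronously, the reported Python frame is where the failure was detected, which is not necessarily where the faulty kernel was queued.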

somewhere there's an illegal read/write and mask_updates has just reduced the number of reads/writes.

I had the idea to provoke the bug, to make debugging faster. Therefore, I tried to increase the number of reads/writes by drastically increasing the size of my envmap texture. Surprisingly, that did not result in it crashing any faster.

Would you recommend giving cuda-gdb another shot?

maxfrei750 commented 1 year ago

I did some more testing and found that older versions of mitsuba (and therefore drjit) work for me. To find the exact commit that causes the error, I'm currently doing a bisection.

So far, I narrowed the problem down to this range of commits:

204d89b2 Minor release v3.1.0          <- bad
deb5f644 [GHA] Remove authentication to checkout repository during wheel build (repository is now public)
a9277985 Increase PyPI version of DrJit, update drjit submodule
b0458bb6 Update release notes
df56a290 Update drjit (remove predicates)
b9a55c62 Minor fix in HDRFilm::develop()
c55e9596 Lower required Nvidia GPU compute capability
90b52dd6 Set correct CUDA contexts before OptiX operations when necessary
8ba8528a deprecated samples_per_pass parameter, removed oldpath.cpp
47a3958e Fix tests in test_cylinder and test_rectangle
7d846c60 Upgrade drjit submodule
5c071927 Revert "Remove make_opaque() in plugins"
b9e95b20 quench warning on recent Clang versions
e7cc1274 minor: quench warning on MSVC
d8db806a fix progress bar on windows
af06e8a9 Support large launches (#394)
28660f3a obj.cpp: don't use memory-mapped I/O on windows
56ea9100 fixed a few issues detected via MSVC warnings
c2285ad8 Update drjit submodule
119e8ffd Use BSDF::eval_pdf_sample() in path.cpp
9f2ec267 Update drjit submodule
815b30f2 Update drjit submodule
f3741593 Update drjit submodule
5a0f766c Remove make_opaque() in plugins
8aaa67be Update drjit submodule
c6b40c4f Add missing methods to Instance and ShapeGroup
0a677400 Update data submodule
cbc0f476 Fix deadlock when building nested scenes with parallel (LLVM)
df79cb3e Use jit_optix_overwrite_sbt to share SBT across two scenes.
c1bfd8f1 Fix eval_parameterization masking after ray tracing
18cf8fb4 Avoid virtual function call in Shape::eval_parameterization
241c85df Remove useless print in tests
98c4037e Add synchronisation in custom shape updates
b5d8c5dc Add unit tests for differentiable disk & rectangle
d8bb5a83 Fix sphere gradients with FollowShape
ec64b7cb Make Rectangle primitive differentiable
c8132071 Make Disk primitive differentiable
59faf47b Add DetachShape test to test_sphere.py
13feb6c5 Minor fixes in sphere.cpp and cylinder.cpp
54d2d3ab Make instances differentiable
29bcc757 Sphere to_world parameter is discontinuous
e0871aa8 Make ShapeGroup traversable and updatable
a598a897 Minor improvement in SceneParameters.__repr__
50df4c65 Make cylinder shape differentiable
f5dbedec Make sphere shape differentiable
685d0ead Update drjit submodule
2d773473 Resolve symbolic links when checking for drjit location. Fixes #302
9cc3faf9 Remove unwanted debug prints
a0101e97 Fix SceneParameters hashing for TensorXf in scalar modes
55293189 Check for driver version before running Optix Denoiser tests
28ec2cf3 Handle monocromatic variants with AOV alebdo
59af884e Implement diffuse reflectance for most BSDFs
31271c95 Change interface from pointers to references in OptixDenoiser
990c2f84 Add tests for temporal denoising and denoising with a multichannel bitmap
71ae82fb Add documentation for OptixDenoiser, improvements to API reference generation
0f7ed766 Guarantee that Tensors are evaluated before being passed to OptiX denoiser
7784c05d OptixDenoiser: take transform as argument to change normals to sensor frame
4635a386 Rename Denoiser to OptixDenoiser, only build it in CUDA modes
7c8d52e7 Use function call operator for OptiX Denoiser
ce4c8be3 Add tests for Denoiser constructor, simple denoising: nothing, albedo, normals
34826f2c Fix stub generation in edge case when default value contains brackets
6d838914 Add input argument validation for denoiser
a948a5c0 Fix ImageBlock documentation typo
72af6fc2 Implement OptiX denoiser for version 7.4
1323497f Created stub for the optix denoiser
ea513f73 Make other sensor parameters differentiable
fad031ac Add sample_direction to thinlens camera
14ef6362 Add unit test for translating camera grad
ef9f559e WIP make Perspective sensor parameters differential
d3a7580c Remove unnecessary make_opaque in principled BSDFs
61b9516a Fix mi.luminance() for monochromatic modes
4ebf700c Add bindings for PluginManager.create_object
e6d5e52f Fix missing clear_shapes_dirty()
47408fe6 Fix CUDA_ERROR_ILLEGAL_ADDRESS error        <- good

Since I now think that there might be a chance to find the cause of the problem, I'll reopen the issue for now.

@merlinND @njroussel If any of these commits seems especially promising to you (e.g. drjit updates), then please tell me, so that I can deviate from the default bisection strategy. Currently, I'm testing 685d0ead Update drjit submodule, since the commit at the exact half didn't compile for me. I'll keep you updated...

njroussel commented 1 year ago

I don't think explicit synchronization is ever needed now - we're using the default CUDA stream for the current process in Dr.Jit which synchronizes/step-locks with other streams IIRC. I guess it's still worth trying, just in case this is also bugged :sweat:

The issue with cuda-gdb is that it would most likely point out a line in the compiled CUDA kernel which will most likely be very hard to re-map to the original mitsuba/drjit source code. The few times I used it, it was mostly just to confirm that there was indeed some bug in the generated kernels.

Indeed, I would assume this was introduced in a commit where we update the drjit submodule. Another possibility is commits related to OptiX changes, for example df79cb3e to c6b40c4f (approximate range) change quite a few things in how the acceleration structure is built/managed if I remember correctly. A good indicator for this is if the commit introduces changes to the scene_optix.inl file. You might also want to skip any commits that do not pass our CI tests suite (green check mark next to the commit on GitHub), that should save some time with commits that aren't even "valid".
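The skipping strategy can be sketched as a bisection that tolerates untestable commits (for example ones that do not compile or fail CI). This is a minimal illustration, not a replacement for `git bisect skip`; the function name `bisect_first_bad` and the three-valued `test` callback are assumptions, and it presumes a single good-to-bad transition with commits ordered oldest to newest (the lists in this thread are newest first).

```python
def bisect_first_bad(commits, test):
    """Narrow down the first bad commit.

    commits: ordered oldest to newest; commits[0] is known good and
             commits[-1] is known bad.
    test:    callable returning True (bad), False (good) or None
             (untestable, e.g. the commit does not build).
    Returns (last_known_good, first_known_bad). Untestable commits
    between the two remain possible culprits.
    """
    verdicts = {}  # cache so every commit is tested at most once

    def verdict(i):
        if i not in verdicts:
            verdicts[i] = test(commits[i])
        return verdicts[i]

    lo, hi = 0, len(commits) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Try the midpoint first, then its nearest neighbours.
        for i in sorted(range(lo + 1, hi), key=lambda i: abs(i - mid)):
            result = verdict(i)
            if result is not None:
                if result:
                    hi = i
                else:
                    lo = i
                break
        else:
            break  # every remaining commit in between is untestable
    return commits[lo], commits[hi]
```

If the returned pair is not adjacent, the regression lies somewhere in the untestable commits between them, which would then need manual inspection.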

Good luck :crossed_fingers:

maxfrei750 commented 1 year ago

Thanks for the hints!

I don't think explicit synchronization is ever needed now

I tried explicit synchronization at different parts of my code, but it didn't help, which is a good thing in my opinion.

Indeed, I would assume this was introduced in a commit where we update the drjit submodule.

I was afraid that this might be the case, since that would require another bisection, this time on the level of drjit. And maybe, even a third one, on the level of drjit-core. :sweat_smile: Since each of these tests takes 4–7 hours, that's going to take some time. But let's cross that bridge, when we come to it...
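To put a number on "some time", here is a rough worst-case estimate, assuming a plain binary search with no skipped commits. The helper name `bisection_cost` is made up for this sketch, and the example value of 73 candidates is an illustrative figure of about the size of the range listed earlier, not a measured count.

```python
import math


def bisection_cost(num_candidates, hours_per_test=(4, 7)):
    """Worst-case number of test runs and total hours for a binary search.

    num_candidates: number of commits that could be the first bad one.
    hours_per_test: (fastest, slowest) duration of a single run.
    """
    if num_candidates > 1:
        steps = math.ceil(math.log2(num_candidates))
    else:
        steps = 0
    return steps, steps * hours_per_test[0], steps * hours_per_test[1]


steps, min_hours, max_hours = bisection_cost(73)
print(f"{steps} runs, {min_hours} to {max_hours} hours")
```

A nested bisection on drjit and then drjit-core would add its own runs on top, so the costs of the levels sum up rather than multiply.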

Thanks for your help and your awesome work on this project. :+1:

maxfrei750 commented 1 year ago

Over the weekend, I made progress on the bisection. While I initially thought that the error was pre-3.1, that was actually a different error, which has been fixed by now. Therefore, I widened the search range to 3.2.1. These are the results:

------------- Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /src/mitsuba3/ext/drjit/ext/drjit-core/src/util.cpp:203.

2c034785 Patch release v3.2.1
50087d07 Bump drjit version requirements to v0.4.1
a48e0423 Update drjit submodule
0338d57f Update release notes
c3edd688 Hide rendering progress bar depending on logging level
2322b915 Improve numerical stability of MIS weights for Pyhton integrators
a70bab4d Add tests for directionalarea plugin
d306a082 Fix self intersections produced by directionalarea plugin
8ad56b20 Update drjit submodule
8e441f03 Fix Struct converter when SSE is not available
b876a4f5 Fix albedo AOV in scalar modes
43a6b65b Mark PerspectiveCamera parameters as differentiable in the documentation
a517a52e Remove unnecessary test output
377215e8 Add test for parallel scene loading with Python plugins
93bb99b1 Set missing variant information before constructing Python plugins
b253ddee Update drjit submodule
cb057360 Update release notes
0b483bff Numerically robust ray-sphere intersection (CPU)
7d46e101 Numerically robust ray-sphere intersection (CUDA)
a02b57e7 Update release notes
04401467 Update tutorials submodule
8f03c7db Add missing Python bindings for Sensor/Emitter/Endpoint

------------- Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /src/mitsuba3/ext/drjit/ext/drjit-core/src/util.cpp:203.

0e497163 blendphase: Fix missing initialization
ed7faa70 Various documentation improvements
afeefedc Error-compensated sample accumulation in ImageBlock
64fedcd4 fix ImageBlock issues

------------- Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /src/mitsuba3/ext/drjit/ext/drjit-core/src/util.cpp:203.

16c8d2ae update Dr.Jit version
bd6fefe8 Correct mis_compensation docs
0fe59888 Fix several spelling errors

------------- Critical Dr.Jit compiler failure: jit_var(r50688206): unknown variable!

8ee26060 Minor release v3.2.0
71185640 Bump drjit version requirements
4d700bb2 Update release notes
dfd49fe9 Fix formatting in release notes
37b0ed05 Documentation: remove edit button, limit TOC depth on API reference page

------------- Critical Dr.Jit compiler failure: jit_var(r50688206): unknown variable!

90f0d733 Update drjit submodule
16b133ec Fix memory leak in OptixDenoiser when using Bitmaps
8ebf3af3 Remove unnecessary file
cca5791a Expose PhaseFunction of Medium with `mi.traverse()`
7e7d7b97 Upgrade to new Dr.Jit version using an abstract node-based IR
820c38e5 Remove 4th channel in envmap when using RGB variants
7bf132f6 Fix bug in envmap that would create a horizontal seam in JIT variants
10d3514a fixed HG issues in issue #455

------------- good
------------- slow again

176337c0 Update Dr.Jit(Core) with modularized LLVM backend (#447)
187da96a Fix consistent PTX order with OptiX configs
2ff5dd49 quench uninitialized variable warning in scalar mode
6a52f5bb fix uninitialized variable in xml.cpp, minor formatting
9dbb48e0 minor simplifications, indentation
f15a6387 Ensure that scenes loaded in parallel don't break kernel caching
1847bab2 plugin.cpp: Flush side effects after plugin construction
f4c0db0b parallel scene loading, fix GIL issue
09a7bf01 Quench warnings on GCC
48c14a70 Major Dr.Jit update
4371f4ef Coalesce ImageBlock writes in CUDA/OptiX mode (#413)
f3ac81bc Spectral mode: fix accidental conversion to double
9bd47c76 Polarization: fix typo in magnetic field naming
3123b80d Update drjit submodule
8c90a87d Add registry function to ArrayPtr bindings

------------- Already fixed: Critical Dr.Jit compiler failure: jit_optix_compile(): optixModuleGetCompilationState() indicates that the compilation did not complete succesfully. The module's compilation state is: 0x2363
------------- still faster

cf456d7f Update DrJit submodule
4cd55858 Fix Python binding for scene.sensors()
bdce9509 Add missing Python bindings for Shape & ShapePtr
199b607d parallelize build matrix further
009193a7 adaptations for Python 3.11
94179a42 added PyPI badge

------------- Already fixed: Critical Dr.Jit compiler failure: jit_optix_compile(): optixModuleGetCompilationState() indicates that the compilation did not complete succesfully. The module's compilation state is: 0x2363
------------- faster

ac1201a5 Patch release v3.1.1
a8e69898 fixed limits for long multi-pass renderings
204d89b2 Minor release v3.1.0
deb5f644 [GHA] Remove authentication to checkout repository during wheel build (repository is now public)
a9277985 Increase PyPI version of DrJit, update drjit submodule
b0458bb6 Update release notes
df56a290 Update drjit (remove predicates)
b9a55c62 Minor fix in HDRFilm::develop()
c55e9596 Lower required Nvidia GPU compute capability
90b52dd6 Set correct CUDA contexts before OptiX operations when necessary
8ba8528a deprecated samples_per_pass parameter, removed oldpath.cpp
47a3958e Fix tests in test_cylinder and test_rectangle
7d846c60 Upgrade drjit submodule
5c071927 Revert "Remove make_opaque() in plugins"
b9e95b20 quench warning on recent Clang versions
e7cc1274 minor: quench warning on MSVC
d8db806a fix progress bar on windows
af06e8a9 Support large launches (#394)
28660f3a obj.cpp: don't use memory-mapped I/O on windows
56ea9100 fixed a few issues detected via MSVC warnings
c2285ad8 Update drjit submodule
119e8ffd Use BSDF::eval_pdf_sample() in path.cpp
9f2ec267 Update drjit submodule

------------- good

To me, this indicates that the current error was introduced in 16c8d2ae3d5e22d4bc4db493000f23de50d9a0dc. However, from my newbie perspective, it's possible that 16c8d2ae3d5e22d4bc4db493000f23de50d9a0dc just altered the symptom, and the actual error was introduced earlier and is related to Critical Dr.Jit compiler failure: jit_var(r50688206): unknown variable!. @njroussel It would be great, if you could make a more educated assessment in this regard.

Also, does this narrow down the error far enough, or would you like me to carry out more experiments, e.g. with respect to the drjit-core version?

Off-topic

I also made an interesting observation with regard to the speed of my optimization, where version 3.1.1 introduced a significant speed-up (~ factor 2). Unfortunately, runs with this version crashed with the aforementioned, already fixed error:

Critical Dr.Jit compiler failure: jit_optix_compile(): optixModuleGetCompilationState() indicates that the compilation did not complete succesfully. The module's compilation state is: 0x2363

Along with the fix, the speed-up went away. Just so that you know, there might be potential for a significant speed-up.

njroussel commented 1 year ago

That's somewhat good news!

I also think the commit you pointed out simply altered the symptom. From your bisection, I think this line is what causes the change in error message. This has to do with how the ray-scene intersection operation is represented in the JIT. Before digging any deeper, could you maybe try running these experiments with the LLVM variant? Hopefully, this would narrow the search. The Embree/Optix code paths are quite a bit different for these ray intersection routines.


At the time of v3.1.1 we were really looking into the performance of Mitsuba/Dr.Jit, hence the significant speedup. We don't currently have any performance regression testing. Thank you for reporting this, we'll eventually look into it.

maxfrei750 commented 1 year ago

I also think the commit you pointed out simply altered the symptom.

That would mean that the error was introduced in this range:

------------- Critical Dr.Jit compiler failure: jit_var(r50688206): unknown variable!

90f0d733 Update drjit submodule
16b133ec Fix memory leak in OptixDenoiser when using Bitmaps
8ebf3af3 Remove unnecessary file
cca5791a Expose PhaseFunction of Medium with `mi.traverse()`
7e7d7b97 Upgrade to new Dr.Jit version using an abstract node-based IR
820c38e5 Remove 4th channel in envmap when using RGB variants
7bf132f6 Fix bug in envmap that would create a horizontal seam in JIT variants
10d3514a fixed HG issues in issue #455

------------- good

Based on my currently running bisection run, I even feel fairly confident to say that the error is related to 90f0d733.

Before digging any deeper, could you maybe try running these experiments with the LLVM variant?

Based on your proposal, I'd try to run ~16b133ec~ 90f0d733 with LLVM overnight.

njroussel commented 1 year ago

I agree, either 90f0d733 or 7e7d7b97 (this commit was a huge re-factor of Dr.Jit).

maxfrei750 commented 1 year ago

Based on my currently running bisection run, I even feel fairly confident to say that the error is related to 90f0d733.

What I intended to say was that I'm currently (as we speak) testing 16b133ec, and it's looking good so far. So that would mean that 90f0d733 is to blame, which would be good IMHO, since it has less code to comb through.

maxfrei750 commented 1 year ago

Interestingly enough, I just finished the test run of 16b133ec, and it crashed on the very last iteration:

Critical Dr.Jit compiler failure: jit_assemble(): schedule contains variable r396505013 with incompatible size (49920 and 128)!

Since I found it very odd that it would crash on the last iteration, I repeated the run with just 10 iterations and lo and behold, I received a nearly identical error (just the variable name was different). So that's an entirely new behavior. The new bisection results are (only the relevant part):

------------- Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /src/mitsuba3/ext/drjit/ext/drjit-core/src/util.cpp:203.

16c8d2ae update Dr.Jit version
bd6fefe8 Correct mis_compensation docs
0fe59888 Fix several spelling errors

------------- Critical Dr.Jit compiler failure: jit_var(r50688206): unknown variable!

8ee26060 Minor release v3.2.0
71185640 Bump drjit version requirements
4d700bb2 Update release notes
dfd49fe9 Fix formatting in release notes
37b0ed05 Documentation: remove edit button, limit TOC depth on API reference page

------------- Critical Dr.Jit compiler failure: jit_var(r50688206): unknown variable!

90f0d733 Update drjit submodule

------------- Critical Dr.Jit compiler failure: jit_assemble(): schedule contains variable r396505013 with incompatible size (49920 and 128)!

16b133ec Fix memory leak in OptixDenoiser when using Bitmaps
8ebf3af3 Remove unnecessary file
cca5791a Expose PhaseFunction of Medium with `mi.traverse()`
7e7d7b97 Upgrade to new Dr.Jit version using an abstract node-based IR
820c38e5 Remove 4th channel in envmap when using RGB variants
7bf132f6 Fix bug in envmap that would create a horizontal seam in JIT variants
10d3514a fixed HG issues in issue #455

------------- good

Still, I'm going to run 90f0d733 with LLVM now and report back, with new results.

maxfrei750 commented 1 year ago

Still, I'm going to run 90f0d733 with LLVM now and report back, with new results.

Killed

That's it. :sweat_smile:

Interestingly, the optimization crashed in an earlier iteration, but (at least to me) the message does not shed more light on the cause. @njroussel Hopefully, that's different for you.

maxfrei750 commented 1 year ago

I just finished the bisection and these are the final results:

------------- 10000 iterations: Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /src/mitsuba3/ext/drjit/ext/drjit-core/src/util.cpp:203.
------------- 10 iterations: good

16c8d2ae update Dr.Jit version
bd6fefe8 Correct mis_compensation docs
0fe59888 Fix several spelling errors

------------- 10000 iterations: Critical Dr.Jit compiler failure: jit_var(r50688206): unknown variable!
------------- 10 iterations: Critical Dr.Jit compiler failure: jit_assemble(): schedule contains variable r357466 with incompatible size (49920 and 128)!

8ee26060 Minor release v3.2.0
71185640 Bump drjit version requirements
4d700bb2 Update release notes
dfd49fe9 Fix formatting in release notes
37b0ed05 Documentation: remove edit button, limit TOC depth on API reference page

------------- 10000 iterations (CUDA): Critical Dr.Jit compiler failure: jit_var(r50688206): unknown variable!
------------- 10000 iterations (LLVM): Killed
------------- 10 iterations: Critical Dr.Jit compiler failure: jit_assemble(): schedule contains variable r357466 with incompatible size (49920 and 128)!

90f0d733 Update drjit submodule
16b133ec Fix memory leak in OptixDenoiser when using Bitmaps
8ebf3af3 Remove unnecessary file
cca5791a Expose PhaseFunction of Medium with `mi.traverse()`

------------- 10000 iterations: Critical Dr.Jit compiler failure: jit_assemble(): schedule contains variable r357466 with incompatible size (49920 and 128)!

7e7d7b97 Upgrade to new Dr.Jit version using an abstract node-based IR

------------- good

To summarize:

7e7d7b97 was a major refactoring and introduced Critical Dr.Jit compiler failure: jit_assemble(): schedule contains variable r357466 with incompatible size (49920 and 128)!, which only occurs after the optimization is practically done. This means that it may run for many iterations (e.g. 10000).

90f0d733 The previous error still persists, but now it is impossible to run for 10000 iterations, because the optimization crashes with Critical Dr.Jit compiler failure: jit_var(r50688206): unknown variable! on CUDA and Killed on LLVM.

16c8d2ae Optimizations with just a few (e.g. 10) iterations work flawlessly, but longer optimizations crash with Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /src/mitsuba3/ext/drjit/ext/drjit-core/src/util.cpp:203..
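Since the symptom changes from commit to commit, labelling each run's log automatically can help keep a long bisection organized. The sketch below matches only the message formats quoted in this thread; the labels and the `classify` helper are assumptions, not part of Dr.Jit.

```python
import re

# Patterns for the failure messages quoted in this thread.
PATTERNS = [
    ("cuda_api_error",
     re.compile(r"cuda_check\(\): API error (\d+) \((\w+)\)")),
    ("unknown_variable",
     re.compile(r"jit_var\((r\d+)\): unknown variable")),
    ("incompatible_size",
     re.compile(r"jit_assemble\(\): schedule contains variable (r\d+) "
                r"with incompatible size \((\d+) and (\d+)\)")),
    ("optix_compile",
     re.compile(r"jit_optix_compile\(\)")),
    ("killed",
     re.compile(r"^Killed$", re.MULTILINE)),
]


def classify(log_text):
    """Return (label, captured_groups) for the first pattern in PATTERNS
    that matches log_text, or ("unknown", ()) if none does."""
    for label, pattern in PATTERNS:
        match = pattern.search(log_text)
        if match:
            return label, match.groups()
    return "unknown", ()
```

Note that patterns are tried in list order, so a log containing several different failures is labelled by the earliest entry in `PATTERNS`, not by the earliest message in the log.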

njroussel commented 1 year ago

Thanks for the detailed bisection!

I might have some time to look at this more in detail at the end of this week, or next week. I'm surprised we've never encountered this ourselves. I might then ask you to run a couple more tests with specific drjit-core commits, unless you can provide me with a reproducer (don't worry about it being huge).

I really hoped the LLVM version wouldn't crash...

maxfrei750 commented 1 year ago

I might have some time to look at this more in detail at the end of this week, or next week.

Thanks for caring and taking the time!

I'm surprised we've never encountered this ourselves.

I guess that it may be related to the large number of iterations I'm using and the use of an envmap that's being optimized. These two things combined result in a huge number of parameter updates. Additionally, the use of wrap_ad may cause problems. Also, I'm absolutely open to the idea that I made a fundamental mistake in the implementation. :smile:

I might then ask you to run a couple more tests with specific drjit-core commits

Gladly!

unless you can provide me with a reproducer (don't worry about it being huge).

I'd be more than happy to invite you to our private repo with the code, provide the required data and add brief instructions to reproduce the problem. If you're ok with that, then please just let me know, and I'll create the instructions and add you.

maxfrei750 commented 1 year ago

I reckon that it would be good if I gave you a brief rundown of what the reproduction would entail, so that you can make an informed decision:

  1. Clone the repo.
  2. Install the dependencies or use our docker container.
  3. Download and unzip a zip.
  4. Run a script in your command line.

maxfrei750 commented 1 year ago

@njroussel I'd like to update Mitsuba and therefore drjit, because of https://github.com/mitsuba-renderer/mitsuba3/commit/e3a9a3a3bdecb3c364d604db4fc3a674c057dd6d. Is there a chance that this issue here has been alleviated in the meantime?

merlinND commented 8 months ago

Hello @maxfrei750,

Thank you for the huge efforts you put into investigating this bug. Could you please try running your test again with the latest Mitsuba master, which includes a fix for a similar-sounding bug? (https://github.com/mitsuba-renderer/drjit-core/pull/78)