mitsuba-renderer / drjit

Dr.Jit — A Just-In-Time-Compiler for Differentiable Rendering
BSD 3-Clause "New" or "Revised" License

Dr.Jit possibly overflows some OptiX or CUDA internal data structure #210

Open · futscdav opened this issue 6 months ago

futscdav commented 6 months ago

I've been running into Program hit CUDA_ERROR_ILLEGAL_ADDRESS (error 700) due to "an illegal memory access was encountered" when trying to optimize some larger cases. I don't have a simple reproducer, since my working code is rather large. This could be related to issue #125.

Running under CUDA_LAUNCH_BLOCKING=1 and compute-sanitizer, I managed to gather some context on what might be happening. In the resulting stack trace, the top frames are as follows:

========= Program hit CUDA_ERROR_ILLEGAL_ADDRESS (error 700) due to "an illegal memory access was encountered" on CUDA API call to cuEventRecord.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x63a44b]
=========                in /usr/lib/x86_64-linux-gnu/libnvoptix.so.1
=========     Host Frame:jitc_optix_launch(ThreadState*, Kernel const&, unsigned int, void const*, unsigned int) [0x6f6b1968]
=========                in .../my_opt_problem
=========     Host Frame:jitc_run(ThreadState*, ScheduledGroup) [0x6f68d193]
=========                in .../my_opt_problem
=========     Host Frame:jitc_eval(ThreadState*) [0x6f68daa8]
=========                in .../my_opt_problem
=========     Host Frame:jitc_var_gather(unsigned int, unsigned int, unsigned int) [0x6f65bcb0]
=========                in .../my_opt_problem
=========     Host Frame:jit_var_gather [0x6f6aba4b]
...

This happens after roughly 320 forward/backward passes and is 100% reproducible with my setup; it does not occur randomly. Anecdotally, the total number of compiled OptiX kernel ops across those passes (as reported at dr.LogLevel.Info alongside the cache misses) is a little over 8 million, which could be interesting or could be completely coincidental. The reason I think some internal data structure is overflowing is that if I manually call

dr.sync_thread()
dr.flush_kernel_cache()
dr.flush_malloc_cache()  # Edit: this is not required for the error to disappear.
dr.sync_thread()

after each optimization step, the error seems to disappear. If I'm right, a reproducer could be as simple as something that triggers a large number of kernel compilations that add up over time.
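
To be concrete, here is a minimal sketch of where I place these calls; step() and the iteration count are hypothetical stand-ins for my actual render/backward/update code, not something from my real script:

import drjit as dr

def step():
    # hypothetical placeholder for one optimization step (render, backward pass, parameter update)
    pass

for it in range(1000):           # placeholder iteration count
    step()
    dr.sync_thread()             # wait for all queued kernels to finish
    dr.flush_kernel_cache()      # dropping the compiled kernels is what makes the error go away
    dr.sync_thread()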

Relatedly, is there a writeup of when to expect cache misses? So far I've observed that if the geometry in the scene changes, I get a cache miss.
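
To illustrate, a rough sketch of how I observe this, assuming scene, params, and new_verts are set up as in the reproducer I post below (so this fragment is not runnable on its own):

dr.set_log_level(dr.LogLevel.Info)            # kernel compilations / cache misses show up at this level

image = mi.render(scene, params, spp=4)       # first render: cache miss (fresh kernel)
image = mi.render(scene, params, spp=4)       # unchanged scene: cache hit

params['shape.vertex_positions'] = new_verts  # change the geometry ...
params.update()
image = mi.render(scene, params, spp=4)       # ... and I see a cache miss again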

futscdav commented 6 months ago

OK, I managed to simplify the problem into a self-contained snippet. This code runs into the issue after 1091 iterations on a 4090 with CUDA 12.3, which takes about 20 minutes to reproduce.

import drjit as dr
import mitsuba as mi
import numpy as np

mi.set_variant("cuda_ad_rgb")

def dr_backward(output, output_grad):
    # Seed the output gradient and propagate it backward through the AD graph
    dr.set_grad(output, output_grad)
    dr.enqueue(dr.ADMode.Backward, output)
    dr.traverse(output, dr.ADMode.Backward)

dr.set_log_level(dr.LogLevel.Info)
vert_count = 40_000
face_count = 38_000

verts = np.random.uniform(0, 1, [vert_count, 3])
faces = np.random.randint(0, vert_count, [face_count, 3])

mesh = mi.Mesh(
    'mesh',
    vertex_count=vert_count,
    face_count=face_count,
    has_vertex_normals=True,
    has_vertex_texcoords=True,
    props=mi.Properties(),
)
mesh_params = mi.traverse(mesh)
mesh_params['vertex_positions'] = np.ravel(verts)
mesh_params['faces'] = np.ravel(faces)
mesh_params.update()

scene = mi.load_dict({
    'type': 'scene',
    'integrator': {'type': 'direct'},
    'emitter': {'type': 'constant'},
    'shape': mesh,
    'sensor': {
        'type': 'perspective',
        'to_world': mi.ScalarTransform4f.look_at(
            [0, 0, 5], [0, 0, 0], [0, 1, 0]
        ),
        'film': {
            'type': 'hdrfilm',
            'width': 512,
            'height': 512,
            'pixel_format': 'rgb',
        },
    },
})

params = mi.traverse(scene)
for i in range(10000):
    print(f'Iteration {i}')
    # Fresh differentiable vertex positions every iteration
    new_verts = mi.Float(np.ravel(np.random.uniform(0, 1, [vert_count, 3])))
    dr.enable_grad(new_verts)
    params['shape.vertex_positions'] = new_verts
    params.update()
    image = mi.render(scene, params, spp=4)
    grad_image = np.random.uniform(0, 1e-4, image.shape)
    dr_backward(image, grad_image)
    dr.grad(new_verts)  # fetch the gradient of the vertex positions

futscdav commented 6 months ago

Somewhat related: mitsuba-renderer/mitsuba3#1033 is what causes the recompilation on each iteration here, but the program nevertheless shouldn't crash. Using the snippet in that issue, the problem can be reproduced slightly faster: about 10 minutes, or 3300 iterations.

futscdav commented 6 months ago

After reading through some Mitsuba issues, I suspect this is also the cause of mitsuba-renderer/mitsuba3#703.

merlinND commented 5 months ago

Hello @futscdav,

Thank you for reporting this bug and posting a reproducer. Could you please try running your test again with the latest Mitsuba master, which includes a fix for a similar-sounding crash? (https://github.com/mitsuba-renderer/drjit-core/pull/78)
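
If helpful, a quick way to double-check which builds the test runs against, assuming your install exposes the usual version attributes:

import drjit as dr
import mitsuba as mi

print('Dr.Jit:', dr.__version__)    # version string of the installed Dr.Jit
print('Mitsuba:', mi.__version__)   # version string of the installed Mitsuba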