futscdav opened 6 months ago
Ok, I managed to simplify the problem into a self-contained snippet. This code runs into the issue after 1091 iterations on a 4090 with CUDA 12.3, which takes about 20 minutes to reproduce.
```python
import drjit as dr
import mitsuba as mi
import numpy as np

mi.set_variant("cuda_ad_rgb")

def dr_backward(output, output_grad):
    dr.set_grad(output, output_grad)
    dr.enqueue(dr.ADMode.Backward, output)
    dr.traverse(output, dr.ADMode.Backward)

dr.set_log_level(dr.LogLevel.Info)

vert_count = 40_000
face_count = 38_000
verts = np.random.uniform(0, 1, [vert_count, 3])
faces = np.random.randint(0, vert_count, [face_count, 3])

mesh = mi.Mesh(
    'mesh',
    vertex_count=vert_count,
    face_count=face_count,
    has_vertex_normals=True,
    has_vertex_texcoords=True,
    props=mi.Properties(),
)
mesh_params = mi.traverse(mesh)
mesh_params['vertex_positions'] = np.ravel(verts)
mesh_params['faces'] = np.ravel(faces)
mesh_params.update()

scene = mi.load_dict({
    'type': 'scene',
    'integrator': {'type': 'direct'},
    'emitter': {'type': 'constant'},
    'shape': mesh,
    'sensor': {
        'type': 'perspective',
        'to_world': mi.ScalarTransform4f.look_at(
            [0, 0, 5], [0, 0, 0], [0, 1, 0]
        ),
        'film': {
            'type': 'hdrfilm',
            'width': 512,
            'height': 512,
            'pixel_format': 'rgb',
        },
    },
})
params = mi.traverse(scene)

for i in range(10000):
    print(f'Iteration {i}')
    new_verts = mi.Float(np.ravel(np.random.uniform(0, 1, [vert_count, 3])))
    dr.enable_grad(new_verts)
    params['shape.vertex_positions'] = new_verts
    params.update()
    image = mi.render(scene, params, spp=4)
    grad_image = np.random.uniform(0, 1e-4, image.shape)
    dr_backward(image, grad_image)
    dr.grad(new_verts)
```
Somewhat related: mitsuba-renderer/mitsuba3#1033 is what causes the recompilation on each iteration here, but the program nevertheless shouldn't crash. Using the snippet in that issue, the problem can be reproduced slightly faster: in about 10 minutes, or 3300 iterations.
Reading through some mitsuba issues, this is likely also the cause of mitsuba-renderer/mitsuba3#703.
Hello @futscdav,
Thank you for reporting this bug and posting a reproducer.
Could you please try running your test again with the latest Mitsuba master, which includes a fix for a similar-sounding crash? (https://github.com/mitsuba-renderer/drjit-core/pull/78)
I've been running into

```
Program hit CUDA_ERROR_ILLEGAL_ADDRESS (error 700) due to "an illegal memory access was encountered"
```

when trying to optimize some larger cases. I don't have simple code to reproduce this, since my working code is rather large. This could be related to issue #125.

Running under `CUDA_LAUNCH_BLOCKING=1` and `compute-sanitizer`, I've managed to produce a little bit of context on what might be happening. In the provided stack trace, the top frames are as follows: …

This happens after roughly ~320 forward/backward passes, and it is 100% reproducible with my setup; it's not randomly occurring. Anecdotally, the total number of compiled OptiX kernel ops in those passes (as reported by `dr.LogLevel.Info` with cache misses) is a small smidge over 8 million, which could be interesting or could be completely coincidental. The reason I think it's overflowing some internal data structure is that if I manually call `…` after each optimization step, the error seems to disappear. If I'm right, a reproducing case could be as easy as writing something that indeed causes a large number of compilations to add up.
Relatedly, is there a write-up of when to expect cache misses? So far I've observed that if the geometry in the scene changes, I get a cache miss.