[🐛 bug report] render_torch function crashes after a number of iterations

ligoudaner377 commented 3 years ago

Summary

Hi, thank you for your great work. As in #74, I am trying to use the Mitsuba 2 renderer as one large node inside a PyTorch pipeline, but with some translucent objects in the scene. However, I get a RuntimeError after a number of iterations.

System configuration

Platform: Ubuntu 18.04
Compiler: clang 9.0.0-2 cmake 3.10.2 ninja 1.10.2
Python version: 3.8.5
Mitsuba 2 version: latest version
Compiled variants:
- scalar_rgb
- gpu_autodiff_rgb
GPU: RTX3090
nvcc: 11.0
pytorch: 1.8.1

Description

For simplicity, I wrote a test script like this:

i = 0
while True:
    params_torch[param_name] = torch.rand((1, 3), requires_grad=True).to(device)
    image = render_torch(scene, params=params, spp=1, **params_torch)
    i+=1
    print(i, end='|')

Here I use PyTorch to generate three random floats to simulate the output of a neural network. Because the scene contains translucent objects, the integrator is "volpathmis". After about 1000 iterations, I get a RuntimeError like this:

1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|
......
|1259|1260|1261|1262|1263|1264|1265|1266|1267|1268|1269|1270|1271|1272|1273|1274|render_torch(): critical exception during forward pass: cuda_trace_append(): arithmetic involving uninitialized variable!
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-e1784109b9e1> in <module>
      2 while True:
      3     params_torch[param_name] = torch.clamp(torch.rand((1, 3), requires_grad=True).to(device), 0.0, 1.0)
----> 4     image = render_torch(scene, params=params, spp=1, **params_torch)
      5     i+=1
      6     print(i, end='|')

~/anaconda3/lib/python3.8/site-packages/mitsuba/python/autodiff.py in render_torch(scene, params, **kwargs)
    475         ns['render_torch_helper'] = render_torch
    476 
--> 477     result = render_torch(scene, params,
    478                           *[num for elem in kwargs.items() for num in elem])
    479 

~/anaconda3/lib/python3.8/site-packages/mitsuba/python/autodiff.py in forward(ctx, scene, params, *args)
    452                     print("render_torch(): critical exception during "
    453                           "forward pass: %s" % str(e))
--> 454                     raise e
    455 
    456             @staticmethod

~/anaconda3/lib/python3.8/site-packages/mitsuba/python/autodiff.py in forward(ctx, scene, params, *args)
    440                             ek.set_requires_gradient(v)
    441 
--> 442                     ctx.output = render(scene, spp=spp[1],
    443                                         sensor_index=sensor_index)
    444 

~/anaconda3/lib/python3.8/site-packages/mitsuba/python/autodiff.py in render(scene, spp, unbiased, optimizer, sensor_index)
    190             raise Exception('render(): unbiased=False requires that spp '
    191                             'is either an integer or None!')
--> 192         image = _render_helper(scene, spp=spp, sensor_index=sensor_index)
    193 
    194     return image

~/anaconda3/lib/python3.8/site-packages/mitsuba/python/autodiff.py in _render_helper(scene, spp, sensor_index)
     40     )
     41 
---> 42     spec, mask, aovs = scene.integrator().sample(scene, sampler, rays)
     43     spec *= weights
     44     del mask

RuntimeError: cuda_trace_append(): arithmetic involving uninitialized variable!
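
To pin down when the failure occurs, the test loop can be wrapped in a small harness that records the iteration count and the exception instead of letting the notebook cell abort. This is only a minimal sketch: `run_until_failure` and `fake_render_step` are hypothetical names that are not part of Mitsuba, and the stand-in step merely simulates a failure so the snippet runs without a GPU.

```python
from typing import Callable, Optional, Tuple


def run_until_failure(
    step: Callable[[int], object], max_iters: int
) -> Tuple[int, Optional[BaseException]]:
    """Call step(i) for i = 1..max_iters and stop at the first RuntimeError.

    Returns (number of successful iterations, the exception or None).
    """
    for i in range(1, max_iters + 1):
        try:
            step(i)
        except RuntimeError as exc:
            return i - 1, exc
    return max_iters, None


# Stand-in for the real render call (assumption: in the actual test this
# would invoke render_torch(scene, params=params, spp=1, **params_torch)).
def fake_render_step(i: int) -> None:
    if i == 5:
        raise RuntimeError("simulated failure")


completed, error = run_until_failure(fake_render_step, max_iters=10)
print(f"completed {completed} iterations, error: {error}")
```

Comparing the recorded iteration count across several runs may help show whether the failure always happens at the same point or varies from run to run.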

Observation

- GPU memory usage did not keep increasing over the iterations.
- Deleting the translucent object and switching the integrator to "path" solves the problem.
- Trying different "max_depth" (6, 16, 32, -1) and "spp" (1, 4, 8) values did not help.
- Using another scene file with translucent objects did not help.
- Switching the integrator to "volpath" did not help.
- Setting cuda_malloc_trim=True did not help.

Steps to reproduce

- Download this test file.
- Compile Mitsuba 2 with the gpu_autodiff_rgb variant.
- Run the test_render_torch.ipynb file in a Jupyter notebook.

Speierers commented 3 years ago

Hi @ligoudaner377 ,

> switch the integrator to "path" can solve the problem

Which integrator are you using currently?

This error message means that the JIT compiler is asked to perform computations with a variable that hasn't been initialized. For instance:

Float a; // uninitialized
Float b = 4.f;  // initialized
SurfaceInteraction3f si;  // uninitialized fields
SurfaceInteraction3f si_2 = zero<SurfaceInteraction3f>();  // initialized fields

I would recommend compiling the system in DEBUG mode and running your experiment in a debugger after adding a breakpoint here. This way you should be able to figure out which variable is uninitialized and fix it in the code. You can also post the stack trace here so I can help you debug this.

ligoudaner377 commented 3 years ago

Hi @Speierers, thank you for your quick reply. I'm currently using the "volpathmis" integrator.

But is there a debugger or IDE that can debug Python and C++ at the same time? My test script is written in Python, while Enoki is written in C++.

Speierers commented 3 years ago

The current implementation of volpathmis hasn't been tested much, so this could well be an implementation bug.

I'm not sure about an IDE. I would simply start debugging in Python first, and then move to C++ if the JIT operation happens within C++.
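
As a rough illustration of starting on the Python side, the innermost Python frame of the exception can be extracted with the standard `traceback` module; that frame marks where execution hands over to compiled code. In the traceback posted above, it is `_render_helper`, at the call to `scene.integrator().sample(...)`. The helper `deepest_python_frame` and the stand-in functions below are hypothetical and only simulate the situation. Printing the process ID is one possible way to attach a native debugger to the running interpreter afterwards (for example, gdb can attach to a process by PID).

```python
import os
import traceback


def deepest_python_frame(exc: BaseException) -> traceback.FrameSummary:
    """Return the innermost Python frame recorded in the exception's traceback."""
    frames = traceback.extract_tb(exc.__traceback__)
    return frames[-1]


def native_call() -> None:
    # Stand-in for a call that crosses into compiled code (assumption:
    # in the real case this would be the integrator's sample() call).
    raise RuntimeError("simulated native error")


def outer() -> None:
    native_call()


try:
    outer()
except RuntimeError as exc:
    frame = deepest_python_frame(exc)
    print(f"last Python frame: {frame.name} (line {frame.lineno})")
    print(f"process id for attaching a native debugger: {os.getpid()}")
```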

mitsuba-renderer / mitsuba2