mitsuba-renderer / mitsuba3

Mitsuba 3: A Retargetable Forward and Inverse Renderer
https://www.mitsuba-renderer.org/

Why is the optimizer step so slow? #337

Open DoeringChristian opened 1 year ago

DoeringChristian commented 1 year ago

Hi, I have a large scene with 5868 parameters that I want to optimize. I can render it and compute the gradient of each parameter using dr.backward without any problem. When the optimizer calls dr.eval in step, however, Dr.Jit takes a long time to complete a single step. The issue persists even when using SGD without momentum, or when implementing the update myself. Shouldn't the optimization step be the computationally cheapest part if the gradients have already been evaluated, or did I get something wrong?

wjakob commented 1 year ago

There are too few details: what is the setup, what does your optimization loop look like? Which integrator are you using?

In general, backpropagation through a rendering algorithm involves an entirely separate simulation that is more costly than the original primal simulation (it needs to compute derivatives of various quantities and write them to memory in addition to sampling light paths).
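For illustration, here is a minimal sketch of the two passes (the scene, the parameter key and the choice of the prb integrator are only example assumptions, not details from your setup):

import drjit as dr
import mitsuba as mi

mi.set_variant('llvm_ad_rgb')

scene = mi.load_dict(mi.cornell_box())
params = mi.traverse(scene)

key = 'red.reflectance.value'  # example parameter, assumed for this sketch
dr.enable_grad(params[key])
params.update()

# Primal pass: a single light-transport simulation
integrator = mi.load_dict({'type': 'prb'})
img = mi.render(scene, params, integrator=integrator, spp=16)

# Adjoint pass: a second, typically more expensive simulation that
# propagates derivatives of the loss back to the scene parameters
loss = dr.sum(dr.abs(img))
dr.backward(loss)
print(dr.grad(params[key]))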

DoeringChristian commented 1 year ago

Thanks for the quick response. I'm trying to implement this paper. I wrote an integrator that is only differentiable with respect to the parameters of the first bounce, $L_1 + f_1 \cdot L_2$, where $L_1$ and $f_1$ are differentiable but $L_2$, the radiance from further bounces, is not. My scene has 1956 meshes, all of which are emitters. I use the principled BSDF but only keep the emission, roughness, specular and base_color parameters for optimization. Each object's base_color is a 10x10 texture and the other values are plain scalar/color values. I have implemented a texture-space integrator as described in the paper. Camera rays are generated with gradients disabled, since I have to generate multiple batches of rays and concatenate them. The loss function is currently a simple MSE loss over all sample points. I have tried it with the Mitsuba render function, which integrates twice as you mentioned. I also tried bypassing the render function, which should be possible here since the discontinuities don't depend on the parameters if the BSDF is smooth; the result was the same. The optimization loop is similar to that of the example; a sketch of the first-bounce idea follows after it.

params = mi.traverse(scene)
params.keep(r".*?((\.bsdf\.base_color\.value)|(\.bsdf\.specular)|(\.bsdf\.roughness\.value)|(\.emitter\.radiance\.value))")

opt = mi.ad.Adam(lr=0.1, params=params)

for it in range(n):
    for key, _ in params:
        opt[key] = dr.clamp(opt[key], 0., 1.)
    for key, _ in params:
        params[key] = opt[key]

    img, projected = render(scene, sensor, integrator, params, it)

    ref = mi.Color3f(refimg.eval(mi.Point3f(projected.x, projected.y))[0:3])

    loss = lossfn(img, ref)

    dr.backward(loss)

    print(loss)
    opt.step() # this step takes very long.
    print("This gets not printed for long time.")

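For reference, the detaching of everything past the first bounce happens roughly along these lines (a heavily simplified, hypothetical sketch rather than my actual implementation; the class name, the inner path integrator and the sampling details are placeholders):

import drjit as dr
import mitsuba as mi

mi.set_variant('llvm_ad_rgb')

class FirstBounceADIntegrator(mi.SamplingIntegrator):
    # Only the first bounce (emission L1 and BSDF weight f1) stays attached
    # to the AD graph; the remaining radiance L2 is detached.
    def __init__(self, props=mi.Properties()):
        super().__init__(props)
        # Inner integrator used to estimate the (non-differentiable) tail L2
        self.inner = mi.load_dict({'type': 'path', 'max_depth': 8})

    def sample(self, scene, sampler, ray, medium=None, active=True):
        si = scene.ray_intersect(ray, active)
        bsdf = si.bsdf(ray)

        # Differentiable first-bounce emission L1
        L1 = si.emitter(scene).eval(si, active)

        # Sample an outgoing direction; the BSDF weight f1 stays differentiable
        bs, f1 = bsdf.sample(mi.BSDFContext(), si,
                             sampler.next_1d(active), sampler.next_2d(active),
                             active)

        # Estimate the tail radiance L2 and cut it out of the AD graph
        ray2 = si.spawn_ray(si.to_world(bs.wo))
        L2, _, _ = self.inner.sample(scene, sampler, ray2, medium, active)

        return L1 + f1 * dr.detach(L2), si.is_valid(), []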
I can backpropagate and even read out the gradient values of the parameters before the optimization step. When setting the log level of Dr.Jit to Debug, I get a long output, but I don't quite understand what Dr.Jit is doing; in theory this step should just update the values. Thanks for your help.

Speierers commented 1 year ago

You need to keep in mind that Dr.Jit will postpone all kernel compilation & evaluation until it can no longer do so, e.g. when the user calls dr.eval(). In your case, the dr.eval() in Optimizer.step() likely performs the compilation and evaluation of the backward rendering kernel as well, which can be expensive. My point is that this line of code likely doesn't only compute the optimizer's update rule, but much more than that.
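As a tiny standalone illustration of this lazy evaluation (using the LLVM backend as an example):

import drjit as dr

x = dr.arange(dr.llvm.Float, 1000000)  # no kernel has been launched yet
y = x * 2 + 1                          # still only traced, not executed
dr.eval(y)                             # compilation + execution happen here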

To verify this, you can explicitly evaluate the result of the backward rendering routine before calling opt.step(), e.g. by calling dr.eval(params). The dr.eval() in Optimizer.step() will then only perform the update rule, which should be pretty cheap.
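In your loop, the placement would roughly be (sketch):

dr.backward(loss)

dr.eval(params)  # force the backward-rendering kernels to compile & run here

opt.step()       # should now only execute the optimizer's update rule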

DoeringChristian commented 1 year ago

I think I understand now. But strangely the compilation still seems to happen at the optimizer step. I tried the following:

dr.eval(params)
dr.eval(opt)
for k, p in params:
    dr.eval(p)
for k, p in params:
    dr.eval(opt[k])
for k, p in params:
    dr.schedule(p)
dr.eval()
for k, p in params:
    dr.schedule(opt[k])
dr.eval()

Speierers commented 1 year ago

Did you try the following?

for k, v in opt.items():
    dr.schedule(v)
dr.eval()

DoeringChristian commented 1 year ago

It still hangs in the optimization step.

Speierers commented 1 year ago

Actually it is important that you evaluate those values as well as their gradient values. Could you try the following:

for k, v in opt.items():
    dr.schedule(v, dr.grad(v))
dr.eval()

DoeringChristian commented 1 year ago

Unfortunately it is still happening when evaluating the gradients too. I have also tested this:

values = {}
for k, v in opt.items():
    print(f"Evaluating {k}...")
    g = dr.grad(v)
    values[k] = dr.detach(v) - 0.01 * g
    dr.schedule(values[k])
dr.eval()
opt.step()

In that case the compilation seems to happen at the first dr.eval(), so maybe in your example the variables just went out of scope before being evaluated. I also tried implementing the optimizer myself and calling dr.eval() for every parameter, which seems to have fixed the issue:

for k, v in opt.items():
    g = dr.grad(v)
    value = dr.detach(v) - 0.01 * g
    dr.schedule(value)
    dr.eval()
    opt[k] = value
    dr.enable_grad(opt[k])

And in the Adam optimizer, by moving this line into the for loop. Do you know why this could be? Is it perhaps more efficient to compile multiple smaller kernels?

Speierers commented 1 year ago

Does this behavior still hold when using the mi.render() function? I would like to try this on my end if possible.

DoeringChristian commented 1 year ago

Yes, interestingly it is the same with and without mi.render(). I also tested whether the optimization result is the same for both methods, and it seems that way. I tested it using the basic Cornell box optimization example, but with both the red and green walls. Optimizing with the original Mitsuba Adam optimizer: mi.webm. Optimizing with the modified Adam optimizer: own.webm. In this case the performance difference is of course negligible.

Speierers commented 1 year ago

On my side, most of the time is spent in mi.render and dr.backward when using the following script:

import drjit as dr
import mitsuba as mi

mi.set_variant('llvm_ad_rgb')

scene = mi.load_dict(mi.cornell_box())

ref = mi.render(scene)

params = mi.traverse(scene)
params.keep(r".*?((\.bsdf\.base_color\.value)|(\.bsdf\.specular)|(\.bsdf\.roughness\.value)|(\.emitter\.radiance\.value))")

opt = mi.ad.Adam(lr=0.1, params=params)

for it in range(4):
    print(f'iteration {it} ----')

    for key, _ in params:
        opt[key] = dr.clamp(opt[key], 0., 1.)
    for key, _ in params:
        params[key] = opt[key]

    print(f'  render ...')
    img = mi.render(scene, params, spp=1024)
    print(f'  loss ...')
    loss = dr.sum(dr.abs(img - ref))
    print(f'  backward ...')
    dr.backward(loss)
    print(f'  step ...')
    opt.step()
    print("   done.")

Could you check that you can reproduce this "normal" behavior on your side as well?

DoeringChristian commented 1 year ago

It seems that 1024 samples per pixel are a bit much for my PC (32 GB RAM) and the program terminates. I tried it with spp=512, and yes, most of the time is spent in the backward step. I think the issue only occurs when many parameters are being optimized.