Extremely long per-iteration jit compilation in PyTorch/Drjit interoperability

WeiPhil commented 1 year ago

Hi, I'm trying to build a pipeline to optimise a small Pytorch MLP using mitsuba-generated samples (non-AD) and I'm facing some major performance issues on the first execution of the script, subsequent runs are fast. I therefore suspect the jit compilation to be run independently for each iteration instead of reusing recorded loops.

Here is a reproduction case :

import torch
import torch.nn as nn
import torch.nn.functional as F
import time
import drjit as dr
import mitsuba as mi
import numpy as np
mi.set_variant('cuda_rgb')

import matplotlib.pyplot as plt

bsdf = mi.load_dict({
    'type': 'diffuse',
    'reflectance': {
        'type': 'rgb',
        'value': [0.2, 0.25, 0.7]
    }
})

def sph_to_dir(theta, phi):
    """Map spherical to Euclidean coordinates"""
    st, ct = dr.sincos(theta)
    sp, cp = dr.sincos(phi)
    return mi.Vector3f(cp * st, sp * st, ct)

class MLP(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()

        self.input_fc = nn.Linear(input_dim, 32)
        self.hidden_fc = nn.Linear(32, 32)
        self.output_fc = nn.Linear(32, output_dim)

    def forward(self, x):
        x = x.torch()
        batch_size = x.shape[0]
        x = x.view(batch_size, -1)
        h_1 = F.relu(self.input_fc(x))
        h_2 = F.relu(self.hidden_fc(h_1))
        y_pred = self.output_fc(h_2)
        return y_pred

device = torch.device("cuda")

n_input_dims = 3
n_output_dims = 3

model = MLP(n_input_dims,n_output_dims)

if 'cuda' in mi.variant():
    model = model.cuda()

batch_size = 2**10
total_steps = 1000
interval = 100
prev_time = time.perf_counter()

sampler = mi.PCG32(size=batch_size)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

wi_samples = dr.repeat(sph_to_dir(0,0),batch_size)

for i in range(total_steps):

    sampler.seed(size=batch_size,initstate=i)

    wo_samples = mi.warp.square_to_uniform_sphere(mi.Vector2f(sampler.next_float32(),sampler.next_float32()))
    si = dr.zeros(mi.SurfaceInteraction3f,np.array(wo_samples).shape[0])
    si.wi = wi_samples
    bsdf_context = mi.BSDFContext()
    targets = bsdf.eval(bsdf_context,si,wo_samples)

    output = model(wo_samples)

    relative_l2_error = (output - targets.torch().to(output.dtype))**2 / (output.detach()**2 + 0.01)
    loss = relative_l2_error.mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if i % interval == 0:
        loss_val = loss.item()
        torch.cuda.synchronize()
        elapsed_time = time.perf_counter() - prev_time
        print(f"Step#{i}: loss={loss_val} time={int(elapsed_time*1000)}[ms]")

        prev_time = time.perf_counter()

Am I missing a critical component to make this more efficient or is there a known workaround? I also tried the approach described by @merlinND in #125 to eval and sync threads before pytorch computation but without any success.

System configuration

OS: Windows 10 CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz GPU: Nvidia RTX A6000 Python: 3.9.5 NVidia driver: 516.01

Dr.Jit: 0.4.1 Mitsuba: 3.2.1 Is custom build? False Variants: scalar_rgb cuda_rgb cuda_ad_rgb

njroussel commented 1 year ago

Hi @WeiPhil

To be clear, there are absolutely no recorded loops in your snippet. DrJit will only record it's own loops and cannot do the same for "normal" Python loops (see documentation).

I quickly ran your code. You are correct, currently every iteration of your optimization loop produces different kernels, meaning each iteration needs to compile a new kernel. As you've seen, any re-execution will re-use the kernels of the first run. (All kernels are located in ~/.drjit if you ever want to manually clear it).

I'd recommend using dr.set_log_level(dr.LogLevel.Info). This will show you every kernel launch and will even tell you if there was cache hit/miss.

A few quick hints: 1) si = dr.zeros(mi.SurfaceInteraction3f,np.array(wo_samples).shape[0]) just use batch_size. You're copying GPU memory to CPU just to get the shape. (This produces a kernel) 2) Every iteration depends on a different seed, and is most likely creating a different kernel because of it. Try using the independent sampler. It encapsulates some nice tricks to avoid kernels being different just because of a different seed.

Basically you want to try to reduce your number of kernels per iteration and absolutely make sure that at least the "large" ones are always identical to avoid re-compiling large kernels.

WeiPhil commented 1 year ago

Hi @njroussel Thanks for the quick answer and clarifications! I added your suggested changes but I didn't see any major speed improvement. I suppose that as soon as we work with components outside of drjit (here pytorch) and can't use mi.Loop we run into performance issues like these ? Would moving the implementation to C++/cuda directly help performance to a certain extent? Best, Philippe

njroussel commented 1 year ago

Ok, maybe we don't have the same definition of "major" speed improvements.

Your initial script on its first execution (I deleted ~/.drjit):

Step#0: loss=0.5709309577941895 time=828[ms]
Step#100: loss=0.00042789752478711307 time=4975[ms]
Step#200: loss=0.00014488480519503355 time=4981[ms]
Step#300: loss=8.426234126091003e-05 time=4983[ms]
Step#400: loss=5.341646829037927e-05 time=4969[ms]
Step#500: loss=4.1911815060302615e-05 time=4964[ms]
Step#600: loss=3.170070340274833e-05 time=4992[ms]
Step#700: loss=2.749029044935014e-05 time=4985[ms]
Step#800: loss=3.887559796567075e-05 time=5045[ms]
Step#900: loss=1.529646760900505e-05 time=4973[ms]

Re-executing it right after

Step#0: loss=0.5033278465270996 time=772[ms]
Step#100: loss=0.0005800747312605381 time=383[ms]
Step#200: loss=0.00020354261505417526 time=293[ms]
Step#300: loss=0.00013471630518324673 time=309[ms]
Step#400: loss=8.440863894065842e-05 time=292[ms]
Step#500: loss=6.424191815312952e-05 time=294[ms]
Step#600: loss=5.3118659707251936e-05 time=295[ms]
Step#700: loss=4.745090700453147e-05 time=295[ms]
Step#800: loss=3.9088419725885615e-05 time=372[ms]
Step#900: loss=3.5130295145791024e-05 time=300[ms]

So this is our baseline speedup - here every single kernel should be read from the cache, there is no compilation.

Finally, here are the results after applying the suggested changes (again, I cleared the cache ~/.drjit before running it):

Step#0: loss=0.7939680814743042 time=823[ms]
Step#100: loss=0.00035265518818050623 time=250[ms]
Step#200: loss=0.00011982768046436831 time=240[ms]
Step#300: loss=6.591202691197395e-05 time=241[ms]
Step#400: loss=4.7509485739283264e-05 time=238[ms]
Step#500: loss=3.653716339613311e-05 time=241[ms]
Step#600: loss=2.7595277060754597e-05 time=260[ms]
Step#700: loss=2.0491474060690962e-05 time=240[ms]
Step#800: loss=1.6189884263440035e-05 time=239[ms]
Step#900: loss=1.2601549315149896e-05 time=240[ms]

The first iteration should be slower as that one needs to be compiled, but every iteration after it should use a cached kernel.

Here's the full diff:

diff --git a/original.py b/reproducer.py
index f967f24..f44d6d6 100644
--- a/original.py
+++ b/reproducer.py
@@ -55,22 +55,25 @@ total_steps = 1000
 interval = 100
 prev_time = time.perf_counter()

-sampler = mi.PCG32(size=batch_size)
+sampler = mi.load_dict({
+    'type': 'independent'
+})

 optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

 wi_samples = dr.repeat(sph_to_dir(0,0),batch_size)

 for i in range(total_steps):
-
-    sampler.seed(size=batch_size,initstate=i)
+    sampler.seed(i, batch_size)

-    wo_samples = mi.warp.square_to_uniform_sphere(mi.Vector2f(sampler.next_float32(),sampler.next_float32()))
-    si = dr.zeros(mi.SurfaceInteraction3f,np.array(wo_samples).shape[0])
+    wo_samples = mi.warp.square_to_uniform_sphere(sampler.next_2d())
+    si = dr.zeros(mi.SurfaceInteraction3f, batch_size)
     si.wi = wi_samples
     bsdf_context = mi.BSDFContext()
     targets = bsdf.eval(bsdf_context,si,wo_samples)

+    dr.schedule(wo_samples, targets)
+
     output = model(wo_samples)

     relative_l2_error = (output - targets.torch().to(output.dtype))**2 / (output.detach()**2 + 0.01)

Moving to C++/CUDA definitely will remove some of the (little) overhead that is still remaining. In the measurements above, Dr.JIT is about 10% of the total runtime. I don't know if that's worth the effort in your use case.

WeiPhil commented 1 year ago

My apologies, your solution works great and we do have the same notion of speed improvement! I used

sampler = mi.load_dict({
    'type' : 'independent',
    'sample_count' : batch_size
})

but then I removed the sampler.seed call inside the for loop which makes a big difference here and seems to be the same issue that was referred here https://github.com/mitsuba-renderer/mitsuba3/issues/369 Thank you again, I'll close this one :)

njroussel commented 1 year ago

Oh indeed, that's a classic pitfall :slightly_frowning_face:

Great! (I forgot to mention it earlier, but the explicit synchronization really shouldn't be needed anymore unless you're manually fiddling with CUDA streams.)

mitsuba-renderer / drjit

Extremely long per-iteration jit compilation in PyTorch/Drjit interoperability #143

System configuration