Sampling integrator: large memory usage (> 16 GB) for multipass rendering in vectorized variants

xacond00 commented 1 month ago

Description

Rendering any scene with a sampling integrator from within the compiled C++ binary causes an out-of-memory error or a crash as soon as the wavefront size limit is surpassed in any way. E.g. the Cornell box scene at 1024x1024:

4095 spp (renders completely fine): [screenshot: 4095_spp]
4096 spp (out of memory): [screenshot: 4096_spp]
8192 spp (throws an error): [screenshot]

The peak memory usage gets about 1000x higher for no apparent reason, even though Mitsuba reports that the rendering was split into multiple passes. I thought that should lower memory usage, not increase it?

I'm also seeing the same thing when I arbitrarily split the rendering into smaller passes within the sampling integrator's code itself (e.g. 4 * 64 spp vs. a full 256 spp): the memory usage increases drastically instead of decreasing.

Steps to reproduce

master branch

... Compile C++ into binary
... Run the llvm / cuda rgb variants on any scene that uses the independent sampler and where spp * film_size surpasses wavefront_size_limit (0xffffffff).
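To make the threshold in the last step concrete, here is a minimal arithmetic sketch. The helper names `wavefront_size` and `min_passes` are hypothetical and not part of Mitsuba; `min_passes` only gives a lower bound on the number of passes, and the actual integrator may round differently so that the spp divides evenly.

```cpp
#include <cassert>
#include <cstdint>

// Limit quoted in the steps above.
constexpr std::uint64_t kWavefrontSizeLimit = 0xffffffffull;

// Hypothetical helper: number of lanes needed to render everything in one pass.
std::uint64_t wavefront_size(std::uint64_t width, std::uint64_t height,
                             std::uint64_t spp) {
    return width * height * spp;
}

// Hypothetical helper: smallest number of passes that keeps each pass within
// the limit (ceiling division). This is an assumption about the splitting
// rule, not a copy of the integrator's code.
std::uint64_t min_passes(std::uint64_t wavefront,
                         std::uint64_t limit = kWavefrontSizeLimit) {
    assert(limit > 0);
    return (wavefront + limit - 1) / limit;
}
```

For a 1024x1024 film, 4095 spp gives 4,293,918,720 lanes, which is just below the limit of 4,294,967,295, so one pass suffices. 4096 spp gives exactly 2^32 lanes, one more than the limit, so at least 2 passes are needed. This matches the jump in behavior between 4095 and 4096 spp reported above.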

Edit 1:

Here is the memory usage when I split 1024 spp into 8 passes * 128 spp: [screenshot]

And here I tried removing sampler->schedule_state() inside the multi-pass loop, still with 8 passes: [screenshot]

The rendering time roughly doubled, though.

Is this the expected behavior ?

Edit 2:

I thought about it, and it seems to me that drjit tries to evaluate the whole state at once, which works out to those 16 GB at the maximum wavefront size. This doesn't happen when n_passes == 1, because of this condition:

    if (n_passes > 1) {
        sampler->advance(); // Will trigger a kernel launch of size 1
        sampler->schedule_state();
        dr::eval(block->tensor());
    }

But it still feels like a huge oversight, and it should definitely be fixed.

Edit 3:

Just repeatedly forking the original sampler is as fast as the original, yet only uses 16 MB of memory (the image looks unbiased, because of the seed):

    // Potentially render multiple passes
    for (size_t i = 0; i < n_passes; i++) {
        auto sampler = sensor->sampler()->fork();
        sampler->seed(i * 512, wavefront_size);
        render_sample(scene, sensor, sampler, block, aovs.get(), pos,
                      diff_scale_factor);

        if (n_passes > 1) {
            //sampler->advance(); // Will trigger a kernel launch of size 1
            //sampler->schedule_state();
            dr::eval(block->tensor());
        }
    }

This definitely has to be a bug? Evaluating the whole image block doesn't cause memory usage to spike, but evaluating a bunch of Uint32 states in a sampler does? By the way, the stratified sampler shows the same behavior, and I would expect the others to behave the same.

xacond00 commented 4 weeks ago

I've changed the title, because it might have seemed that I purposefully tried to run a wavefront larger than the wavefront size limit, which was not the case. This issue is only related to having so many SPP that the computation is internally split into multiple passes, which causes the aforementioned 16-32 GB peak memory usage, in both LLVM and CUDA, out of the blue.

To add motivation: fixing this issue would open up many options to massively speed up rendering. It turns out that splitting the workload into multiple smaller passes and forcibly evaluating the spectrum contribution from the integrator speeds up the computation by a factor of 2 to 4. The memory cost is manageable and proportional to the number of pixels * SPP per pass, which is just around 800 MB with rgb variants for a 1 Mpx image at 64 SPP per pass.
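The 800 MB figure can be reproduced with a back-of-the-envelope estimate. The sketch below assumes that the dominant cost is one single-precision RGB value (3 * 4 bytes) per sample in the pass; that assumption and the helper name `pass_radiance_bytes` are illustrative and not taken from Mitsuba.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical estimate: bytes needed to hold one RGB float value for every
// sample of a single pass. Assumes 3 channels of 4 bytes each by default.
std::uint64_t pass_radiance_bytes(std::uint64_t pixels,
                                  std::uint64_t spp_per_pass,
                                  std::uint64_t channels = 3,
                                  std::uint64_t bytes_per_channel = 4) {
    assert(channels > 0 && bytes_per_channel > 0);
    return pixels * spp_per_pass * channels * bytes_per_channel;
}
```

With 1024 * 1024 pixels and 64 SPP per pass this gives 805,306,368 bytes, i.e. roughly 800 MB, consistent with the number quoted above.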

rtabbara commented 3 weeks ago

Hi @xacond00,

I believe this is expected behaviour. There was a discussion here that similarly covered what you've encountered, and the answer is still relevant:

Had a quick look at it this morning. It's a bit unfortunate but we need to evaluate and store the sampler's state between each pass, just in case it's some stratified sampler. That state can be surprisingly large, for example in an independent sampler it's represented by two 8-byte values per lane/thread: 2 * 8 * (2 ** 31 - 1) ~= 68 GB

We should maybe have a special code path for the independent sampler as it really doesn't need to store its state between passes.

So while your solution in edit 3 may be fine for an independent sampler, more generally it may not be applicable.
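As a worked version of the state-size estimate quoted above, the following sketch multiplies the lane count by two 8-byte values per lane. The helper name `sampler_state_bytes` is hypothetical; the per-lane layout is taken from the quoted explanation, not from the sampler's source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes needed to store the sampler state for all lanes,
// assuming two 8-byte values per lane as described in the quoted answer.
std::uint64_t sampler_state_bytes(std::uint64_t lanes,
                                  std::uint64_t values_per_lane = 2,
                                  std::uint64_t bytes_per_value = 8) {
    assert(values_per_lane > 0 && bytes_per_value > 0);
    return lanes * values_per_lane * bytes_per_value;
}
```

For 2^32 - 1 lanes (the 0xffffffff limit mentioned in the steps to reproduce) this evaluates to 68,719,476,720 bytes, about 68 GB. For 2^31 - 1 lanes it evaluates to 34,359,738,352 bytes, about 34 GB, so the quoted 68 GB total corresponds to the larger lane count.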

xacond00 commented 3 weeks ago

So while your solution in edit 3 may be fine for an independent sampler, more generally it may not be applicable.

Yes, I know that. But why don't the first pass and other computations cause any significant VRAM usage in that case?

It's really unfortunate: because of this single quirk, the software in its default configuration (without using deprecated options) basically won't run on anything less than professional-grade GPUs, and in some cases not even on a 4090. Would host-side caching on Linux be possible at all? Or, at the very least, querying the available memory to break down the spp per pass automatically?

By the way, in the current state this applies to all samplers, not just non-independent ones, as you incorrectly changed in the title.

Angom8 commented 6 days ago

Hello! I had this specific issue when I was handling adaptive sampling (which required multipass rendering with non-uniform SPP per pass). I worked on professional-grade GPUs, but it was still an important issue. I think it might not be possible to reduce the consumption because of DrJIT's current implementation / loops. I would like to help, or to hear about this issue again if a fix is being worked on, though.

xacond00 commented 2 days ago

Hello! I had this specific issue when I was handling adaptive sampling (which required multipass rendering with non-uniform SPP per pass). I worked on professional-grade GPUs, but it was still an important issue. I think it might not be possible to reduce the consumption because of DrJIT's current implementation / loops. I would like to help, or to hear about this issue again if a fix is being worked on, though.

If you exclusively use the independent sampler, you can use this workaround in the SamplingIntegrator:

    // Potentially render multiple passes
    for (size_t i = 0; i < n_passes; i++) {
        auto sampler = sensor->sampler()->fork();
        sampler->seed(i * 512, wavefront_size);
        render_sample(scene, sensor, sampler, block, aovs.get(), pos,
                      diff_scale_factor);

        if (n_passes > 1) {
            //sampler->advance(); // Will trigger a kernel launch of size 1
            //sampler->schedule_state();
            dr::eval(block->tensor());
        }
    }

I.e., instead of advancing the sampler, just fork it with a new seed. If you render a different SPP in each pass, you might also have to set the wavefront size and the SPP per pass on each fork, e.g. by moving the sampler setup code into the loop.
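The idea behind the workaround can be illustrated with a small standalone toy that does not use Mitsuba or Dr.Jit at all. Here `std::mt19937` is only a stand-in for the independent sampler, and `estimate_mean_multipass` is a hypothetical name: each pass constructs a fresh generator from a pass-dependent seed, so no generator state has to be kept alive between passes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>

// Toy stand-in for the workaround: estimate the mean of a uniform [0, 1)
// variable over several passes, reseeding a fresh generator in every pass.
double estimate_mean_multipass(std::size_t n_passes,
                               std::size_t samples_per_pass) {
    assert(n_passes > 0 && samples_per_pass > 0);
    double sum = 0.0;
    for (std::size_t i = 0; i < n_passes; ++i) {
        // Fresh generator per pass; the seed mirrors the i * 512 choice above.
        std::mt19937 rng(static_cast<std::uint32_t>(i * 512));
        for (std::size_t s = 0; s < samples_per_pass; ++s)
            sum += rng() / 4294967296.0; // map 32-bit output to [0, 1)
    }
    return sum / static_cast<double>(n_passes * samples_per_pass);
}
```

Because every pass uses a different seed, the passes draw different sample sequences and their average still converges to the expected value of 0.5. This only works for samplers whose samples do not depend on what earlier passes generated, which is why the workaround is limited to the independent sampler.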

mitsuba-renderer / mitsuba3