nvpro-samples / nvpro_core

shared source code and resources needed for the samples to run
Apache License 2.0

speedups: use matrix operations vs dots and multiplies #64

Closed · tigrazone closed this 2 months ago

tigrazone commented 2 months ago

The simplified version of the BSDF/BRDF eval/sample was not changed because that code is not used.

tigrazone commented 2 months ago

On the GPU, three separate dot products run slower than one mat3 multiplication. I ran tests with -test -test-frames and my variant is up to 12% faster: 28 s with this PR vs 32 s for the original commit.
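For illustration, a minimal sketch of the kind of change (hedged: the to_local signatures and the TBN layout here are written from the discussion, not copied from the actual nvpro_core code):

  // before: three separate dot products into the shading frame
  vec3 to_local(vec3 v, vec3 T, vec3 B, vec3 N)
  {
    return vec3(dot(v, T), dot(v, B), dot(v, N));
  }

  // after: one matrix operation; with mat3 TBN = mat3(T, B, N) (columns),
  // v * TBN yields the same vec3(dot(v, T), dot(v, B), dot(v, N))
  vec3 to_local(vec3 v, mat3 TBN)
  {
    return v * TBN;
  }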

NBickford-NV commented 2 months ago

Hi @tigrazone ! Thanks for this pull request. I think this is a generally good idea, and gets this code closer to how PBRT evaluates materials, for instance.

I think it would be even better if we could remove PbrMaterial::T, PbrMaterial::B, and PbrMaterial::N entirely, and replace them with PbrMaterial::TBN[...]. I worry that having the duplicate copy of the tangent, bitangent, and normal in this struct causes additional register usage.
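For example, a rough sketch of what that struct change might look like (the surrounding fields and usage comments are illustrative assumptions, not the actual nvpro_core definition):

  struct PbrMaterial
  {
    // ... other material parameters ...
    mat3 TBN;  // columns: tangent, bitangent, normal (replaces T, B, N)
  };

  // world -> local: k1Local = k1 * pbr.TBN;    local -> world: h = pbr.TBN * hLocal;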

Incidentally, the performance improvement you're seeing may be caused by changes in how the compiler decides to perform register allocation, and only indirectly caused by the changes to these mathematical operations. Small floating-point matrix multiplications generally reduce to a series of multiplications and fused multiply-add operations, the same way that dot products do. You might be thinking of Tensor Cores for large matrix multiplications.
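For intuition, the two forms lower to the same scalar pattern (illustrative only, not actual compiler output):

  float d1 = dot(v, T);                                // v.x*T.x + v.y*T.y + v.z*T.z
  float d2 = fma(v.z, T.z, fma(v.y, T.y, v.x * T.x));  // the same three terms as a multiply plus FMAs
  // (v * TBN).x is this same expression with T = TBN[0]; the .y and .z components likewise.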

One way you can observe this is by using Nsight Graphics' GPU Trace. If you capture a frame of the nvpro-samples master branch of vk_gltf_renderer and look at the Shader Pipelines section, you'll see that the compute shader that performs ray tracing uses 204 registers, 0 bytes of shared memory, and so can only have 8 warps running at a time:

[image: GPU Trace Shader Pipelines view for the master branch: 204 registers, 0 bytes of shared memory, 8 warps]

In your branch, the number of registers decreases to 128 (which is good), but now there are 23552 bytes of shared memory allocated, and the net result is that 16 warps can run at a time:

[image: GPU Trace Shader Pipelines view for the PR branch: 128 registers, 23552 bytes of shared memory, 16 warps]
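As a rough sanity check on those warp counts (assuming a 65,536-register file per SM and the usual per-warp allocation rounding; the exact granularities are assumptions here, not numbers from the capture):

  204 regs/thread * 32 threads ~ 6.5K regs/warp, and 65536 / 6.5K ~ 10, which drops to 8 warps once per-warp and warp-count allocation granularities are applied
  128 regs/thread * 32 threads = 4096 regs/warp, and 65536 / 4096 = 16 warps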

You can sort of infer that the compiler has moved some registers to shared memory. Shared memory is slower to access than registers, but sometimes spilling to shared memory improves performance by increasing the number of warps that can run at once -- and the performance improvements here are likely due to this rather than to the math changes directly. Spilling (which Martin-Karl also mentioned) isn't always good, but here it happens to be.

You can also see this empirically; as you add or remove parts of your Git diff, you can see performance change between when the optimizer decides to spill registers and when it doesn't.

If this is due to the increased size of the PbrMaterial struct, it's possible performance might drop down once you remove PbrMaterial::T, B, and N. If so, I'd say don't worry about it too much -- we can work on improving the register usage of the code as a whole, and in the meantime it turns out that the RTX pipeline (you can change the pipeline type in Renderer/Pathtracer > Render Mode) appears to have better register usage on FlightHelmet.gltf.

Thanks!

tigrazone commented 2 months ago

One more way to optimize the code is to avoid copying components of the value computed by to_local into separate variables, and instead compare the vec3 components directly. For example, this:

  const float k1h = K1K2h.x;
  const float k2h = K1K2h.y;

  // nk1 and nh must not be 0.0f or state.pdf == NaN.
  if(nk1 <= 0.0f || nh <= 0.0f || k1h < 0.0f || k2h < 0.0f)

can be rewritten as

  // nk1 and nh must not be 0.0f or state.pdf == NaN.
  if(nk1 <= 0.0f || nh <= 0.0f || K1K2h.x < 0.0f || K1K2h.y < 0.0f)

because copying by value costs both speed and local memory.

Another way to solve it is with #defines:

#define k1h (K1K2h.x)
#define k2h (K1K2h.y)

  // nk1 and nh must not be 0.0f or state.pdf == NaN.
  if(nk1 <= 0.0f || nh <= 0.0f || k1h < 0.0f || k2h < 0.0f)

The #define approach is not OK around return statements, though, and to avoid collisions with variable names in the real code we can use unusual names such as _k1h_ instead of k1h.
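A small sketch of that #define variant, with #undef added to limit the macro scope (names are illustrative):

#define _k1h_ (K1K2h.x)
#define _k2h_ (K1K2h.y)

  // nk1 and nh must not be 0.0f or state.pdf == NaN.
  if(nk1 <= 0.0f || nh <= 0.0f || _k1h_ < 0.0f || _k2h_ < 0.0f)
    ...

#undef _k1h_
#undef _k2h_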

tigrazone commented 2 months ago

Removing PbrMaterial::T, PbrMaterial::B, and PbrMaterial::N is a good idea. We can use PbrMaterial::TBN everywhere instead.

mklefrancois commented 2 months ago

Hi @tigrazone, thanks again for your pull request.

The unexpected result it produced has prompted some internal investigation on our part; your post has highlighted an area that needs a closer look. We'll get back to you with more information.

If you have any additional insights, please share them with us.

Thank you again for bringing this to our attention.

mklefrancois commented 2 months ago

Hi @tigrazone, I made an update to vk_gltf_renderer so that the benchmark only counts the frames rendered by the ray tracer, not all frames rendered by the application. This avoids, for example, including the frames rendered while loading a scene.

With this update, can you run with the following command (which enables the test for 5000 iterations)?

"E:\temp\test.gltf" -test -test-frames 5000

This should give you a report similar to the one below:

Timer null;      N/A      0; CPU   1427;
 Timer Raytrace;         VK    1285; CPU     30; (microseconds, avg 128)
 Timer Tonemapper;       VK       8; CPU     17; (microseconds, avg 128)
Number of frames: 5000
Testing Time: 8.937 s

The important information is the time the GPU spends on ray tracing. In the above report, the average time for one frame is 1285 microseconds. The total test time is irrelevant because it includes loading, warm-up, etc. The lower that number, the better, since less time is spent rendering a frame.

I ran several tests in different conditions (view from outside and view from inside, with and without your changes), but the results vary by less than ±1%.

I ran the tests on an RTX Quadro 3000 (laptop) and on an RTX 6000 Ada (desktop) without seeing any improvements.

Could you please share your results before and after, along with the graphics card, driver version, and scene you are using, since all of those factors can influence the results.

Note: one other thing to take into account is to verify if the GPU throttles or not. This can be viewed in the NVML Monitor, and if it is throttling, there will be a red indicator.

[image: NVML Monitor showing the throttling indicator]

If that is the case, open a terminal in admin mode and set a lower graphics clock using the command shown below.

[image: terminal command for setting a lower graphics clock]

tigrazone commented 2 months ago

RTX 3060, Windows 10. With the PR: 33 s; before the PR: 40 s. Scene: kitchen.

NBickford-NV commented 2 months ago

Thanks @tigrazone! My guess is that if it's due to a register spilling compiler optimization, then it's probably architecture-specific -- that's why we saw the performance improvement on the RTX 3060 and 3090 (Ampere), but Martin-Karl didn't reproduce the performance improvement on the RTX Quadro 3000 Mobile (Turing) or RTX 6000 Ada (Ada). We're continuing to investigate the compiler's behavior here internally.

mklefrancois commented 2 months ago

Hi @tigrazone, could you please check whether you see a similar gain with the RTX pipeline, or whether the results stay mostly the same.

[image: Render Mode selection in Renderer/Pathtracer showing the RTX pipeline option]

It would be useful if you could provide the timings in microseconds using -test -test-frames 1000.

tigrazone commented 2 months ago

I use the indirect (ray query) mode. It is faster and more useful than RTX (the ray tracing pipeline).

mklefrancois commented 2 months ago

If it isn't too much trouble, could you kindly provide the requested information, as it will be very helpful in tracking down the source of the issue.

tigrazone commented 2 months ago

Sometimes shaderc takes too long to compile, or the compiled code does not always run at the same speed. Precompiled .spv/.h files were a better approach. I don't understand why NVIDIA started to use shaderc and Slang DLLs to compile code.

tigrazone commented 2 months ago

RTX mode, 1000 frames.

With the PR:

 Timer Raytrace;         VK   12862; CPU     75; (microseconds, avg 128)
 Timer Tonemapper;       VK      46; CPU     20; (microseconds, avg 128)
Number of frames: 1000
Testing Time: 16.795 s

Before the PR:

 Timer Raytrace;         VK   14786; CPU     74; (microseconds, avg 128)
 Timer Tonemapper;       VK      47; CPU     18; (microseconds, avg 128)
Number of frames: 1000
Testing Time: 19.236 s

NBickford-NV commented 2 months ago

Whether we use shaderc or glslangValidator actually depends on whether each sample compiles shaders at runtime or at build-time. Vk_gltf_renderer uses shaderc because it actually supports hot-reloading shaders -- if you press Ctrl-R, it'll recompile its shaders from disk, which is useful if you're making lots of changes and need to iterate quickly (this might be especially helpful for trying out performance optimizations). Other samples use glslangValidator (and you'll see CMake calling into glslangValidator to find shader dependencies for those samples at configuration time).

tigrazone commented 2 months ago

After checking the precompiled-shaders version, I found my PR is only 10-15% faster.

mklefrancois commented 2 months ago

It is interesting that the precompiled shader runs faster. In summary, the speedup was only seen on Ampere, which means the current code is actually slower than it should be on that architecture, while the compiler handles it correctly on the others. The slowdown of the original code on this particular architecture is due to unexpected driver behavior that we are currently investigating. Thanks for the report.

tigrazone commented 2 months ago

For the shader compiler I used -g0 -Os instead of the default -g. Maybe it works faster because I use optimizations for the shaders.