mitsuba-renderer / drjit

Dr.Jit — A Just-In-Time-Compiler for Differentiable Rendering
BSD 3-Clause "New" or "Revised" License

Real-Time Rendering - Frame to Screen Data Transfer #206

Closed jakobtroidl closed 9 months ago

jakobtroidl commented 9 months ago

Inspired by this amazing new paper, I am trying to use drjit to build a forward-only renderer that runs in real time.

I wonder how to best transfer frames rendered by a kernel to the screen as fast as possible. Currently, I first convert the data into a numpy array and then render it using the PyGame window library - this runs at ~20 FPS on my LLVM-accelerated Intel MacPro (see code below). This seems inefficient because there is a lot of unnecessary data transfer (GPU->CPU->GPU for the CUDA backend) happening for each frame. Do you have recommendations for more clever design choices when using drjit for real-time rendering?

pip install drjit numpy
python -m pip install -U pygame==2.5.2 --user
import pygame
import numpy as np
import drjit as dr
from drjit.llvm import Float, Array3f, TensorXf

def sdf(p: Array3f) -> Float:
    return dr.norm(p) - 1

def trace(o: Array3f, d: Array3f) -> Array3f:
    for i in range(10):
        o = dr.fma(d, sdf(o), o)
    return o

def shade(p: Array3f, l: Array3f, eps: float = 1e-3) -> Float:
    n = Array3f(
        sdf(p + [eps, 0, 0]) - sdf(p - [eps, 0, 0]),
        sdf(p + [0, eps, 0]) - sdf(p - [0, eps, 0]),
        sdf(p + [0, 0, eps]) - sdf(p - [0, 0, eps])
    ) / (2 * eps)
    return dr.maximum(0, dr.dot(n, l))

def render_sphere():
    x = dr.linspace(Float, -1, 1, 1000)
    x, y = dr.meshgrid(x, x)
    p = trace(o=Array3f(0, 0, -2), d=dr.normalize(Array3f(x, y, 1)))
    sh = shade(p, l=Array3f(0, -1, -1))
    sh[sdf(p) > .1] = 0
    img = Array3f(.1, .1, .2) + Array3f(.4, .4, .2) * sh
    img_flat = dr.ravel(img)
    return TensorXf(img_flat, shape=(1000, 1000, 3))

# pygame setup
pygame.init()
screen = pygame.display.set_mode((1280, 720))
clock = pygame.time.Clock()
running = True
dt = 0

# font for rendering the FPS overlay
font = pygame.font.Font(None, 36)

while running:
    # poll for events
    # pygame.QUIT event means the user clicked X to close your window
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
    screen.fill("purple")

    array = render_sphere()
    data = np.array(array)
    data = data * 255
    data = data.astype(np.uint8)

    surface = pygame.surfarray.make_surface(data)
    screen.blit(pygame.transform.scale(surface, (1280, 720)), (0, 0))

    # Calculate and display FPS
    fps = clock.get_fps()
    fps_text = font.render(f"FPS: {fps:.2f}", True, pygame.Color('white'))

    screen.blit(fps_text, (50, 50))

    pygame.display.flip()
    dt = clock.tick(150) / 1000

pygame.quit()
njroussel commented 9 months ago

Hi @jakobtroidl

Out of the box, I'd say there are a few things missing in Dr.Jit to really squeeze every bit of performance out of something like this.

As you suggested, your current approach has some data transfer overhead. Fundamentally, this problem lies with the display tool/framework you want to use: find one which will accept CUDA arrays through the dlpack interface. (There are some ongoing discussions about the necessity to synchronize when using that interface: https://github.com/mitsuba-renderer/drjit/issues/198). Alternatively, the Texture2f/Texture3f classes use CUDA textures, and I'd expect some frameworks to accept these directly. However, their respective handles aren't exposed through Python, so you'd need to add that yourself.
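For illustration, a minimal sketch of such a zero-copy hand-off is below. It assumes the Dr.Jit CUDA tensor implements the DLPack protocol (see #198 for the synchronization caveats) and that a DLPack-aware consumer such as PyTorch with CUDA support is installed; the tensor construction just mirrors the render_sphere() example above.

import drjit as dr
import torch
from drjit.cuda import Float, TensorXf

# Placeholder image buffer on the GPU, shaped like the render_sphere() output
img_flat = dr.zeros(Float, 1000 * 1000 * 3)
img = TensorXf(img_flat, shape=(1000, 1000, 3))
dr.eval(img)  # make sure the kernel has run and the buffer exists on the device

# Assumed: the tensor exposes __dlpack__, so this stays on the GPU (no host copy)
img_torch = torch.from_dlpack(img)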

Another big overhead is the tracing runtime of executing the Python interpreter through your render() function. Although Dr.Jit will cache its kernels and re-use them, it still has to "read" through your code entirely to realize that you're executing some piece of code that it has already seen. There isn't much you can do to alleviate this.
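To put a number on that per-frame cost, here is a rough, hedged sketch that times a single frame of the render_sphere() example from the original post; dr.eval()/dr.sync_thread() are used so the measured interval includes both tracing and the actual kernel execution.

import time
import drjit as dr

t0 = time.perf_counter()
img = render_sphere()   # Python tracing happens here on every frame
dr.eval(img)            # compile (or fetch the cached kernel) and launch it
dr.sync_thread()        # wait for the kernel to finish before stopping the timer
print(f"frame time: {(time.perf_counter() - t0) * 1e3:.2f} ms")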

dvicini commented 9 months ago

I think it should still be possible to access the underlying evaluated Dr.Jit CUDA buffers directly using the data_() call. E.g.,

import drjit as dr

a = dr.linspace(dr.cuda.Float, 0, 1, 1024)
print(a.data_())

If I recall correctly, this will be a simple pointer directly to the CUDA memory. So if you then have a CUDA kernel that draws that to the screen/a texture, that might be quite fast. I agree with Nicolas that the tracing likely adds significant overhead.
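To illustrate one way that pointer could be consumed without leaving the GPU, here is a hedged sketch that wraps it in a CuPy array. It assumes data_() indeed returns the raw device address as an integer and that the variable has been evaluated first; none of this is official Dr.Jit API.

import cupy as cp
import drjit as dr

a = dr.linspace(dr.cuda.Float, 0, 1, 1024)
dr.eval(a)                        # force the kernel to run so the buffer exists

ptr = a.data_()                   # assumed: raw CUDA device address as an integer
nbytes = 1024 * 4                 # 1024 float32 values
mem = cp.cuda.UnownedMemory(ptr, nbytes, owner=a)
view = cp.ndarray((1024,), dtype=cp.float32,
                  memptr=cp.cuda.MemoryPointer(mem, 0))
print(view[:8])                   # reads the Dr.Jit buffer without a host round-trip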

jakobtroidl commented 9 months ago

Thank you so much for your answers.

Another big overhead is the tracing runtime of executing the Python interpreter through your render() function.

I am wondering how I could work around this. Would it make sense to compile the CUDA kernels once from a Python implementation and then invoke them from a C++-based rendering loop? There must be a way around this issue, since the paper mentioned above is so incredibly fast and it seems like they implemented their forward pass in drjit.

wjakob commented 9 months ago

This paper uses a custom Dr.Jit version with many project-specific modifications. It's our goal to make something like this possible in mainline Dr.Jit in the future. Right now it is not possible due to the tracing overheads mentioned above. It will take a long time, so you may want to pursue other options if your goal is to do this right now.

jakobtroidl commented 9 months ago

ok, thanks for the heads-up.