What is the correct way to present a high-resolution gpu-computed image with minimal overhead?

r0levrai commented 1 month ago

Taichi noob here :D

Is there a way to directly write to a render target / framebuffer from a taichi kernel with a constrained format (or even writing to a provided data structure)? Some kind of fast_gui=True "I don't need any conversion/copy/clear" option for ggui?

Background

Let's say I want to show a dead simple kernel like this one from the taichi python examples:

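import taichi as ti
ti.init(arch=ti.gpu)
res = (2560, 1440)  # assumed setup: the native resolution discussed below
img = ti.Vector.field(3, float, shape=res)  # assumed setup: RGB field the kernel writes to

@ti.kernel
def render(t: float):
    for i, j in img:  # the outermost loop is parallelized over all pixels
        a = ti.Vector([i / res[0], j / res[1] + 2, i / res[0] + 4])
        img[i, j] = ti.cos(a + t) * 0.5 + 0.5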

...at native resolution and framerate (2560x1440@165fps), which should be a walk in the park for the laptop 3070 I'm testing this on. However,

Using the old gui in zero-copying frame buffer mode (fast_gui=True), the presentation takes 36ms with 3 different bottlenecks (line profile detail: cuda, vulkan). Not sure, but it looks like a CPU-side conversion, copy and clear.
Using ggui, the presentation still seems to take about 12ms (twice my entire frame budget) in canvas.set_image(img) and window.show() (line profile detail: cuda, vulkan). I'm not sure why this takes so much time, since I think we're in GPU-land here.
Using an external more game-oriented presentation layer might be another option?
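
To put the 36ms and 12ms figures above in context, here is a small arithmetic sketch of the frame budget at 2560x1440@165fps. The resolution, refresh rate and timings come from the post; the 3 x float32 pixel layout is an assumption made only to estimate how much data a per-frame copy would have to move, and the helper name budgets_used is made up for this example.

```python
# Frame-budget arithmetic for the numbers quoted in the post.
REFRESH_HZ = 165             # display refresh rate from the post
WIDTH, HEIGHT = 2560, 1440   # native resolution from the post

# Time available per frame, in milliseconds.
frame_budget_ms = 1000.0 / REFRESH_HZ


def budgets_used(cost_ms: float) -> float:
    """Return how many frame budgets a step costing `cost_ms` consumes."""
    return cost_ms / frame_budget_ms


# Assumption (not stated in the thread): 3 channels of float32 per pixel.
CHANNELS, BYTES_PER_CHANNEL = 3, 4
bytes_per_frame = WIDTH * HEIGHT * CHANNELS * BYTES_PER_CHANNEL

# Data rate if every frame were copied once at the full refresh rate.
copy_rate_gb_s = bytes_per_frame * REFRESH_HZ / 1e9

if __name__ == "__main__":
    print(f"frame budget:         {frame_budget_ms:.2f} ms")
    print(f"ggui (12 ms):         {budgets_used(12):.2f} budgets")
    print(f"fast_gui (36 ms):     {budgets_used(36):.2f} budgets")
    print(f"image size per frame: {bytes_per_frame / 1e6:.1f} MB")
    print(f"one copy per frame:   {copy_rate_gb_s:.2f} GB/s")
```

Under that layout assumption a single full-image copy per frame is roughly 44 MB, which is why any conversion or copy that leaves the GPU is a plausible suspect for the timings above.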

repro: test_taichi.py

related: #8466 #5438 #4922

Note that this is easy to run into when first learning or evaluating taichi from a gamedev / game-engine-adjacent background (the first chapters of the manual talk a lot about performance, so the first thing that comes to mind is raising the tiny hello-world resolutions to see how the shaders scale), and running into a presentation API bottleneck at this point can be pretty confusing. Would be glad to propose a PR for the documentation if I get this figured out.

oliver-batchelor commented 3 weeks ago

I have also been searching for a way to do this, so I did some digging just now. I think we can use an OpenGL library e.g. moderngl to create a VBO then use something like this to map the VBO to a cuda array (and then wrap the cuda array in pytorch so it can be used with taichi directly).

https://gist.github.com/stgatilov/0bb58bf5296c3dfabd2eecd8dbf42237

r0levrai commented 3 weeks ago

Wow! I didn't dig around this much, but hoped taichi would have an API for mapping to a VBO with less code and more backend coverage. But thank you very much, that looks like a promising start!

oliver-batchelor commented 3 weeks ago

Seems like Taichi ImGui uses Vulkan, so I would guess there's some way to build such a thing in - i.e. draw a cuda image directly.

To be honest I'm more keen on figuring out how to do it generically for OpenGL as I'm using Qt for our application(s) and I could draw on the Qt OpenGL widget.

I'm keen to hear how you get on - if I have a chance, I'll play with this at some point, but it's quite low on the priority list!

taichi-dev / taichi

What is the correct way to present a high-resolution gpu-computed image with minimal overhead? #8590
