vulkano-rs / vulkano

Safe and rich Rust wrapper around the Vulkan API
Apache License 2.0

Question: is there a benchmark for Vulkano where one thread renders and another thread loads content? #2529

Closed John-Nagle closed 3 months ago

John-Nagle commented 3 months ago

I'm using Rend3/WGPU, and I have a big performance problem. Multiple threads are loading content into the GPU in parallel with the render thread rendering. With no content loading in progress, I get 60 FPS. With content loading in progress, that can drop to 10 FPS. That's awful. I need to know if switching from WGPU to Vulkano can fix that.

The application is a metaverse client, Sharpview. 100% safe Rust. Here are some test videos.

Content is constantly loaded from servers as the player travels around a big world with petabytes of content. The whole world cannot possibly fit in the GPU all at once. So there's a huge amount of GPU data loading required. That's where Vulkan is supposed to perform well.

The goal of all this is to get Second Life/Open Simulator, which has historically been sluggish, to perform like an AAA title. The older clients use OpenGL, and they are bottlenecked on the main thread.

Sharpview is highly concurrent. The render thread does nothing but render. Content loading threads run in parallel. The content loading threads shouldn't kill the render thread performance, but they do. The performance problems seem to be 1) WGPU has lock conflicts, and 2) WGPU only uses one Vulkan queue. This has been verified with Tracy profiles, and the WGPU devs admit this is a problem. Vulkan itself should have the needed performance, but WGPU introduces bottlenecks.
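For readers unfamiliar with this thread layout, here is a minimal sketch of it using only the Rust standard library. It is not Sharpview, Rend3, or WGPU code; `Asset` and `drain_uploads` are hypothetical names invented for illustration. A loader thread pushes finished assets onto a channel, and the render loop takes at most a fixed number per frame without ever blocking. Note that this only shows the application-side structure: it does nothing about lock contention or single-queue submission inside the graphics layer, which is the bottleneck described above.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::thread;

/// Hypothetical stand-in for a decoded asset that is ready for GPU upload.
#[derive(Debug, PartialEq)]
struct Asset {
    id: u32,
    bytes: Vec<u8>,
}

/// Take at most `budget` finished assets off the channel without blocking,
/// so the render loop never waits on the loader threads.
fn drain_uploads(rx: &Receiver<Asset>, budget: usize) -> Vec<Asset> {
    let mut ready = Vec::new();
    while ready.len() < budget {
        match rx.try_recv() {
            Ok(asset) => ready.push(asset),
            // Nothing queued right now, or every loader has finished.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    ready
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Content loading thread: produces assets independently of rendering.
    let loader = thread::spawn(move || {
        for id in 0..5 {
            tx.send(Asset { id, bytes: vec![0u8; 1024] }).unwrap();
        }
    });
    // Joined up front only to make this demo deterministic; a real client
    // would keep the loader running while frames are rendered.
    loader.join().unwrap();

    // Stand-in for the render loop: each "frame" accepts at most two assets.
    let mut frame = 0;
    loop {
        let batch = drain_uploads(&rx, 2);
        if batch.is_empty() {
            break;
        }
        let total: usize = batch.iter().map(|a| a.bytes.len()).sum();
        let ids: Vec<u32> = batch.iter().map(|a| a.id).collect();
        println!("frame {frame}: assets {ids:?}, {total} bytes");
        frame += 1;
    }
}
```

The per-frame budget is a design choice assumed here, not something taken from the thread; it caps how much upload work a single frame can absorb.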

So that's the problem.

I have a benchmark, render-bench. This demonstrates render thread slowdown with one render thread and one content creation thread. Is there a similar benchmark for Vulkano I could use for comparison? Something where content is loaded at high rates from other threads? Thanks.

marc0246 commented 3 months ago

To answer the question: we don't have a benchmark like that, and I don't personally have or know of one either. However, we do have an example of async updates which you might find interesting. It's not the same thing, as there is no constant flow of data, but it's designed so that you can visually see the lag of loading a big chunk of a texture without it affecting rendering performance.

As for Vulkan's benefits, you are correct. As far as I'm aware, all dedicated desktop GPUs have a queue family dedicated to transfers and usually one for compute as well, in addition to the one mainly for graphics. I want to stress the "family" part, because that's what guarantees entirely disconnected hardware. So if your Vulkan implementation reports a queue family for transfers (and usually sparse binding) only, that has its own DMA hardware. The individual queues within a family are mostly software/symbolic, not actual hardware. Using multiple queues from one family allows the Vulkan implementation to schedule different submissions out of order and with different priorities, but they will use the same hardware. On UMAs (unified memory architectures), however, there usually isn't a dedicated transfer queue (nor async compute), which makes sense from an energy-efficiency perspective and because there's only one memory domain and no PCIe bus, so it's questionable whether there would be a benefit to begin with.
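To make the "family" point concrete, here is a rough sketch of how an application might look for a transfer-only queue family. It deliberately does not use vulkano's API: `QueueFlags` and `QueueFamily` are hypothetical plain structs that mirror the capability bits Vulkan reports per queue family, and the two family layouts are made-up examples, not data from any real device. The fallback relies on the Vulkan rule that any family supporting graphics or compute can also execute transfer commands.

```rust
/// Hypothetical mirror of the capability bits Vulkan reports per queue family.
#[derive(Clone, Copy, Debug, PartialEq)]
struct QueueFlags {
    graphics: bool,
    compute: bool,
    transfer: bool,
    sparse_binding: bool,
}

/// Hypothetical stand-in for one entry of the queue family list.
#[derive(Clone, Copy, Debug)]
struct QueueFamily {
    flags: QueueFlags,
    queue_count: u32,
}

/// Index of a family that supports transfers but neither graphics nor
/// compute, i.e. the kind described above as having its own DMA hardware.
fn find_dedicated_transfer_family(families: &[QueueFamily]) -> Option<usize> {
    families.iter().position(|f| {
        f.queue_count > 0 && f.flags.transfer && !f.flags.graphics && !f.flags.compute
    })
}

/// Family to use for uploads: the dedicated one if present, otherwise any
/// graphics or compute family, since those can also run transfer commands.
fn find_upload_family(families: &[QueueFamily]) -> Option<usize> {
    find_dedicated_transfer_family(families).or_else(|| {
        families
            .iter()
            .position(|f| f.queue_count > 0 && (f.flags.graphics || f.flags.compute))
    })
}

/// Made-up layout resembling what a discrete desktop GPU might report.
fn discrete_like() -> Vec<QueueFamily> {
    let f = |graphics, compute, transfer, sparse_binding| QueueFlags {
        graphics,
        compute,
        transfer,
        sparse_binding,
    };
    vec![
        QueueFamily { flags: f(true, true, true, true), queue_count: 16 },
        QueueFamily { flags: f(false, false, true, true), queue_count: 2 },
        QueueFamily { flags: f(false, true, true, false), queue_count: 8 },
    ]
}

/// Made-up layout resembling a UMA device with a single do-everything family.
fn uma_like() -> Vec<QueueFamily> {
    vec![QueueFamily {
        flags: QueueFlags { graphics: true, compute: true, transfer: true, sparse_binding: false },
        queue_count: 1,
    }]
}

fn main() {
    for (name, families) in [("discrete-like", discrete_like()), ("UMA-like", uma_like())] {
        let dedicated = find_dedicated_transfer_family(&families);
        let upload = find_upload_family(&families);
        println!("{name}: dedicated transfer = {dedicated:?}, upload family = {upload:?}");
        if let Some(i) = upload {
            println!("  sparse binding on upload family: {}", families[i].flags.sparse_binding);
        }
    }
}
```

Picking a second queue from the same family as the graphics queue would not be expected to give the same isolation, since, as noted above, queues within a family share hardware.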

This isn't Vulkan's only advantage however. There's also the fine-grained (device-side) synchronization that you can do, unchecked command recording, as well as full control over memory allocation, to name a few.

But don't get your hopes up too much. When it comes to vulkano, it currently has huge inefficiencies both host and device-side. There's a lot of tech debt stemming from vulkano's long history, which dates back to the Vulkan release, when no one knew how to design these APIs. With the benefit of hindsight, I'm currently working on rewriting all synchronization as a task graph, which is the industry-standard solution. That will allow us to replace all the owned collections with references, alleviating the host-side inefficiencies, and allow you to actually do fine-grained and explicit synchronization, alleviating the device-side inefficiencies. That's the plan for v0.35.

Also, if you're looking for a 100% safe wrapper then this isn't it, despite what the readme would lead you to believe (we really need to update it). The runtime of shaders is not validated and there are no plans for that. Only things running on the host are validated (unless you use the unchecked functions).

John-Nagle commented 3 months ago

Thank you very much for the honest answer.

> When it comes to vulkano, it currently has huge inefficiencies both host and device-side. ... I'm currently working on rewriting all synchronization

Good to hear that there is work underway. Please keep plugging away. It's hard.

I'm mostly targeting gamer PCs. Steam, the game distribution system, tracks this. Currently, the top 15 graphics cards used by Steam gamers are all pretty good NVIDIA cards, the NVIDIA 3060 being the most popular. All of those have enough power to run Vulkan/Vulkano with concurrent asset loading. At this moment, 28 million users are logged into Steam, and they claim 120 million active users. So that's the target market for high-performance gaming. That's where Unreal Engine shines.

As yet, no Rust graphics stack can use such hardware effectively. Everybody bottlenecks on the main thread. This is a major drag on Rust game dev.

marc0246 commented 3 months ago

I wouldn't say "no stack", as there are game engines that don't use wgpu but Vulkan such as Kajiya.

John-Nagle commented 3 months ago

Kajiya looks nice. But they write "Kajiya does not currently aim to be a fully-featured renderer used to ship games, support all sorts of scenes, lighting phenomena, or a wide range of hardware. It's a hobby project, takes a lot of shortcuts, and is perpetually a work in progress." Probably not a good idea to try to use that for something else.

There's also a sailing game that calls "good old DX11", as the developer puts it, directly. Windows-only, of course.