vsg-dev / VulkanSceneGraph

Vulkan & C++17 based Scene Graph Project
http://www.vulkanscenegraph.org
MIT License

Headless rendering much slower on AMD GPUs? (30x slower) #1208

Open drywolf opened 1 month ago

drywolf commented 1 month ago

Describe the bug: I am using VSG to perform headless rendering (i.e. no Swapchain and no vsg::Window). The code that I am using is very similar to vsgheadless.cpp from the vsgExamples.

Also when looking at the Windows Task-Manager GPU performance metrics, there is an interesting difference between the two GPUs:

AMD RX 5700 XT

NV RTX 2080 TI


I already tried to do some profiling on the AMD card to find out what is happening, but all of the AMD profiling tools fail to function.

This is the minimal code to reproduce what I showed above: https://gist.github.com/drywolf/690c775bb181c946b30ed67ebcdee3de

PS: the minimal code does not render anything, it only contains a single RenderPass that would implicitly clear the color & depth-stencil images, but that is all the code is doing. So it is quite surprising to see the low FPS / high Copy load on the AMD card, for such a trivial minimal workload.

martinweber commented 1 month ago

I am seeing an even higher load of 74% on the Copy queue. This is with an AMD RX 6700 XT, 16 GB on a dual monitor setup (4K and 1440p resolution respectively). Framerate is at about 235 fps, so even worse.

Screenshot 2024-06-05 163842

robertosfield commented 1 month ago

I think it's important to differentiate between rendering on the GPU and copying of data to and from the GPU over the PCI express bus.

I presume this thread is actually about copying rather than rendering, so the title of this thread is most likely misleading. Is this so, or am I just reading things wrong?

There isn't any information above about the amount of data being transferred and what mechanism is being used.

When performance differs unexpectedly between hardware/driver combinations, the cause is sometimes a Vulkan error that one combination copes with just fine while another slows down. Running the application with the Vulkan validation layer on would be a useful test to make sure there are no issues that need fixing.

As a general comment, when writing in English it's best to stick with English language conventions on numbers, so a . is a decimal point, not a delimiter between thousands. A German convention of 13.000 in an English language text will be read as 13, not 13 thousand. Having to second guess what folks might mean by what they write just takes away from the bandwidth required to understand the actual problem in hand.

drywolf commented 1 month ago

I apologize if I was unable to communicate the issue at hand clearly enough.

I think it's important to differentiate between rendering on the GPU and copying of data to and from the GPU over the PCI express bus.

I presume this thread is actually about copying rather than rendering, so the title of this thread is most likely misleading. Is this so, or am I just reading things wrong?

There isn't any information above about the amount of data being transferred and what mechanism is being used.

That is the curious thing here. We were seeing worse performance on AMD GPUs than we would have expected (taking a rough guess based on the hardware specs). So we started to reduce our VSG code down to something more minimal, to isolate where the AMD driver/GPU might be doing something wasteful.

We now ended up with the minimal example code that I mentioned above, and it is still showing the same low performance & unexpected COPY in Task-Manager that we were seeing in our production app: https://gist.github.com/drywolf/690c775bb181c946b30ed67ebcdee3de

Translating this code to OpenGL, for example, would give a render loop that just calls glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT) and nothing else!

So the framerate should be very high, and there should not be any GPU memory copies happening. Because this is headless rendering, there should also be no Vulkan swapchain involved in any way.

That is why the COPY workload in the Task-Manager & the low FPS on AMD are so surprising & unexpected. There is no Vulkan/VSG code that I could see responsible for this copy-overhead.
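
As a rough way to put a number on this independently of Task Manager, the loop itself can be timed. The sketch below is not VSG code: `measureFrames` and `FrameStats` are hypothetical helpers, and `renderFrame` is an assumed stand-in for whatever one iteration of the real render loop does (recording, submitting and waiting for one frame).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Result of timing a fixed number of frames.
struct FrameStats
{
    std::size_t frames = 0;
    double seconds = 0.0;

    // Average frames per second over the measured interval.
    double fps() const { return seconds > 0.0 ? frames / seconds : 0.0; }
};

// Calls renderFrame() frameCount times and measures the wall-clock time taken.
// renderFrame is a hypothetical placeholder for one iteration of the real
// render loop; nothing here depends on VSG or Vulkan.
FrameStats measureFrames(std::size_t frameCount, const std::function<void()>& renderFrame)
{
    assert(renderFrame && "renderFrame must be callable");

    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < frameCount; ++i) renderFrame();
    const std::chrono::duration<double> elapsed = clock::now() - start;

    return FrameStats{frameCount, elapsed.count()};
}
```

Passing a lambda that runs one frame of the clear-only loop and reading `fps()` would give a figure that can be compared directly between the two GPUs.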

@robertosfield So I wanted to ask if maybe you have stumbled across something similar related to AMD GPUs while working on VSG / other Vulkan projects?

Thanks

robertosfield commented 1 month ago

I had a quick look at the example and nothing jumps out as a possible cause of slower rendering. I'm really busy with other VSG work right now so I'm not able to test the example as is; perhaps others can test it to get a feel for how things perform on different hardware/OS/driver combinations.

Do any of the standard VSG examples exhibit the same performance issue?

As a general comment, I've mostly been developing on Linux when writing the VSG, using either an AMD 5700G integrated GPU or GeForce 1650 and 2080 cards. I've also got an Intel laptop and desktop and use the integrated GPU on these. Mostly I'm seeing really consistent performance across the board.

The integrated GPUs show lower cost of copying data from GPU associated memory into CPU associated memory than on the dedicated GPUs.

The NVidia cards list more queue options, but that's down to their drivers. This can provide extra options for lowering the cost of copying, but generally I've found the AMD side to have lower copy cost. That is on an integrated GPU though, so it's comparing apples to oranges. As I don't have a dedicated AMD card, I can't say how one would perform.

Vulkan and VSG support GPU timing stats, and the vsg::Profiler supports both GPU and CPU stats collection, so perhaps this is something to try out when profiling how the application is running. The vsg::Profiler can output its results to console/file after the collection phase, and I've used it a few times to figure out the cost of different parts of the work.

I would also recommend trying the same tests across different OS's and hardware/driver combinations.

Mikalai commented 1 month ago

@drywolf Similar behaviour for me. [image]

Mikalai commented 1 month ago

@drywolf Same hardware but on Fedora 40 performs much better. [image]

drywolf commented 1 month ago

Thanks @Mikalai for testing ❤️

Fedora behaving so differently might indicate an issue in the AMD Windows driver. I will contact AMD and let them know about this.

drywolf commented 1 month ago

@Mikalai the last time I worked with AMD GPUs on Linux there were two different kinds of drivers, the open-source driver and the proprietary "ROCm" driver. Which one of these are you using on your Fedora Linux? Also the exact driver-version would be of help when I report this to AMD. Thanks 🙏

Mikalai commented 1 month ago

@drywolf vulkaninfo --summary reports this:

Devices:
========
GPU0:
    apiVersion         = 1.3.274
    driverVersion      = 24.0.8
    vendorID           = 0x1002
    deviceID           = 0x73a5
    deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName         = AMD Radeon RX 6950 XT (RADV NAVI21)
    driverID           = DRIVER_ID_MESA_RADV
    driverName         = radv
    driverInfo         = Mesa 24.0.8
    conformanceVersion = 1.3.0.0
    deviceUUID         = 00000000-0a00-0000-0000-000000000000
    driverUUID         = 414d442d-4d45-5341-2d44-525600000000
GPU1:
    apiVersion         = 1.3.274
    driverVersion      = 0.0.1
    vendorID           = 0x10005
    deviceID           = 0x0000
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM 18.1.1, 256 bits)
    driverID           = DRIVER_ID_MESA_LLVMPIPE
    driverName         = llvmpipe
    driverInfo         = Mesa 24.0.8 (LLVM 18.1.1)
    conformanceVersion = 1.3.1.1
    deviceUUID         = 6d657361-3234-2e30-2e38-000000000000
    driverUUID         = 6c6c766d-7069-7065-5555-494400000000

drywolf commented 1 month ago

@Mikalai To me this looks like the Mesa RADV Vulkan Driver ... i.e. this is not the driver developed by AMD, but a driver that is developed by the Linux community AFAIK.

driverID = DRIVER_ID_MESA_RADV

The official (proprietary) AMD driver would be showing something like:

driverID = DRIVER_ID_AMD_PROPRIETARY
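
For anyone collecting reports from several machines, the same check can be done on the saved text of `vulkaninfo --summary`. This is a minimal sketch that assumes the `key = value` line layout shown in the output above; `extractDriverIds` and `isMesaRadv` are hypothetical helper names, not part of VSG or the Vulkan SDK.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Returns the value of every "driverID = ..." line found in the text,
// in the order in which the devices are listed.
std::vector<std::string> extractDriverIds(const std::string& summary)
{
    std::vector<std::string> ids;
    std::istringstream lines(summary);
    std::string line;
    while (std::getline(lines, line))
    {
        std::istringstream fields(line);
        std::string key, equals, value;
        // Property lines are assumed to have the shape: key = value
        if (fields >> key >> equals >> value && key == "driverID" && equals == "=")
            ids.push_back(value);
    }
    return ids;
}

// True if the driver ID names the Mesa RADV driver rather than AMD's own driver.
bool isMesaRadv(const std::string& driverId)
{
    return driverId == "DRIVER_ID_MESA_RADV";
}
```

Run against the output posted above, this would return two entries, one per listed device, with the first identifying RADV.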

drywolf commented 1 month ago

@robertosfield

Do any of the standard VSG examples exhibit the same performance issue?

I am in the process of setting up a more complete repro case, and I have now also reproduced the issue with windowed VSG example code. At first the COPY workload was not as noticeable, because by default VSG limits the framerate to 60 FPS. But then I uncapped the framerate via windowTraits->swapchainPreferences.presentMode = VK_PRESENT_MODE_IMMEDIATE_KHR;

With that I am getting:

on AMD RX 5700 XT

on NV RTX 2080 TI

The VSG code is basically the vsghelloworld.cpp example, but without rendering any 3D scene. (so just a window with a clear color, and nothing else)

drywolf commented 1 month ago

I now created a self-contained Github repo that contains the same code for headless/offscreen VSG rendering that I already posted above.

https://github.com/drywolf/vsg_amd_perf (this is using vcpkg to fetch VSG, so there should be little to no extra effort needed to build this)

Additionally I also added another minimal VSG app that is rendering to a vsg::Window / Swapchain.

PS: The windowed app now also works with some of the AMD profiling tools. I have only had a quick first chance to do some profiling, but at first glance these tools are not showing any obvious Copy workloads or bottlenecks. The framerate in this app is nevertheless as low as in the offscreen/headless rendering tests above.

robertosfield commented 1 month ago

Another thing you could look at is whether the windowing system is doing compositing, in which case the application is rendering to a buffer that is then used by the compositor as input. Fullscreen without window decoration should bypass the compositor, but it is down to the OS/drivers to implement this properly.

I'll have to defer to Windows devs for guidance on how to control Windows desktop composition and driver settings, as I'm only an occasional Windows user with no expertise on the platform.

drywolf commented 1 month ago

Another thing you could look at is whether the windowing system is doing compositing

I disabled all Windows 11 advanced compositing options (following this guide) and ran the app in fullscreen mode by setting windowTraits->fullscreen = true. This made no FPS difference on AMD: the 3D and Copy workloads remained pretty much unchanged (3D ~20%, Copy ~64%, framerate still low at around 350-360 FPS).

robertosfield commented 1 month ago

Another variable you could experiment with is different formats for the colour and depth buffers, perhaps the defaults chosen by the VSG are tripping up the driver into a slower path on this particular hardware/driver combination.

drywolf commented 1 month ago

Yeah, that's a good idea 👍 I already did this yesterday with VK_FORMAT_R8_UNORM, and it gave me the same results as the default VK_FORMAT_R8G8B8A8_UNORM. But I will try some other formats as well, just in case there is some insight to be gained.

PS: the offscreen_perf app in the example repo now is only using a color-attachment, and no depth-attachment anymore. So the issue is still present without any depth-stencil rendering. (for the offscreen/headless case)

drywolf commented 1 month ago

I now tried a couple more VkFormats, and none of them showed any significant difference in performance.