Performance with Intel surfaces rendered with Intel GPU to Nvidia output

ids1024 commented 1 year ago

It seems good to document what I've seen with performance somewhere.

I've been seeing about 20fps with es2gears on a 1440p monitor, on a 1650 mobile with the 545.23.06 driver.

Looking at Tracy profiling, using https://github.com/Smithay/smithay/pull/1134 with some changes:

[Screenshot of the Tracy profile: 2023-10-30T17:08:39,229664208-07:00]

Context 0 in this case is the Nvidia GPU. The portion of render before clear seems to be the wait at https://github.com/Smithay/smithay/blob/4f9480e5e02d05379e14ec49e1135cc8a57275d1/src/backend/renderer/multigpu/mod.rs#L1144. Without that wait, performance is improved but still less than 60fps (and there's artifacting, of course).

Both GPUs spend about 10ms here drawing in render_texture_from_to.
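For scale, here is a rough sketch of how those ~10ms draws compare to the frame budget. The helper names are made up for illustration, and it assumes the two draws run back to back rather than overlapping, which the profile suggests but which I haven't confirmed.

```python
def frame_budget_ms(refresh_hz: float) -> float:
    """Milliseconds available to produce one frame at the given rate."""
    return 1000.0 / refresh_hz


def fits_budget(stage_times_ms, refresh_hz: float) -> bool:
    """True if the stages, run back to back, finish within one frame."""
    return sum(stage_times_ms) <= frame_budget_ms(refresh_hz)


# ~10 ms in render_texture_from_to on each GPU, as read off the profile.
# Assumption: the two draws are serialized and do not overlap.
stages = [10.0, 10.0]

print(f"60 Hz budget: {frame_budget_ms(60):.2f} ms")
print(f"fits in a 60 Hz frame: {fits_budget(stages, 60)}")
print(f"ceiling from these stages alone: {1000.0 / sum(stages):.0f} fps")
```

Under that assumption the two draws already exceed the 16.67ms budget for 60fps, but they only cap things at about 50fps, so they don't account for the ~20fps observed on their own.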

With Intel rendering to an Intel target (but just at 1080p), it only takes about 967μs. Forcing a linear modifier raises that to around 2-5ms, seemingly with more variation frame to frame. I guess that's just (multi-level) caching? So rendering to a linear modifier is part of the performance difference here, though presumably that should be fine, and normal for reverse PRIME like this...

It is strange though to see draw_solid in GlesFrame::clear taking 3.85ms. That should be fast and just involve the Nvidia framebuffer, which should be efficient to render to...

ids1024 commented 11 months ago

Also seeing bad performance with Nvidia surfaces on an Nvidia output, strangely, even when modifying the compositor to only use the Nvidia GPU. It blocks on EGLFence::wait, but removing the wait call doesn't improve fps significantly; it only changes where it blocks.

Performance seems to vary in OpenMW depending on what the camera is pointed at, but not just based on how much of the screen is changing. So maybe render calls are slow on the client for some reason? (So the impact depends on what it's rendering.)

I don't think another GPU or the CPU should be accessing the buffer in this case. Using 8 bit color instead of 10 bit doesn't help. It is using an Nvidia modifier, though I don't know if it's the most optimal Nvidia specific modifier. But presumably Nvidia's egl-wayland would choose a reasonable one out of those offered...

flukejones commented 10 months ago

Per my comment in #281 above, this seems related to P states. The P states in both 535 and 545 are all over the show. When the P state is forced to P0 everything becomes smooth.

A test I have done is running vkcube, then drag by the titlebar (whatever distance) and don't let go after movement stops - the animation becomes super smooth with nvidia-open and nvidia-proprietary, on both 535 and 545. For me it seems to flutter between P4 and P5 unless forced. And apparently some games are impacted also.

This affects only the nvidia outputs, I guess the buffer slinging needs much more speed compared to when the internal display is used.

A cursory sidenote: I think the internal screen is affected by the same issue if the external out is connected.

flukejones commented 10 months ago

Hmm, kernel 6.7.1 seems to fix the issues I was having with nvidia p-states. However the click/drag issue still applies.

Edit: false alarm. P-state flutters between P0, P4, P5. Causes stuttering.
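One way to put numbers on the flutter is to sample the P-state while reproducing the stutter. A minimal sketch, assuming nvidia-smi is installed and the Nvidia GPU is index 0; the function names are hypothetical and the sample list is made up to match the pattern described here, not captured from a real run.

```python
import subprocess


def read_pstate(gpu_index: int = 0) -> str:
    """Ask nvidia-smi for the current performance state, e.g. 'P0'."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=pstate", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def count_transitions(samples):
    """Number of times two consecutive samples differ."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)


# Made-up samples shaped like the P0/P4/P5 flutter (not real measurements).
example = ["P0", "P4", "P5", "P5", "P4", "P0"]
print(count_transitions(example))  # 4
```

Calling read_pstate() once a second in a loop while dragging the vkcube window, then passing the collected list to count_transitions, would show whether the stutters line up with state changes.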

flukejones commented 10 months ago

I have 6.7.1 on all 3 of my ASUS ROG machines. One seems to have no p-state issue, while the others do. Same install etc., so I'm not sure what's going on. I'm about to test the 550 nvidia driver and will report back.

flukejones commented 10 months ago

I'm bewildered. So I might be confusing the issue a bit. As stated above, the click-drag-hold behaviour with glxgears/vkcube still exists with the 550 nvidia driver.

Almost all games run very very well for me. With the exception of Quake re-release which runs like a bucket of dried poo at higher res. Quake II re-release appears to run super well. Cyberpunk also seems much improved but requires v-sync for smoothness.

Hmm.. cosmic-comp is excellent. The improvement over gnome is honestly a bit absurd. The difference between the 545 and 550 drivers... I can't really tell. Cosmic does feel much better, but that could well be a placebo effect for me. I tested with a ROG Strix's dGPU outputs, and a ROG X16 plus eGPU (XG Mobile) outputs.

ids1024 commented 10 months ago

A test I have done is running vkcube, then drag by the titlebar (whatever distance) and don't let go after movement stops

Ah. So that would be related to the heuristic cosmic-comp tries to use to decide which GPU to composite on. (Which could probably be improved.)

The Wayland protocol for GPU buffers also probably needs some improvement to better handle multiple GPUs. (Like https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/268)

Anyway, you should see it work better if the client is started with WAYLAND_DISPLAY=wayland-1-renderD129 (assuming renderD129 is the Nvidia GPU), and the window is on the Nvidia output.
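As a sketch of what that looks like from a launcher script: the helper below is hypothetical, and the socket naming pattern is generalized from the single wayland-1-renderD129 example, so check which sockets actually exist under $XDG_RUNTIME_DIR on your machine.

```python
import os


def env_for_gpu_socket(render_node, base_socket="wayland-1", base_env=None):
    """Copy the environment and point WAYLAND_DISPLAY at a per-GPU socket.

    Assumption: sockets are named "<base_socket>-<render_node>", as in the
    wayland-1-renderD129 example; which node is the Nvidia GPU varies.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["WAYLAND_DISPLAY"] = f"{base_socket}-{render_node}"
    return env


env = env_for_gpu_socket("renderD129", base_env={})
print(env["WAYLAND_DISPLAY"])  # wayland-1-renderD129

# To launch a client with it:
#   import subprocess
#   subprocess.run(["vkcube"], env=env_for_gpu_socket("renderD129"))
```

Copying the environment rather than editing os.environ keeps the change local to the one client being started.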

flukejones commented 10 months ago

The linked protocol extension looks pretty important for hybrid. But my understanding of how nvidia (open and proprietary) work - don't they work slightly different to usual or did that all change once nvidia finally supported gbm? I'm a bit behind the times on this.

The drag thing seems to be no longer relevant. As for games.. Cyberpunk is stuttery without v-sync, not sure where to go with that. Since it's in Xwayland it's a bit different.

Running __NV_PRIME_RENDER_OFFLOAD=1 vkcube --present_mode 0 sends p-state to P0 since it's not v-sync limited now, and the hiccups from constant p-state change are gone. Also not sure where to go with that. If you have some hints on where to file bug reports I'd be happy to do so.

Edit: additional info:

game + no v-sync + FPS over refresh rate = smooth
game + no v-sync + FPS under refresh rate = stutters every few seconds (p-state change?)
game + v-sync + FPS over refresh rate = smooth
game + v-sync + FPS under refresh rate = smooth (and possible input lag?)

Quake II RTX with switching renderers between RTX and openGL is a good test case.

ids1024 commented 10 months ago

But my understanding of how nvidia (open and proprietary) work - don't they work slightly different to usual or did that all change once nvidia finally supported gbm? I'm a bit behind the times on this.

Yep, that's changed. Nvidia eventually decided to stop pursuing EGLStreams for Wayland, and now supports GBM (unless your card is too old to be supported by the current drivers, like GT 700 cards). They are now using the same dmabuf protocol that Mesa uses. That part of the driver is in https://github.com/NVIDIA/egl-wayland.

So if the dmabuf protocol needs improvement, hopefully Mesa and Nvidia drivers will both make use of it.

flukejones commented 9 months ago

Anyway, you should see it work better if the client is started with WAYLAND_DISPLAY=wayland-1-renderD129 (assuming renderD129 is the Nvidia GPU), and the window is on the Nvidia output.

For some things this does indeed have a very heavy impact. But it also seems that things running in xwayland are not so impacted by it. A quick test for me was using the wgpu examples. With __NV_PRIME_RENDER_OFFLOAD=1 WAYLAND_DISPLAY=wayland-1-renderD129 WGPU_ADAPTER_NAME=nvidia cargo run --bin wgpu-examples water, the difference is 400fps vs 1400fps.

For glxgears and vkcube they run at full refresh but are still stuttering occasionally. I didn't try vkcube-wayland as that locked up cosmic last time I tried.

I just installed KDE6 also for testing and this is quite a stark difference. KDE-5 was terrible, but KDE-6 is like putting on a clean pair of silk underpants, nvidia outputs are very smooth, and so are games that run under the refresh rate. I'm unsure if kde-6 is forcing a vsync however. But this smoothness is for all the examples above.

Edit: the key difference so far is that under kde-6 the p-state for nvidia is always P0. On other desktops (cosmic, kde5, gnome) it fluctuates P0/P4/P5. And every time the P state changes it gets a jank/stutter.

Drakulix commented 9 months ago

For some things this does indeed have a very heavy impact. But it also seems that things running in xwayland are not so impacted by it.

This can't have an effect on Xwayland apps, as it affects the wayland connection, which X clients know nothing about. So until we run one Xwayland instance per gpu, using the PRIME env variables is the best you can do for Xwayland apps. (Or running them through another rootful Xwayland instance or gamescope launched with our custom socket, which also circumvents the issue.)

The goal is ultimately to do just that and set the appropriate environment variables accordingly through our launcher and other means of starting applications on cosmic.
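For reference, a sketch of what setting the PRIME variables for an X client could look like. Only __NV_PRIME_RENDER_OFFLOAD appears in this thread; the other two are the companion variables from Nvidia's PRIME render offload README. The helper name is hypothetical, and this is not a description of how the cosmic launcher does or will do it.

```python
import os


def prime_offload_env(base_env=None):
    """Copy the environment and add Nvidia's PRIME render offload variables."""
    offload_env = dict(os.environ if base_env is None else base_env)
    # Used earlier in this thread.
    offload_env["__NV_PRIME_RENDER_OFFLOAD"] = "1"
    # Not from this thread: taken from Nvidia's driver README.
    offload_env["__GLX_VENDOR_LIBRARY_NAME"] = "nvidia"   # GLX clients
    offload_env["__VK_LAYER_NV_optimus"] = "NVIDIA_only"  # Vulkan clients
    return offload_env


# To launch an X client with it:
#   import subprocess
#   subprocess.run(["glxgears"], env=prime_offload_env())
```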

flukejones commented 9 months ago

@Drakulix I apologize for not posting in quite the right issue, but in the interests of keeping data points in at least a reasonable place with other points.. something I did notice:

WAYLAND_DISPLAY=wayland-1-renderD129 __NV_PRIME_RENDER_OFFLOAD=1 cargo run --bin wgpu-examples -- shadow

this runs very well if there are no xwayland windows on the screen. And a side effect is that all cosmic windows disappear while the example is running - other wayland windows do not. Frame timings go from (no env vars):

[2024-01-30T20:35:41Z INFO  wgpu_examples::framework] Frame time 1.52ms (657.9 FPS)

to

[2024-01-30T20:40:43Z INFO  wgpu_examples::framework] Frame time 0.99ms (1013.1 FPS)

If the window thing is of importance I can create an issue for it

Hmm.. running Steam increases frame times, and it is also hidden when running the env var command example.

pop-os / cosmic-comp

Performance with Intel surfaces rendered with Intel GPU to Nvidia output #211