Open egnor opened 2 years ago
@mripard Any thoughts on this?
The first one is wow, that bug report is awesome, thanks!
About the issue itself, yes, the analysis is accurate. It's been a while and it's fairly complex, so my recollection might be off a bit, but here it is.
Since all the planes are stored in the (single) HVS, and the muxing between the HVS and the pixelvalves (CRTCs) needs to be maintained, the HVS state is very much common to all the CRTCs and I'm not sure how we could operate reliably without the synchronisation.
It's some kind of a weird interaction between how KMS is designed and how the hardware works, but it's mainly due to a couple of things:
The plane RAM allocation is dealt with using a (locked) allocator, and we'll keep the RAM reserved for as long as we need it, so we're safe there.
The CRTC <-> HVS FIFO muxing is a bit difficult to work with though. Since we really want to make sure we have a FIFO available by the time we reach atomic_commit, we need to list all the ones that aren't. Now, we don't have the state of all the KMS entities in a commit global state, but only the ones affected by that commit, so it's not truly global. Thus, we don't have the current state of the other CRTCs, and we can't just access their FIFO.
We thus need to access their state for that new commit, and doing so creates a dependency of our new state on the old state, thus making us wait for the previous one to complete. Pulling the other CRTC state created other issues (see f2df84e096a8254ddb18c531b185fc2a45879077), so we moved from it to a global shared state.
That still doesn't explain why we need to wait for the previous commit though. KMS also has non-blocking atomic commits where the atomic_check will be run right away, but the atomic_commit isn't, it's just scheduled to run and will do so at some point. Like we said before, if we have two non-blocking commits running in parallel and affecting two separate CRTCs, and not sharing any resource, KMS won't serialize them.
So let's say we have two non-blocking commits in rapid succession completely independent from each other. KMS will schedule their atomic_commit, but you then have no guarantee that the first commit is going to be executed first.
In addition, each state being committed has a pointer to both the state being committed, but also to the previous state. So we end up in atomic_check with:
Initial state Commit 1 (new state: state 1, old state: initial state) Commit 2 (new state: state 2, old state: state 1)
But if commit 2 runs before commit 1, in atomic_commit we'll end up with: Initial State Commit 2 (new state: state 2, old state: state 1) Commit 1 (new state: state 1, old state: Initial State)
The Commit 2 will have replaced Initial State, and our initial HVS global state has been freed. By the time we run Commit 1, if we access our old state, we do a use-after-free. See 9ec03d7f1ed394897891319a4dda75f52c5d292d
So we need to enforce the ordering of commits to prevent that pattern, and forcing that ordering means waiting for the previous commit to be done, vblank included...
It's really unfortunate, but without a complete redesign of the internal structures of KMS or of the hardware, I don't think we can do better :/
Thank you for the explainer! 💛
Any thoughts on how to actually achieve full rate video (update every vsync) on multiple displays? Surely it must be possible somehow.
I've never been clear on what happens if I submit an atomic update that touches multiple CRTCs. Does it wait for each of them to have their vsync?
I think it should work.
In such a case, the core will wait for all CRTCs vblank in the commit https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/vc4/vc4_kms.c#L412 https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_atomic_helper.c#L1562
Chances are the vsync will be synchronized, or close to, so you wouldn't wait for long and both CRTCs would be updated at once so you wouldn't end up in the pattern above.
Describe the bug
When outputting to multiple displays, I see difficult-to-explain slowness where a plane-only atomic KMS update will often take longer than a frame to complete, causing frame skips. This is my best understanding as to why this happens (but I am not an expert in the kernel code!):
As written in Linux kernel documentation "Mode Setting Helper Functions" extracted from
drm_atomic_helper.c
(boldface mine):However, the vc4 driver maintains several driver-global state structures: https://github.com/raspberrypi/linux/blob/10aaaecddda5ff530bc5bf8f2872b789555e7fee/drivers/gpu/drm/vc4/vc4_drv.h#L226
These structures are always referenced during the KMS atomic check and commit process: https://github.com/raspberrypi/linux/blob/10aaaecddda5ff530bc5bf8f2872b789555e7fee/drivers/gpu/drm/vc4/vc4_kms.c#L201 https://github.com/raspberrypi/linux/blob/10aaaecddda5ff530bc5bf8f2872b789555e7fee/drivers/gpu/drm/vc4/vc4_kms.c#L461 https://github.com/raspberrypi/linux/blob/10aaaecddda5ff530bc5bf8f2872b789555e7fee/drivers/gpu/drm/vc4/vc4_kms.c#L365 https://github.com/raspberrypi/linux/blob/10aaaecddda5ff530bc5bf8f2872b789555e7fee/drivers/gpu/drm/vc4/vc4_kms.c#L607 etc.
Because these are driver-global structures, this necessarily serializes atomic changes, even across display outputs (CRTCs). For example, consider this timeline assuming a 50Hz refresh, 0.020s/frame, on both HDMI-1 and HDMI-2:
The delay happens because the HDMI-1 request ends up needing to wait for the HDMI-2 request to complete, since they both depend on the same driver-global data structures. As written in Linux kernel documentation "Kernel Mode Setting" about private objects:
(In the vc4 driver, I believe these synchronization holds are done by the standard helper functions.)
From the user perspective, what this means is that it seems very difficult (perhaps impossible) to use atomic KMS to drive full-rate output on multiple displays. Even if as a caller one attempts to observe vsync timing and sequence atomic requests, there is the (quite likely!) case where two displays have a vsync very close to each other, such that it's difficult to know the order to make the request, and indeed even if known the serialization may still cause the second update to miss its vsync opportunity. What is seen is that atomic requests end up mysteriously delayed for over a frame, harming display refresh rate.
Changing this seems like it would be a major architectural change; indeed, I don't know how to perform the kind of resource tracking done by
vc4_load_tracker_atomic_check()
without driver global state. (Preallocated resources assigned to each CRTC?) However, at the same time, I don't know how to use these APIs for reliable full-rate output under these circumstances. And yet, I don't see the likes of X11 having trouble driving both displays... so quite likely there's something I don't understand??Steps to reproduce the behaviour
(Specific test code can be extracted, if it would be helpful.)
Device (s)
Raspberry Pi 4 Mod. B
System
Logs
No response
Additional context
See also: