Atomic KMS/DRM on multiple displays is hindered by driver-global serialization of atomic commits

egnor commented 2 years ago

Describe the bug

When outputting to multiple displays, I see difficult-to-explain slowness where a plane-only atomic KMS update will often take longer than a frame to complete, causing frame skips. This is my best understanding as to why this happens (but I am not an expert in the kernel code!):

As written in Linux kernel documentation "Mode Setting Helper Functions" extracted from drm_atomic_helper.c (boldface mine):

Asynchronous workers need to have sufficient parallelism to be able to run different atomic commits on different CRTCs in parallel. The simplest way to achieve this is by running them on the system_unbound_wq work queue. Note that drivers are not required to split up atomic commits and run an individual commit in parallel - userspace is supposed to do that if it cares. But it might be beneficial to do that for modesets, since those necessarily must be done as one global operation, and enabling or disabling a CRTC can take a long time. But even that is not required.

IMPORTANT: A drm_atomic_state update for multiple CRTCs is sequenced against all CRTCs therein. Therefore for atomic state updates which only flip planes the driver must not get the struct drm_crtc_state of unrelated CRTCs in its atomic check code: This would prevent committing of atomic updates to multiple CRTCs in parallel. In general, adding additional state structures should be avoided as much as possible, because this reduces parallelism in (nonblocking) commits, both due to locking and due to commit sequencing requirements.

However, the vc4 driver maintains several driver-global state structures: https://github.com/raspberrypi/linux/blob/10aaaecddda5ff530bc5bf8f2872b789555e7fee/drivers/gpu/drm/vc4/vc4_drv.h#L226

These structures are always referenced during the KMS atomic check and commit process: https://github.com/raspberrypi/linux/blob/10aaaecddda5ff530bc5bf8f2872b789555e7fee/drivers/gpu/drm/vc4/vc4_kms.c#L201 https://github.com/raspberrypi/linux/blob/10aaaecddda5ff530bc5bf8f2872b789555e7fee/drivers/gpu/drm/vc4/vc4_kms.c#L461 https://github.com/raspberrypi/linux/blob/10aaaecddda5ff530bc5bf8f2872b789555e7fee/drivers/gpu/drm/vc4/vc4_kms.c#L365 https://github.com/raspberrypi/linux/blob/10aaaecddda5ff530bc5bf8f2872b789555e7fee/drivers/gpu/drm/vc4/vc4_kms.c#L607 etc.

Because these are driver-global structures, this necessarily serializes atomic changes, even across display outputs (CRTCs). For example, consider this timeline assuming a 50Hz refresh, 0.020s/frame, on both HDMI-1 and HDMI-2:

T+0.000: HDMI-1 vsync
T+0.001: HDMI-2 vsync
T+0.002: Atomic request submitted for HDMI-2 (lots of time until the next vsync!)
T+0.003: Atomic request submitted for HDMI-1 (ditto); must happen after the previous request
T+0.020: next HDMI-1 vsync, BUT the requested HDMI-1 atomic update doesn't take place!
T+0.021: next HDMI-2 vsync, and the requested HDMI-2 atomic update does take place
T+0.040: another HDMI-1 vsync, NOW the requested HDMI-1 atomic update takes place (one frame delayed)

The delay happens because the HDMI-1 request ends up needing to wait for the HDMI-2 request to complete, since they both depend on the same driver-global data structures. As written in Linux kernel documentation "Kernel Mode Setting" about private objects:

If that private object is used to store a state shared by multiple CRTCs, proper care must be taken to ensure that non-blocking commits are properly ordered to avoid a use-after-free issue.

Indeed, assuming a sequence of two non-blocking drm_atomic_commit on two different drm_crtc using different drm_plane and drm_connector, so with no resources shared, there’s no guarantee on which commit is going to happen first. However, the second drm_atomic_commit will consider the first drm_private_obj its old state, and will be in charge of freeing it whenever the second drm_atomic_commit is done.

If the first drm_atomic_commit happens after it, it will consider its drm_private_obj the new state and will be likely to access it, resulting in an access to a freed memory region. Drivers should store (and get a reference to) the drm_crtc_commit structure in our private state in drm_mode_config_helper_funcs.atomic_commit_setup, and then wait for that commit to complete as the first step of drm_mode_config_helper_funcs.atomic_commit_tail, similar to drm_atomic_helper_wait_for_dependencies().

(In the vc4 driver, I believe these synchronization holds are done by the standard helper functions.)

From the user perspective, what this means is that it seems very difficult (perhaps impossible) to use atomic KMS to drive full-rate output on multiple displays. Even if as a caller one attempts to observe vsync timing and sequence atomic requests, there is the (quite likely!) case where two displays have a vsync very close to each other, such that it's difficult to know the order to make the request, and indeed even if known the serialization may still cause the second update to miss its vsync opportunity. What is seen is that atomic requests end up mysteriously delayed for over a frame, harming display refresh rate.

Changing this seems like it would be a major architectural change; indeed, I don't know how to perform the kind of resource tracking done by vc4_load_tracker_atomic_check() without driver global state. (Preallocated resources assigned to each CRTC?) However, at the same time, I don't know how to use these APIs for reliable full-rate output under these circumstances. And yet, I don't see the likes of X11 having trouble driving both displays... so quite likely there's something I don't understand??

Steps to reproduce the behaviour

(Specific test code can be extracted, if it would be helpful.)

Using KMS interfaces, initialize both HDMI outputs on a Pi 4 (or equivalent) with appropriate video modes
Run a separate thread doing full-rate atomic plane updates (a simple animation, say) on each output
Observe that frames are often skipped due to the serialization described above

Device (s)

Raspberry Pi 4 Mod. B

System

% cat /etc/rpi-issue
Raspberry Pi reference 2022-04-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 226b479f8d32919c9fe36dd5b4c20c02682f8180, stage4
% vcgencmd version
Mar 24 2022 13:19:26 
Copyright (c) 2012 Broadcom
version e5a963efa66a1974127860b42e913d2374139ff5 (clean) (release) (start)
% uname -a
Linux egnor-pi4-6 5.15.32-v7l+ #1538 SMP Thu Mar 31 19:39:41 BST 2022 armv7l GNU/Linux

Logs

No response

Additional context

See also:

6by9 commented 2 years ago

@mripard Any thoughts on this?

mripard commented 2 years ago

The first one is wow, that bug report is awesome, thanks!

About the issue itself, yes, the analysis is accurate. It's been a while and it's fairly complex, so my recollection might be off a bit, but here it is.

Since all the planes are stored in the (single) HVS, and the muxing between the HVS and the pixelvalves (CRTCs) needs to be maintained, the HVS state is very much common to all the CRTCs and I'm not sure how we could operate reliably without the synchronisation.

It's some kind of a weird interaction between how KMS is designed and how the hardware works, but it's mainly due to a couple of things:

Like said previously, KMS only serializes commits per CRTC, so if we have two concurrent commits that don't share any resource, we can pretty much assume that they are going to execute in parallel (we also have to take into account scheduling, internal locking and so on, so it's probably not going to happen in parallel, but we don't have any guarantee on the order anyway)
In the atomic world, commits are done in two steps, first we have atomic_check that is allowed to fail and then (optionally) we can perform a commit in atomic_commit. atomic_commit cannot fail, so the expectation is that all the resources are pre-allocated in atomic_check
There is a single HVS, common to all the CRTCs, that will store plane hardware descriptors in a single internal RAM. It also has a limited amout (N) of internal FIFOs that can be routed to a number of pixelvalves / CRTCs (M), with N < M. So we cannot assume that each CRTC has a FIFO it can use in any given situation.

The plane RAM allocation is dealt with using a (locked) allocator, and we'll keep the RAM reserved for as long as we need it, so we're safe there.

The CRTC <-> HVS FIFO muxing is a bit difficult to work with though. Since we really want to make sure we have a FIFO available by the time we reach atomic_commit, we need to list all the ones that aren't. Now, we don't have the state of all the KMS entities in a commit global state, but only the ones affected by that commit, so it's not truly global. Thus, we don't have the current state of the other CRTCs, and we can't just access their FIFO.

We thus need to access their state for that new commit, and doing so creates a dependency of our new state on the old state, thus making us wait for the previous one to complete. Pulling the other CRTC state created other issues (see f2df84e096a8254ddb18c531b185fc2a45879077), so we moved from it to a global shared state.

That still doesn't explain why we need to wait for the previous commit though. KMS also has non-blocking atomic commits where the atomic_check will be run right away, but the atomic_commit isn't, it's just scheduled to run and will do so at some point. Like we said before, if we have two non-blocking commits running in parallel and affecting two separate CRTCs, and not sharing any resource, KMS won't serialize them.

So let's say we have two non-blocking commits in rapid succession completely independent from each other. KMS will schedule their atomic_commit, but you then have no guarantee that the first commit is going to be executed first.

In addition, each state being committed has a pointer to both the state being committed, but also to the previous state. So we end up in atomic_check with:

Initial state Commit 1 (new state: state 1, old state: initial state) Commit 2 (new state: state 2, old state: state 1)

But if commit 2 runs before commit 1, in atomic_commit we'll end up with: Initial State Commit 2 (new state: state 2, old state: state 1) Commit 1 (new state: state 1, old state: Initial State)

The Commit 2 will have replaced Initial State, and our initial HVS global state has been freed. By the time we run Commit 1, if we access our old state, we do a use-after-free. See 9ec03d7f1ed394897891319a4dda75f52c5d292d

So we need to enforce the ordering of commits to prevent that pattern, and forcing that ordering means waiting for the previous commit to be done, vblank included...

It's really unfortunate, but without a complete redesign of the internal structures of KMS or of the hardware, I don't think we can do better :/

egnor commented 2 years ago

Thank you for the explainer! 💛

Any thoughts on how to actually achieve full rate video (update every vsync) on multiple displays? Surely it must be possible somehow.

I've never been clear on what happens if I submit an atomic update that touches multiple CRTCs. Does it wait for each of them to have their vsync?

mripard commented 1 year ago

I think it should work.

In such a case, the core will wait for all CRTCs vblank in the commit https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/vc4/vc4_kms.c#L412 https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_atomic_helper.c#L1562

Chances are the vsync will be synchronized, or close to, so you wouldn't wait for long and both CRTCs would be updated at once so you wouldn't end up in the pattern above.

raspberrypi / linux