Closed by ardera 4 years ago
What's your use case?
From your firmware log you appear to be loading the vc4-fkms-v3d overlay twice, which could be confusing (but not the cause of the delay).
Okay, I'll change it anyway to see if it fixes something.
What's your use case?
I'm using omxplayer to implement video playback for flutter-pi. I'm using DRM planes to draw stuff behind and in front of omxplayer.
I've added logging of the timing around the drmModeSetPlane call in drm_mmal. I'm seeing either a couple of ms, or 15ms. (My display is running at 60Hz). I've inserted a 1s delay between each drmModeSetPlane call so that it is obvious what each update is doing.
Looking at the firmware logs I'm not seeing any big delays on any of the updates. It says the update via the mailbox call is completed in under 1ms. The kernel logging shows entries similar to:
[ 4423.438234] [drm:drm_atomic_state_init [drm]] Allocated atomic state 7b73c98d
[ 4423.438426] [drm:drm_atomic_get_plane_state [drm]] Added [PLANE:38:plane-1] bc2c41fb state to 7b73c98d
[ 4423.438616] [drm:drm_atomic_get_crtc_state [drm]] Added [CRTC:52:crtc-0] d1ac5951 state to 7b73c98d
[ 4423.438801] [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:67] for [PLANE:38:plane-1] state bc2c41fb
[ 4423.438986] [drm:drm_atomic_check_only [drm]] checking 7b73c98d
[ 4423.439084] [drm:vc4_plane_atomic_check [vc4]] [PLANE:38:plane-1] plane update 1920x1080@3 +dst(0,0, 1360,768) +src(0,0, 125829120,70778880) 0xe1100000/e12fe000/e137d800/1920, alpha 65535 zpos 1
[ 4423.439161] [drm:vc4_crtc_atomic_check [vc4]] [CRTC:52] crtc_atomic_check.
[ 4423.439348] [drm:drm_atomic_commit [drm]] committing 7b73c98d
[ 4423.439446] [drm:vc4_plane_set_blank [vc4]] [PLANE:38:plane-1] overlay plane unblank
[ 4423.439828] [drm:vc4_crtc_atomic_flush [vc4]] [CRTC:52] crtc_atomic_flush.
[ 4423.448128] [drm:drm_atomic_state_default_clear [drm]] Clearing atomic state 7b73c98d
[ 4423.448320] [drm:__drm_atomic_state_free [drm]] Freeing atomic state 7b73c98d
So we do have a delay between atomic_flush and state clear and free.
On 4.19 it looks like there is potentially a delay for vsync during commit - https://github.com/raspberrypi/linux/blob/rpi-4.19.y/drivers/gpu/drm/drm_plane_helper.c#L412
I'm on the 5.4 kernel. I haven't checked whether we go through that path, but drm_atomic_helper_commit_tail looks like it will block. Actually I think we use vc4_atomic_commit. That calls vc4_atomic_complete_commit, which calls drm_atomic_helper_wait_for_flip_done.
The DRM API docs make it a nightmare to find out what the expected behaviour actually is, but I think drmModeSetPlane will block for vsync when sitting on top of an atomic driver. Unfortunately I don't immediately see API calls that allow a non-blocking update. I suspect that libdrm may not allow it, and the code in the igt repo (Intel Graphics Tests) may be more suitable. Sorry, I'm feeling my way around this lot as much as anyone.
Just edited my top comment to include a reduced test case.
I consistently get ~14.7ms there, so I'm able to reproduce the vsync delay. However, 14.7ms is still far from the delays I was experiencing when running my main application. Maybe omxplayer somehow tries to get the display's refresh rate to match that of the video, though I disabled all the options regarding that feature. If I find that it's a kernel problem, I'll open a new issue.
The DRM API docs are a nightmare to find out what is actually the expected behaviour, but I think that drmModeSetPlane will block for vsync when sitting on top of an atomic driver.
Yeah, I don't know either. I heard something about drmModeSetPlane et al. being "unsynchronized nightmares" somewhere (that's why I assumed it was not vsynced), but nothing specific about atomic was mentioned. In contrast, this nvidia doc says the nvidia driver, too, is waiting for vblank.
So I guess I'll have to switch to atomic then.
EDIT: Seems like, in my main application, drmModeSetPlane sometimes misses a vblank. So the drmModeSetPlane call executes for about 12ms, a page flip event for my primary plane occurs, and 14.5ms after the page flip the call to drmModeSetPlane returns. I'll try to build a test case for that.
(I'm drawing to two planes simultaneously, and the page flip event is triggered by the drmModePageFlip for the primary plane.)
Updated my reproduction code to reproduce the issue when concurrently drawing to two planes.
I now consistently get a 29.5ms delay with drmModeSetPlane with that code. Expected would be 14.5ms, as there's no reason for drmModeSetPlane to wait for two vblanks (that I know of, at least).
Were you able to find a solution to this? I'm running into the same issue.
Are you using the kms or fkms driver? You should be on kms (default with RpiOS bullseye).
I'm on KMS
I believe drmModeAtomicCommit is expected to block for up to a vsync. Look into the flags DRM_MODE_ATOMIC_NONBLOCK and DRM_MODE_PAGE_FLIP_ASYNC if you don't want this behaviour.
I'm experimenting with updating 2 planes in a single vblank. If I call drmModeAtomicCommit with DRM_MODE_ATOMIC_NONBLOCK twice during the same vblank, the call fails with EBUSY.
I've also tried calling drmModeSetPlane and drmModeAtomicCommit from different threads, though then each call blocks for 2 vblanks instead of 1.
Hard to comment without seeing exactly what you are doing. I can say that kodi runs on kms, and displays a video and a gui plane and can render 60fps video, so updating 2 planes in a single vsync is possible.
I'm experimenting with updating 2 planes in a single vblank. If I call drmModeAtomicCommit with DRM_MODE_ATOMIC_NONBLOCK twice during the same vblank, the call fails with EBUSY. I've also tried calling drmModeSetPlane and drmModeAtomicCommit from different threads, though each call will block for 2 vblanks instead of 1.
I think that's expected; if you want to update 2 planes in a single vblank, you have to put them both in the same atomic request.
That drmModeSetPlane takes that long to execute is expected as well (AFAICT), at least since atomic KMS has been around.
EDIT: Though I absolutely agree, KMS synchronization could be better documented. But that's not really a problem of the Raspberry Pi kernel.
Thank you! This is exactly what I needed.
Has anybody here had success using drmModeSetPlane() without VSYNC?
I managed to write into the frame buffer directly using CPU & mmap: https://github.com/Consti10/hello_drmprime/blob/master/drm-howto/modeset_latency.cpp
But all methods exposed by drm that swap the underlying frame buffer seem to force VSYNC.
@Consti10 DRM_MODE_PAGE_FLIP_ASYNC is the flag you're looking for. If you put that as an argument to drmModeAtomicCommit, it'll change the fb without waiting for vsync (see also popcornmix's reply above).
Do you know how to add this flag, for example, to drmModeSetCrtc()?
I've tried this snippet, but it doesn't have any effect on rpi. https://github.com/Consti10/hello_drmprime/blob/master/drm-howto/modeset-double-buffered_latency.cpp#L568
Aka drmModeSetCrtc() takes: Avg SwapBuffers: min=564.344971us max=21.431000ms avg=8.972000ms. One can nicely see here how it blocks for 8.9ms on average (half a vsync interval).
There's no way to provide this flag to drmModeSetCrtc, even the underlying ioctl doesn't support it. You can add it to drmModePageFlip though[^1]. In this case, just replace DRM_MODE_PAGE_FLIP_EVENT with DRM_MODE_PAGE_FLIP_EVENT | DRM_MODE_PAGE_FLIP_ASYNC.
[^1]: Not entirely sure. It should work, but the introduction of atomic modesetting changed some of the semantics (because legacy modesetting (== drmModeSetCrtc, drmModePageFlip, drmModeSetPlane, etc.) is now internally emulated via atomic modesetting). If this doesn't work, you can try using atomic modesetting.
You can add it to drmModePageFlip though. In this case, just replace DRM_MODE_PAGE_FLIP_EVENT with DRM_MODE_PAGE_FLIP_EVENT | DRM_MODE_PAGE_FLIP_ASYNC.
Doesn't look like it works. Even though I pass the flag to drmModePageFlip(), I still cannot observe any tearing.
This really sucks. There is literally no way to disable VSYNC on the rpi when using the DRM APIs.
In our specific use case (low latency video decoding and display) the only option would be to create a single frame buffer, mmap it to user space (like done here: https://github.com/dvdhrm/docs/blob/master/drm-howto/modeset.c) and copy the whole raw data of the decoded frame via CPU into this mmapped buffer.
It should work. The driver has the async page flip capability, and I looked a bit into the source; there's code in place to make async commits work (e.g. this is what commits the async update). So maybe there's a way to get it working.
Other than that, is up to 16ms of delay really too much for low latency video playback?
After a bit more investigation: yes, DRM_MODE_PAGE_FLIP_ASYNC has an effect. However, not the wanted one. With the flag not set, I get: Avg PageFlipRequest: min=1.602000ms max=3.708000ms avg=1.757000ms; Avg FrameDelta: min=13.849000ms max=19.549000ms avg=16.863001ms.
Aka the execution of drmModePageFlip() (which requests a page flip that is then handled later) takes 1.7ms on average, and the application is running at 60fps.
With the flag set, I get: Avg FrameDelta: min=748.119019us max=904.747986us avg=797.583984us; Avg PageFlipRequest: min=736.416016us max=852.617981us avg=780.658020us.
BUT no tearing. So I assume the driver is still internally syncing the swaps to the VSYNC, but doesn't block the calls anymore. This doesn't really help though ;/ we really need no VSYNC aka tearing (yes, we want tearing haha ;) ) for low latency.
Here is the code if anybody is interested: https://github.com/Consti10/hello_drmprime/blob/fad09ccfddcae3e29848e23cd700fbe2ab3714a6/drm-howto/modeset-vsync2.cpp#L690
What really sucks here is that, to me, it looks as if some wrong changes introduced all these issues. For example, the drm-kms documentation reads:
A call to drmModeSetCrtc(3) is executed immediately and forces the CRTC to use the new scanout buffer.
Aka it should just swap out the scanout buffer. But at least with the current rpi kernel, this is not true anymore; it waits for a VSYNC.
The documentation from "drm-howto" is therefore also outdated. For example, the modeset-double-buffered example is supposed to be double buffering without VSYNC, but due to the above issue it is actually double buffered with VSYNC.
AFAICT, this is not a bug, just outdated documentation. Even in the upstream kernel source, it seems like every drmModeSetCrtc implicitly waits for vblank. This seems to be the behaviour since atomic kernel modesetting was introduced ~4-5 years ago.
So I assume the driver is still internal syncing the swaps to the VSYNC, but doesn't block the callbacks anymore.
I guess some component of it still is, yeah. I also tried and couldn't get tearing.
I can't believe there is no way to disable VSYNC in Linux, so to say - surely there is one? I guess with OpenGL, for example, you can just get the framebuffer handle (the data area that is read out) and draw into it just like drawing into it with the CPU, but for double buffering + no VSYNC you'd then need drm again.
Btw, my assumption of how (for example) drmModeSetPlane is implemented on the Raspberry Pi 4 (following the call chain):
1) https://github.com/grate-driver/libdrm/blob/master/xf86drmMode.c#L988
2) https://github.com/raspberrypi/linux/blob/aeaa2460db088fb2c97ae56dec6d7d0058c68294/drivers/gpu/drm/drm_ioctl.c#L670
3) https://github.com/raspberrypi/linux/blob/rpi-5.10.y/drivers/gpu/drm/drm_plane.c#L800
4) https://github.com/raspberrypi/linux/blob/rpi-5.10.y/drivers/gpu/drm/drm_plane.c#L771
What happens here: https://github.com/raspberrypi/linux/blob/rpi-5.10.y/drivers/gpu/drm/drm_plane.c#L786
And where the actual VSYNC-ing happens I do not know.
Actually I think we use vc4_atomic_commit. That calls vc4_atomic_complete_commit, which calls drm_atomic_helper_wait_for_flip_done.
Am I right to assume that vc4, so to say, is the rpi display output pipeline driver? In theory, modifying it (removing drm_atomic_helper_wait_for_flip_done() or making it optional via an env parameter) should lead to an immediate swap not synced with VSYNC (it could introduce issues, but I doubt it, since in reality only the address where the composer reads the pixels from is switched out by the driver). And this would be the only way if one wants to remove the forced VSYNC on drmXXX calls?
Hacking the kernel for something that should be so simple is a bit overkill, but I can't see any alternative(s).
Btw, to elaborate on the "display pipeline latency" - I have written 2 test programs that allow you to measure the delay of the rpi display pipeline. Each time you press a key, the LED of the rpi is toggled on/off, and then a new solid-colored image is drawn into the frame buffer. In the no-VSYNC case, the CPU directly writes into the frame buffer that is currently being read out, which takes ~5ms. In the VSYNC case, the front and back buffers (which have been set to different colors beforehand) are swapped.
Then, by filming both the LED and the connected monitor (HDMI, LG IPS gaming monitor with 1ms response time, in theory capable of 144Hz but only running at 60Hz due to rpi limitations) one can measure the delay of the display pipeline.
My results: each sample is "first change can be observed at top of screen" -> "full screen is filled". First the frame counts, then in ms.
1) modeset_latency, CPU draws frame directly:
3:5 => 12.48ms : 20.8ms
2:5 => 8.32ms : 20.8ms
3:8 => 8.32ms : 33.28ms
2) modeset-double-buffered_latency, swap front and back VSYNCed:
7:8 => 29.12 : 33.28
4:6 => 16.64 : 24.96
6:7 => 24.96 : 29.12
Doesn't look like it works. even though I pass the flag to drmModePageFlip() I still cannot observe any tearing. This really sucks. There is literally no way to disable VSYNC on rpi when using DRM api(s). In our specific use case (low latency video decoding and display) the only option would be to create a single frame buffer, mmap it to user space (like done here: https://github.com/dvdhrm/docs/blob/master/drm-howto/modeset.c) and copy the whole raw data of the decoded frame via CPU into this mmaped buffer.
I think this is probably true. The hardware composes a scene (which is a list of planes, with source and dest rectangles, alpha, etc.) from a display list in the HVS context RAM. The pointer to this display list is latched by the hardware around the Vsync, so once per frame. You cannot make the hardware see changes to this pointer more than once per frame.
As you say, you are free to modify the contents (pixels) of a plane while it is being scanned out and you can get tearing that way (and theoretically lower latency).
The other option, of leaving the display list pointer alone, but directly modifying the display list context ram, is I believe, in undefined behaviour territory. It is not how the hardware is designed and the kernel vc4 driver will not deliberately do this.
I can confirm that "works" if you use a dumb RGBA buffer and draw into it via the CPU directly. I do not know if that would work with a SAND buffer (like required for h265 video), and I also don't know how to mmap both a drmPrime SAND buffer from the decoder as well as the composed buffer into user space for copying with the CPU.
In the end, I really want to create an h265 video player application with as low a delay as possible for the rpi. Right now I am trying to find out what would be the least amount of effort ;) 1) doing it via CPU copy (not ideal, but possible in theory) 2) modifying the kernel driver(s) 3) ??? ;)
There should be no issue with writing a SAND buffer. The decoded buffers are typically dmabufs and can be mmap-ed and accessed by the CPU. The layout in RAM likely means a linear memcpy will produce tearing in 128-pixel wide vertical columns, rather than a horizontal tear line.
But you may need to explain further why tearing video is preferable to you - in general a lot of effort is made not to tear. Is this for streaming a game over a network?
I can't understand how you can avoid copying a buffer when doing what you want. The video decoder won't be able to decode consecutive video frames into the same buffer (it needs to keep them intact as reference frames). And copying frame buffers will take time (and so latency) - certainly copying 4K video frames is likely to take significant time (and likely to make real time decode infeasible).
There should be no issue with writing a SAND buffer. The decoded buffers are typically dmabufs and can be mmap-ed and accessed by the CPU. The layout in RAM likely means a linear memcpy will produce tearing in 128-pixel wide vertical columns, rather than a horizontal tear line.
Yeah, can't find any examples to do so though ;)
But you may need to explain further why tearing video is preferable to you - in general a lot of effort is made not to tear. Is this for streaming a game over a network?
Streaming a live video feed from a drone to rpi in real time. OpenHD. Especially if you are dealing with - for example 1080p90fps or 720p120fps - video, tearing is not noticeable to the average eye, but the latency reduction is noticeable.
I can't understand how you can avoid copying a buffer when doing what you want. The video decoder won't be able to decode consecutive video frames into the same buffer (it needs to keep them intact as reference frames). And copying frame buffers will take time (and so latency) - certainly copying 4K video frames is likely to take significant time (and likely to make real time decode infeasible).
Yes, this case would involve copying the frame via CPU, which is why I don't think it's ideal. For our resolutions (1080p and 720p) this should work though.
I can't believe there is no way to disable VSYNC in Linux, so to say - surely there is one?
Well, the only way is, AFAICT, drmModePageFlip with DRM_MODE_PAGE_FLIP_ASYNC, which doesn't seem to work in this case.
The other option, of leaving the display list pointer alone, but directly modifying the display list context ram, is I believe, in undefined behaviour territory. It is not how the hardware is designed and the kernel vc4 driver will not deliberately do this.
Not sure I understand; it seems like there's some code that implements async page flips. I mean, it makes perfect sense if modifying the display list while it's being scanned out is undefined behaviour, I'm just wondering why that code is there anyway. Actually, I thought perhaps the HVS was too fast and the FIFO it's putting the pixels into was too big, so that the HVS had already put all composited pixels into the FIFO by the time you call drmModePageFlip, but actually the FIFO is only ~15000 pixels big on my Pi, so that can't be it.
I did check with the HVS hardware guy and he confirmed that altering the pointer word in the context memory during a frame won't have any effect. That only gets read at start of frame. The hardware does write the increasing address to the pointer context word at end of each line and reads it at start of next line but there's unlikely to be a safe way of altering that.
That is some really valuable information. It means it is impossible, or close to impossible, to just swap out the buffer that is currently being read out by the composer during compose. Aka the only option would literally be to copy the data into the right place. I'll have to test how the rpi performs in this regard, aka how taxing a memcpy via CPU of 1080p or 720p video data is. If it is less than 2-3ms per frame, I'd consider it a feasible option if ultra low latency is of as high a priority as it is for us. If not, the only option is to use a higher refresh rate display.
One thing I also haven't figured out is how to update drm with VSYNC if the input video has a higher refresh rate than the connected display. Since drmModeSetPlane blocks for the full 16ms, it is currently not possible to just update the framebuffer to use the most recent frame on VSYNC in this case. Aka just before VSYNC, one would probably like to fetch the most recent video frame and display that one to the user. But with drmModeSetPlane blocking, that cannot be done easily.
One thing I also haven't figured out is how to update drm with VSYNC if the input video has a higher refresh rate than the connected display.
I think most people would just use the lazy solution and show the first video frame arriving in a given vsync period. However, since you want low latency, that's a bit more complicated. I think your solution with a timer would be okay (hacky, but there's really no better way), you just need to use something more suited than drmModeSetPlane. Just use drmModePageFlip or drmModeAtomicCommit; with those two you can do non-blocking page flips.
Though, with a bit of hacking, maybe you can do triple buffering on the vc4. Since DRM_MODE_PAGE_FLIP_ASYNC doesn't do non-vsynced updates (as it should), it instead just replaces whatever fb is queued for the next vblank. So you can, whenever a new video frame arrives, call drmModePageFlip with DRM_MODE_PAGE_FLIP_ASYNC and replace the fb queued for the next vblank.
The only problem with that is, when the page flip happens, it's hard to tell which buffer is actually being shown on screen. (And you need to know that because you can't just render into the fb being shown on screen.) But maybe you can do a drmModeGetPlane() on the primary plane to find out which fb is being scanned out.
So I did some testing regarding memcpy - I think I was able to successfully mmap the buffer of the decoded frame and copy the raw data. (Still not directly into the fb; I have to figure out how to mmap the drm frame buffer first.) Copying a 1080p (1920x1080) SAND frame (3133440 bytes) takes ~6.2ms on a rpi CM4 running at 1.5GHz.
Assuming one CPU core is allocated for copying data around, this would allow for ~1080p@160fps with a memcpy-only approach. And if I can somehow get the current rasterizer position, one could even optimize the memcpy such that it copies data directly in front of the rasterizer.
You can get the current rasterizer position, this is how the driver does it: https://github.com/raspberrypi/linux/blob/6dc14c0e44ca49f59b5b2cb38b053b08afb37124/drivers/gpu/drm/vc4/vc4_crtc.c#L85
Since you can map physical memory (open /dev/mem and mmap it) you can just do the same thing the driver does. The HVS regs are at 0xFE400000. Or you can deduce it from the debug information in /sys/kernel/debug/dri/1, though that might be too slow.
Btw, a while ago someone on a discord I'm on was doing the same thing as you (for a different reason) and he said that the memcpy actually took longer if the framebuffer was active (shown on screen).
Any idea how to replace drmModeSetPlane() with drmModePageFlip()?
I've tried the following code: https://github.com/Consti10/hello_drmprime/blob/e0069b38abdc31976e1cf8b80dd255020daa5341/drmprime_out.cpp#L247
but get drmModePageFlip failed: Invalid argument
The code performs drmModeSetPlane() on the first couple of frames for testing, then switches over to drmModePageFlip()
I've also implemented (or rather hacked together) the simplest method to reduce latency with VSYNC enabled - ideally, just before VSYNC, the display thread (if you take hello_drmprime for example) is woken up and fetches the most recent decoded video frame, then updates the crtc via drmModeSetPlane(), which should take almost 0ms, since the block until the next VSYNC is ~0ms.
To implement that, the easiest way would be
1) call drmModeSetPlane() -> when it returns, we know a VSYNC has just happened and the next one should come after one display refresh interval
2) sleep for close to one display refresh interval
3) Fetch the most recent video frame, drop all (possible) old frames if input video fps > display refresh rate
4) call drmModeSetPlane() which should return almost immediately (since we are just before a VSYNC).
In reality, for a 60Hz display, I can (busy) sleep for 12ms after a drmModeSetPlane() before drmModeSetPlane() starts missing the VSYNC again.
A solution with drmModePageFlip() and ASYNC would have the same effect (and would be even better), but no idea how to replace drmModeSetPlane() with drmModePageFlip().
Any idea how to replace drmModeSetPlane() with drmModePageFlip() ?
I've tried the following code: https://github.com/Consti10/hello_drmprime/blob/e0069b38abdc31976e1cf8b80dd255020daa5341/drmprime_out.cpp#L247
but get drmModePageFlip failed: Invalid argument
Can you try without the DRM_MODE_PAGE_FLIP_EVENT? I'm not 100% sure you can combine it with DRM_MODE_PAGE_FLIP_ASYNC.
Btw, there are no defined flags for drmModeSetPlane, so the DRM_MODE_PAGE_FLIP_ASYNC | DRM_MODE_ATOMIC_NONBLOCK you're giving it as args is completely ignored.
Can you try without the DRM_MODE_PAGE_FLIP_EVENT? I'm not 100% sure you can combine it with DRM_MODE_PAGE_FLIP_ASYNC.
Yeah, I've tried all the different permutations; it doesn't work. I can't figure out though whether it is due to some weird plane != fb incompatibility or a programming mistake. This stuff is so undocumented.
I've gotten kmscube working with async pageflips, see here. It does an initial drmModeSetCrtc to set the mode and then just:
drmModePageFlip(drm.fd, drm.crtc_id, fb->fb_id, DRM_MODE_PAGE_FLIP_ASYNC, &waiting_for_flip);
When I memcpy the decoded SAND frames into the currently read-out frame buffer, I get the following artifacts:
This is 720p @ 1fps, alternating green and red frames.
Since the display is running at 60fps and the video at 1fps, I'd expect tearing while the frame is being copied (~6ms), but then ~59 frames should follow without artifacts.
But these weird artifacts don't seem to follow a pattern and don't really make much sense. Any ideas how to get rid of them / what the hardware is doing here?
Screenshot:
Describe the bug
Calling drmModeSetPlane to change the FB associated with a DRM overlay plane takes 20-30ms to execute. In my case, I'm also reflecting the Y axis of the plane, but I'm not sure that matters.
To reproduce
Unfortunately, reproducing this is not that straightforward. Roughly, it's this:
I'm working on a C implementation of this reproduction. see here
Expected behaviour
Expected would be an execution time in the microseconds, since drmModeSetPlane does not care about vertical synchronization at all. All it should be doing is change the fb id of the plane.
Actual behaviour
In reality, the call takes about 20-30ms to complete.
System
Additional context
The content of the FB I'm presenting is from GL (that's the reason for the y-reflection).