Closed by ardera 4 years ago
What's your use case?
From your firmware log you appear to be loading the vc4-fkms-v3d overlay twice, which could be confusing (but not the cause of the delay).
Okay, I'll change it anyway to see if it fixes something.
What's your use case?
I'm using omxplayer to implement video playback for flutter-pi. I'm using DRM planes to draw stuff behind and in front of omxplayer.
I've added logging of the timing around the drmModeSetPlane call in drm_mmal. I'm seeing either a couple of ms, or 15ms. (My display is running at 60Hz). I've inserted a 1s delay between each drmModeSetPlane call so that it is obvious what each update is doing.
Looking at the firmware logs I'm not seeing any big delays on any of the updates. It says the update via the mailbox call is completed in under 1ms. The kernel logging shows entries similar to:
[ 4423.438234] [drm:drm_atomic_state_init [drm]] Allocated atomic state 7b73c98d
[ 4423.438426] [drm:drm_atomic_get_plane_state [drm]] Added [PLANE:38:plane-1] bc2c41fb state to 7b73c98d
[ 4423.438616] [drm:drm_atomic_get_crtc_state [drm]] Added [CRTC:52:crtc-0] d1ac5951 state to 7b73c98d
[ 4423.438801] [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:67] for [PLANE:38:plane-1] state bc2c41fb
[ 4423.438986] [drm:drm_atomic_check_only [drm]] checking 7b73c98d
[ 4423.439084] [drm:vc4_plane_atomic_check [vc4]] [PLANE:38:plane-1] plane update 1920x1080@3 +dst(0,0, 1360,768) +src(0,0, 125829120,70778880) 0xe1100000/e12fe000/e137d800/1920, alpha 65535 zpos 1
[ 4423.439161] [drm:vc4_crtc_atomic_check [vc4]] [CRTC:52] crtc_atomic_check.
[ 4423.439348] [drm:drm_atomic_commit [drm]] committing 7b73c98d
[ 4423.439446] [drm:vc4_plane_set_blank [vc4]] [PLANE:38:plane-1] overlay plane unblank
[ 4423.439828] [drm:vc4_crtc_atomic_flush [vc4]] [CRTC:52] crtc_atomic_flush.
[ 4423.448128] [drm:drm_atomic_state_default_clear [drm]] Clearing atomic state 7b73c98d
[ 4423.448320] [drm:__drm_atomic_state_free [drm]] Freeing atomic state 7b73c98d
So we do have a delay between atomic_flush and state clear and free.
On 4.19 it looks like there is potentially a delay for vsync during commit - https://github.com/raspberrypi/linux/blob/rpi-4.19.y/drivers/gpu/drm/drm_plane_helper.c#L412
I'm on the 5.4 kernel. I haven't checked whether we go through that path, but drm_atomic_helper_commit_tail looks like it will block. Actually I think we use vc4_atomic_commit. That calls vc4_atomic_complete_commit, which calls drm_atomic_helper_wait_for_flip_done.
The DRM API docs make it a nightmare to find out what the expected behaviour actually is, but I think drmModeSetPlane will block for vsync when sitting on top of an atomic driver. Unfortunately I don't immediately see API calls that allow a non-blocking update. I suspect that libdrm may not allow it, and the code in the igt repo (Intel Graphics Tests) may be more suitable. Sorry, I'm feeling my way around this lot as much as anyone.
Just edited my top comment to include a reduced test case.
I consistently get ~14.7ms there, so I'm able to reproduce the vsync delay. However, 14.7ms is still far from the delays I was experiencing when running my main application. Maybe omxplayer somehow tries to get the display's refresh rate to match that of the video, though I disabled all the options regarding that feature. If I find that it's a kernel problem, I'll open a new issue.
The DRM API docs are a nightmare to find out what is actually the expected behaviour, but I think that drmModeSetPlane will block for vsync when sitting on top of an atomic driver.
Yeah, I don't know either. I heard something about drmModeSetPlane et al. being "unsynchronized nightmares" somewhere (that's why I assumed it was not vsynced), but nothing specific about atomic was mentioned. In contrast, this nvidia doc says the nvidia driver, too, is waiting for vblank.
So I guess I'll have to switch to atomic then.
EDIT: Seems like, in my main application, drmModeSetPlane sometimes misses a vblank. So the drmModeSetPlane call executes for about 12ms, a page flip event for my primary plane occurs, and 14.5ms after the page flip the call to drmModeSetPlane returns. I'll try to build a test case for that.
(I'm drawing to two planes simultaneously, and the page flip event is triggered by the drmModePageFlip for the primary plane.)
Updated my reproduction code to reproduce the issue when concurrently drawing to two planes.
I now consistently get a 29.5ms delay with drmModeSetPlane with that code. Expected would be 14.5ms, as there's no reason for drmModeSetPlane to wait for two vblanks (that I know of, at least).
Were you able to find a solution to this? I'm running into the same issue.
Are you using the kms or fkms driver? You should be on kms (default with RpiOS bullseye).
I'm on KMS
I believe drmModeAtomicCommit is expected to block for up to a vsync. Look into the flags DRM_MODE_ATOMIC_NONBLOCK and DRM_MODE_PAGE_FLIP_ASYNC if you don't want this behaviour.
I'm experimenting with updating 2 planes in a single vblank. If I call drmModeAtomicCommit with DRM_MODE_ATOMIC_NONBLOCK twice during the same vblank, the call fails with EBUSY.
I've also tried calling drmModeSetPlane and drmModeAtomicCommit from different threads, though then each call blocks for 2 vblanks instead of 1.
Hard to comment without seeing exactly what you are doing. I can say that kodi runs on kms, and displays a video and a gui plane and can render 60fps video, so updating 2 planes in a single vsync is possible.
I'm experimenting with updating 2 planes in a single vblank. If I call drmModeAtomicCommit with DRM_MODE_ATOMIC_NONBLOCK twice during the same vblank, the call fails with EBUSY. I've also tried calling drmModeSetPlane and drmModeAtomicCommit from different threads, though each call will block for 2 vblanks instead of 1.
I think that's expected; if you want to update 2 planes in a single vblank, you have to put them both in the same atomic request.
That drmModeSetPlane takes that long to execute is expected as well (AFAICT), at least since atomic KMS has been around.
EDIT: Though I absolutely agree, KMS synchronization could be better documented. But that's not really a problem of the Raspberry Pi kernel.
Thank you! This is exactly what I needed.
Has anybody here had success using drmModeSetPlane() without VSYNC?
I managed to write into the frame buffer directly using CPU & mmap: https://github.com/Consti10/hello_drmprime/blob/master/drm-howto/modeset_latency.cpp
But all methods exposed by drm that swap the underlying frame buffer seem to force VSYNC.
@Consti10 DRM_MODE_PAGE_FLIP_ASYNC is the flag you're looking for. If you put that as an argument to drmModeAtomicCommit, it'll change the fb without waiting for vsync (see also popcornmix's reply above).
Do you know how to add this flag, for example, to drmModeSetCrtc()?
I've tried this snippet, but it doesn't have any effect on rpi. https://github.com/Consti10/hello_drmprime/blob/master/drm-howto/modeset-double-buffered_latency.cpp#L568
Aka drmModeSetCrtc() takes: Avg SwapBuffers: min=564.344971us max=21.431000ms avg=8.972000ms. One can nicely see here how it blocks for 8.9ms on average (half a vsync interval).
There's no way to provide this flag to drmModeSetCrtc, even the underlying ioctl doesn't support it. You can add it to drmModePageFlip though[^1]. In this case, just replace DRM_MODE_PAGE_FLIP_EVENT with DRM_MODE_PAGE_FLIP_EVENT | DRM_MODE_PAGE_FLIP_ASYNC.
[^1]: Not entirely sure. It should work, but the introduction of atomic modesetting changed some of the semantics (because legacy modesetting (== drmModeSetCrtc, drmModePageFlip, drmModeSetPlane, etc.) is now internally emulated via atomic modesetting). If this doesn't work, you can try using atomic modesetting.
You can add it to drmModePageFlip though. In this case, just replace DRM_MODE_PAGE_FLIP_EVENT with DRM_MODE_PAGE_FLIP_EVENT | DRM_MODE_PAGE_FLIP_ASYNC.
Doesn't look like it works. Even though I pass the flag to drmModePageFlip(), I still cannot observe any tearing.
This really sucks. There is literally no way to disable VSYNC on the rpi when using the DRM APIs.
In our specific use case (low latency video decoding and display) the only option would be to create a single frame buffer, mmap it to user space (like done here: https://github.com/dvdhrm/docs/blob/master/drm-howto/modeset.c) and copy the whole raw data of the decoded frame via CPU into this mmapped buffer.
It should work. The driver has the async page flip capability, and I looked a bit into the source; there's code in place to make async commits work (e.g. this is what commits the async update). So maybe there's a way to get it working.
Other than that, is up to 16ms of delay really too much for low latency video playback?
After a bit more investigation: yes, DRM_MODE_PAGE_FLIP_ASYNC has an effect. However, not the wanted one. With the flag not set, I get: Avg PageFlipRequest: min=1.602000ms max=3.708000ms avg=1.757000ms; Avg FrameDelta: min=13.849000ms max=19.549000ms avg=16.863001ms.
Aka the execution of drmModePageFlip() (which requests a page flip that is then handled later) takes 1.7ms on average, and the application is running at 60fps.
With the flag set, I get: Avg FrameDelta: min=748.119019us max=904.747986us avg=797.583984us; Avg PageFlipRequest: min=736.416016us max=852.617981us avg=780.658020us.
BUT no tearing. So I assume the driver is still internally syncing the swaps to the VSYNC, but doesn't block the calls anymore. This doesn't really help though ;/ we really need no VSYNC aka tearing (yes, we want tearing haha ;) ) for low latency.
Here is the code if anybody is interested: https://github.com/Consti10/hello_drmprime/blob/fad09ccfddcae3e29848e23cd700fbe2ab3714a6/drm-howto/modeset-vsync2.cpp#L690
What really sucks here is that, to me, it looks as if some wrong changes introduced all these issues. For example, the drm-kms documentation reads:
A call to drmModeSetCrtc(3) is executed immediately and forces the CRTC to use the new scanout buffer.
Aka it should just swap out the scanout buffer. But at least with the current rpi kernel, this is not true anymore; it waits for a VSYNC.
The documentation from "drm-howto" is therefore also outdated. For example, the modeset-double-buffered example is supposed to be double buffering without VSYNC, but due to the above issue it is actually double buffered with VSYNC.
AFAICT, this is not a bug, just outdated documentation. Even in the upstream kernel source, it seems like every drmModeSetCrtc implicitly waits for vblank. This seems to be the behaviour since atomic kernel modesetting was introduced ~4-5 years ago.
So I assume the driver is still internal syncing the swaps to the VSYNC, but doesn't block the callbacks anymore.
I guess some component of it still is, yeah. I also tried and couldn't get tearing.
I can't believe there is no way to disable VSYNC in Linux, so to say - surely there is one? I guess with OpenGL, for example, you can just get the framebuffer handle (the data area that is read out) and draw into it just like drawing into it with the CPU, but for double buffering + no VSYNC you'd then need drm again.
Btw, my assumption of how (for example) drmModeSetPlane is implemented on the Raspberry Pi 4 (following the call chain):
1) https://github.com/grate-driver/libdrm/blob/master/xf86drmMode.c#L988
2) https://github.com/raspberrypi/linux/blob/aeaa2460db088fb2c97ae56dec6d7d0058c68294/drivers/gpu/drm/drm_ioctl.c#L670
3) https://github.com/raspberrypi/linux/blob/rpi-5.10.y/drivers/gpu/drm/drm_plane.c#L800
4) https://github.com/raspberrypi/linux/blob/rpi-5.10.y/drivers/gpu/drm/drm_plane.c#L771
What happens here: https://github.com/raspberrypi/linux/blob/rpi-5.10.y/drivers/gpu/drm/drm_plane.c#L786
And where the actual VSYNC-ing happens I do not know.
Actually I think we use vc4_atomic_commit. That calls vc4_atomic_complete_commit, which calls drm_atomic_helper_wait_for_flip_done.
Am I right to assume that vc4, so to say, is the rpi display output pipeline driver? In theory, modifying it (removing drm_atomic_helper_wait_for_flip_done() or making it optional via an env parameter) should lead to an immediate swap not synced with VSYNC (it could introduce issues, but I doubt it, since in reality only the address where the composer reads the pixels from is switched out by the driver). And this would be the only way if one wants to remove the forced VSYNC on drmXXX calls?
Hacking the kernel for something that should be so simple is a bit overkill, but I can't see any alternative(s).
Btw, to elaborate on the "display pipeline latency" - I have written 2 test programs that allow you to measure the delay of the rpi display pipeline. Each time you press a key, the LED of the rpi is toggled on/off, and then a new solid-colored image is drawn into the frame buffer. In the no-VSYNC case, the CPU directly writes into the frame buffer that is currently being read out, which takes ~5ms. In the VSYNC case, the front and back buffers (which have been set to different colors beforehand) are swapped.
Then, by filming both the LED and the connected monitor (HDMI, LG IPS gaming monitor with 1ms response time, in theory capable of 144Hz but only running at 60Hz due to rpi limitations) one can measure the delay of the display pipeline.
My results: each sample is "first change can be observed at top of screen" -> "full screen is filled". First the frame counts, then in ms.
1) modeset_latency, CPU draws frame directly:
3:5 => 12.48ms : 20.8ms
2:5 => 8.32ms : 20.8ms
3:8 => 8.32ms : 33.28ms
2) modeset-double-buffered_latency, swap front and back VSYNCed:
7:8 => 29.12 : 33.28
4:6 => 16.64 : 24.96
6:7 => 24.96 : 29.12
Doesn't look like it works. even though I pass the flag to drmModePageFlip() I still cannot observe any tearing. This really sucks. There is literally no way to disable VSYNC on rpi when using DRM api(s). In our specific use case (low latency video decoding and display) the only option would be to create a single frame buffer, mmap it to user space (like done here: https://github.com/dvdhrm/docs/blob/master/drm-howto/modeset.c) and copy the whole raw data of the decoded frame via CPU into this mmaped buffer.
I think this is probably true. The hardware composes a scene (which is a list of planes, with source and dest rectangles, alpha, etc.) from a display list in the HVS context RAM. The pointer to this display list is latched by the hardware around the Vsync, so once per frame. You cannot make the hardware see changes to this pointer more than once per frame.
As you say, you are free to modify the contents (pixels) of a plane while it is being scanned out and you can get tearing that way (and theoretically lower latency).
The other option, of leaving the display list pointer alone, but directly modifying the display list context ram, is I believe, in undefined behaviour territory. It is not how the hardware is designed and the kernel vc4 driver will not deliberately do this.
I can confirm that "works" if you use a dumb RGBA buffer and draw into it via the CPU directly. I do not know if that would work with a SAND buffer (like required for h265 video), and I also don't know how to mmap both a drmPrime SAND buffer from the decoder as well as the composed buffer into user space for copying with the CPU.
In the end, I really want to create an h265 video player application with as low a delay as possible for the rpi. Right now I am trying to find out what would be the least amount of effort ;) 1) doing it via CPU copy (not ideal, but possible in theory) 2) modifying the kernel driver(s) 3) ??? ;)
There should be no issue with writing a SAND buffer. The decoded buffers are typically dmabufs and can be mmap-ed and accessed by the CPU. The layout in RAM likely means a linear memcpy will produce tearing in 128-pixel wide vertical columns, rather than a horizontal tear line.
But you may need to explain further why tearing video is preferable to you - in general a lot of effort is made not to tear. Is this for streaming a game over a network?
I can't understand how you can avoid copying a buffer when doing what you want. The video decoder won't be able to decode consecutive video frames into the same buffer (it needs to keep them intact as reference frames). And copying frame buffers will take time (and so latency) - certainly copying 4K video frames is likely to take significant time (and likely to make real time decode infeasible).
There should be no issue with writing a SAND buffer. The decoded buffers are typically dmabufs and can be mmap-ed and accessed by the CPU. The layout in RAM likely means a linear memcpy will produce tearing in 128-pixel wide vertical columns, rather than a horizontal tear line.
Yeah, can't find any examples to do so though ;)
But you may need to explain further why tearing video is preferable to you - in general a lot of effort is made not to tear. Is this for streaming a game over a network?
Streaming a live video feed from a drone to rpi in real time. OpenHD. Especially if you are dealing with - for example 1080p90fps or 720p120fps - video, tearing is not noticeable to the average eye, but the latency reduction is noticeable.
I can't understand how you can avoid copying a buffer when doing what you want. The video decoder won't be able to decode consecutive video frames into the same buffer (it needs to keep them intact as reference frames). And copying frame buffers will take time (and so latency) - certainly copying 4K video frames is likely to take significant time (and likely to make real time decode infeasible).
Yes, this case would involve copying the frame via CPU, which is why I don't think it's ideal. For our resolutions (1080p and 720p) this should work though.
I can't believe there is no way to disable VSYNC in Linux, so to say - surely there is one?
Well, the only way is, AFAICT, drmModePageFlip with DRM_MODE_PAGE_FLIP_ASYNC, which doesn't seem to work in this case.
The other option, of leaving the display list pointer alone, but directly modifying the display list context ram, is I believe, in undefined behaviour territory. It is not how the hardware is designed and the kernel vc4 driver will not deliberately do this.
Not sure I understand; it seems like there's some code that implements async page flips. I mean, it makes perfect sense if modifying the display list while it's being scanned out is undefined behaviour, I'm just wondering why that code is there anyway. Actually, I thought perhaps the HVS was too fast and the FIFO it's putting the pixels into was too big, so that the HVS had already put all composited pixels into the FIFO by the time you call drmModePageFlip, but actually the FIFO is only ~15000 pixels big on my Pi, so that can't be it.
I did check with the HVS hardware guy and he confirmed that altering the pointer word in the context memory during a frame won't have any effect. That only gets read at start of frame. The hardware does write the increasing address to the pointer context word at end of each line and reads it at start of next line but there's unlikely to be a safe way of altering that.
That is some really valuable information. It means it is impossible, or close to impossible, to just swap out the buffer that is currently being read out by the composer during compose. Aka the only option would literally be to copy the data into the right place. I'll have to test how the rpi performs in this regard, aka how taxing a memcpy via CPU of 1080p or 720p video data is. If it is less than 2-3ms per frame, I'd consider it a feasible option if ultra low latency is of as high a priority as it is for us. If not, the only option is to use a higher refresh rate display.
One thing I also haven't figured out is how to update drm with VSYNC if the input video has a higher refresh rate than the connected display. Since drmModeSetPlane blocks for the full 16ms, it is currently not possible to just update the framebuffer to use the most recent frame on VSYNC in this case. Aka just before VSYNC, one would probably like to fetch the most recent video frame and display that one to the user. But with drmModeSetPlane blocking, that cannot be done easily.
One thing I also haven't figured out is how to update drm with VSYNC if the input video has a higher refresh rate than the connected display.
I think most people would just use the lazy solution and show the first video frame arriving in a given vsync period. However, since you want low latency, that's a bit more complicated. I think your solution with a timer would be okay (hacky, but there's really no better way), you just need to use something more suited than drmModeSetPlane. Just use drmModePageFlip or drmModeAtomicCommit; with those two you can do non-blocking page flips.
Though, with a bit of hacking, maybe you can do triple buffering on the vc4. Since DRM_MODE_PAGE_FLIP_ASYNC doesn't do non-vsynced updates (as it should), it instead just replaces whatever fb is queued for the next vblank. So you can, whenever a new video frame arrives, call drmModePageFlip with DRM_MODE_PAGE_FLIP_ASYNC and replace the fb queued for the next vblank.
The only problem with that is, when the page flip happens, it's hard to tell which buffer is actually being shown on screen. (And you need to know that because you can't just render into the fb being shown on screen.) But maybe you can do a drmModeGetPlane() on the primary plane to find out which fb is being scanned out.
So I did some testing regarding memcpy - I think I was able to successfully mmap the buffer of the decoded frame and copy the raw data. (Still not directly into the fb; I have to figure out how to mmap the drm frame buffer first.) Copying a 1080p (1920x1080) SAND frame (3133440 bytes) takes ~6.2ms on a rpi CM4 running at 1.5GHz.
Assuming one CPU core is allocated for copying data around, this would allow for ~1080p@160fps with a memcpy-only approach. And if I can somehow get the current rasterizer position, one could even optimize the memcpy such that it copies data directly in front of the rasterizer.
You can get the current rasterizer position, this is how the driver does it: https://github.com/raspberrypi/linux/blob/6dc14c0e44ca49f59b5b2cb38b053b08afb37124/drivers/gpu/drm/vc4/vc4_crtc.c#L85
Since you can map physical memory (open /dev/mem and mmap it) you can just do the same thing the driver does. The HVS regs are at 0xFE400000. Or you can deduce it from the debug information in /sys/kernel/debug/dri/1, though that might be too slow.
Btw, a while ago someone on a discord I'm on was doing the same thing as you (for a different reason) and he said that the memcpy actually took longer if the framebuffer was active (shown on screen).
Any idea how to replace drmModeSetPlane() with drmModePageFlip()?
I've tried the following code: https://github.com/Consti10/hello_drmprime/blob/e0069b38abdc31976e1cf8b80dd255020daa5341/drmprime_out.cpp#L247
but get drmModePageFlip failed: Invalid argument
The code performs drmModeSetPlane() on the first couple of frames for testing, then switches over to drmModePageFlip()
I've also implemented (or rather hacked together) the simplest method to reduce latency with VSYNC enabled - ideally, just before VSYNC, the display thread (if you take hello_drmprime for example) is woken up and fetches the most recent decoded video frame, then updates the crtc via drmModeSetPlane(), which should take almost 0ms, since the block until the next VSYNC is ~0ms.
To implement that, the easiest way would be
1) call drmModeSetPlane() -> when it returns, we know a VSYNC has just happened and the next one should come after one display refresh interval
2) sleep for close to one display refresh interval
3) Fetch the most recent video frame, drop all (possible) old frames if input video fps > display refresh rate
4) call drmModeSetPlane() which should return almost immediately (since we are just before a VSYNC).
In reality, for a 60Hz display, I can (busy) sleep for 12ms after a drmModeSetPlane() before drmModeSetPlane() starts missing the VSYNC again.
A solution with drmModePageFlip() and ASYNC would have the same effect (and would be even better), but no idea how to replace drmModeSetPlane() with drmModePageFlip().
Any idea how to replace drmModeSetPlane() with drmModePageFlip() ?
I've tried the following code: https://github.com/Consti10/hello_drmprime/blob/e0069b38abdc31976e1cf8b80dd255020daa5341/drmprime_out.cpp#L247
but get drmModePageFlip failed: Invalid argument
Can you try without the DRM_MODE_PAGE_FLIP_EVENT? I'm not 100% sure you can combine it with DRM_MODE_PAGE_FLIP_ASYNC.
Btw, there are no defined flags for drmModeSetPlane, so the DRM_MODE_PAGE_FLIP_ASYNC | DRM_MODE_ATOMIC_NONBLOCK you're giving it as args is completely ignored.
Can you try without the DRM_MODE_PAGE_FLIP_EVENT? I'm not 100% sure you can combine it with DRM_MODE_PAGE_FLIP_ASYNC.
Yeah, I've tried all the different permutations; it doesn't work. I can't figure out though whether it is due to some weird plane != fb incompatibility or a programming mistake. This stuff is so undocumented.
I've gotten kmscube working with async pageflips, see here. It does an initial drmModeSetCrtc to set the mode and then just:
drmModePageFlip(drm.fd, drm.crtc_id, fb->fb_id, DRM_MODE_PAGE_FLIP_ASYNC, &waiting_for_flip);
When I memcpy the decoded SAND frames into the currently read-out frame buffer, I get the following artifacts:
This is 720p @ 1fps, alternating green and red frames.
Since the display is running at 60fps and the video at 1fps, I'd expect tearing while the frame is being copied (~6ms), but then ~59 frames should follow without artifacts.
But these weird artifacts don't seem to follow a pattern and don't really make much sense. Any ideas how to get rid of them / what the hardware is doing here?
Screenshot:
Describe the bug
Calling drmModeSetPlane to change the FB associated with a DRM overlay plane takes 20-30ms to execute. In my case, I'm also reflecting the Y axis of the plane, but I'm not sure that matters.
To reproduce
Unfortunately, reproducing this is not that straightforward. Roughly, it's this:
I'm working on a C implementation of this reproduction. see here
Expected behaviour
Expected would be an execution time in the microseconds, since drmModeSetPlane does not care about vertical synchronization at all. All it should be doing is change the fb id of the plane.
Actual behaviour
In reality, the call takes about 20-30ms to complete.
System
Additional context
The content of the FB I'm presenting is from GL (that's the reason for the y-reflection).