umlaeute / v4l2loopback

v4l2-loopback device
GNU General Public License v2.0
3.61k stars 515 forks source link

kernel panic: list_del corruption. next->prev should be A, but was B #532

Closed mzihlmann closed 1 year ago

mzihlmann commented 1 year ago

Environment

note: we tried several different versions, debian package, 0.12.7 release and the above commit.

Step 3: Describe the problem:

I have not yet narrowed down to an easily reproducible setup, but i can explain what we try to do. We have a camera based android app that we try to end-to-end test. For that purpose we run an android emulator and use v4l2loopback and gstreamer to make the emulator recognize /dev/video0 as camera input. Then we use pytest and selenium to feed an image into the loopback device and check whether the app correctly reacts on the application level.

These end-to-end tests run in our CI environment. We use gitlab-runner on top of kubernetes and funnel the devices through to the containers using generic-device-plugin. This way each container sees a device /dev/video0, but they are mapped onto different virtual devices in the underlying host system. The driver is installed on the host system directly.

Now the problem is that from time-to-time (over weeks) we got a kernel panic from the driver resulting in the host server to reboot. I have now a setup where i can reproduce this in roughly 20mins intervals but I'm not sure how to proceed to get more useful information out.

Observed Results:

kernel panic on the underlying system, causing reboot of the kubernetes host.

Apr  3 15:22:57 ci5 kernel: [ 3998.077452] v4l2-loopback[1063], pid(159437):  s_fmt_out(0) 0...2764800
Apr  3 15:22:57 ci5 kernel: [ 3998.077453] v4l2-loopback[1067], pid(159437):  outFOURCC=RGB3
Apr  3 15:22:57 ci5 kernel: [ 3998.077455] v4l2-loopback[2177], pid(159437):  allocating 5529600 = 2764800x2
Apr  3 15:22:57 ci5 kernel: [ 3998.078278] v4l2-loopback[2191], pid(159437):  vmallocated 5529600 bytes
Apr  3 15:22:57 ci5 kernel: [ 3998.078417] v4l2-loopback[1383], pid(159437):  reqbufs: 1    0=2
Apr  3 15:22:57 ci5 kernel: [ 3998.078422] v4l2-loopback[1383], pid(159437):  reqbufs: 2    0=2
Apr  3 15:22:57 ci5 kernel: [ 3998.078424] v4l2-loopback[1383], pid(159437):  reqbufs: 4    0=2
Apr  3 15:22:57 ci5 kernel: [ 3998.079547] v4l2-loopback[1383], pid(159437):  reqbufs: 1    2=2
Apr  3 15:22:57 ci5 kernel: [ 3998.080805] v4l2-loopback[2161], pid(159437):  allocating buffers again: 5529600 5529600
Apr  3 15:22:58 ci5 kernel: [ 3998.931060] v4l2-loopback[1988], pid(159522):  opened dev:000000007768ab49 with image:00000000e403a550
Apr  3 15:23:01 ci5 kernel: [ 4001.471787] v4l2-loopback[1988], pid(159786):  opened dev:000000007768ab49 with image:00000000e403a550
Apr  3 15:23:03 ci5 kernel: [ 4003.208616] list_del corruption. next->prev should be ffff94d52dddc290, but was ffff94d52dddc220
Apr  3 15:23:03 ci5 kernel: [ 4003.208750] ------------[ cut here ]------------
Apr  3 15:23:03 ci5 kernel: [ 4003.208752] kernel BUG at lib/list_debug.c:62!
Apr  3 15:23:03 ci5 kernel: [ 4003.208856] invalid opcode: 0000 [#1] SMP NOPTI
Apr  3 15:23:03 ci5 kernel: [ 4003.208956] CPU: 5 PID: 156332 Comm: imagefreeze0:sr Tainted: P           OE     5.10.0-21-amd64 #1 Debian 5.10.162-1
Apr  3 15:23:03 ci5 kernel: [ 4003.209083] Hardware name: Thomas-Krenn.AG Super Server/X11SPL-F, BIOS 3.1 05/21/2019
Apr  3 15:23:03 ci5 kernel: [ 4003.209215] RIP: 0010:__list_del_entry_valid.cold+0x1d/0x69
Apr  3 15:23:03 ci5 kernel: [ 4003.209315] Code: c7 c7 58 0d d2 99 e8 df 0f ff ff 0f 0b 48 89 fe 48 c7 c7 68 0e d2 99 e8 ce 0f ff ff 0f 0b 48 c7 c7 78 0f d2 99 e8 c0 0f ff ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 38 0f d2 99 e8 ac 0f ff ff 0f 0b
Apr  3 15:23:03 ci5 kernel: [ 4003.209460] RSP: 0018:ffffaa73ab103d38 EFLAGS: 00010246
Apr  3 15:23:03 ci5 kernel: [ 4003.209571] RAX: 0000000000000054 RBX: ffffaa73ab103e50 RCX: 0000000000000000
Apr  3 15:23:03 ci5 kernel: [ 4003.209683] RDX: 0000000000000000 RSI: ffff94e43fd60900 RDI: ffff94e43fd60900
Apr  3 15:23:03 ci5 kernel: [ 4003.209793] RBP: ffff94d52dddc000 R08: 0000000000000000 R09: ffffaa73ab103b60
Apr  3 15:23:03 ci5 kernel: [ 4003.209907] R10: ffffaa73ab103b58 R11: ffff94e47ff5b600 R12: ffff94d52dddc290
Apr  3 15:23:03 ci5 kernel: [ 4003.210018] R13: 0000000000000000 R14: ffff94d5014295e0 R15: ffff94d52dddcfd8
Apr  3 15:23:03 ci5 kernel: [ 4003.210133] FS:  00007ff6ae542700(0000) GS:ffff94e43fd40000(0000) knlGS:0000000000000000
Apr  3 15:23:03 ci5 kernel: [ 4003.210259] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  3 15:23:03 ci5 kernel: [ 4003.210364] CR2: 0000000000000000 CR3: 00000002be1ba003 CR4: 00000000007726e0
Apr  3 15:23:03 ci5 kernel: [ 4003.210478] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr  3 15:23:03 ci5 kernel: [ 4003.210589] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr  3 15:23:03 ci5 kernel: [ 4003.210703] PKRU: 55555554
Apr  3 15:23:03 ci5 kernel: [ 4003.210807] Call Trace:
Apr  3 15:23:03 ci5 kernel: [ 4003.210919]  vidioc_dqbuf+0x82/0x2d0 [v4l2loopback]
Apr  3 15:23:03 ci5 kernel: [ 4003.211040]  __video_do_ioctl+0x39f/0x3e0 [videodev]
Apr  3 15:23:03 ci5 kernel: [ 4003.211162]  video_usercopy+0x18d/0x610 [videodev]
Apr  3 15:23:03 ci5 kernel: [ 4003.211272]  ? v4l_print_control+0x20/0x20 [videodev]
Apr  3 15:23:03 ci5 kernel: [ 4003.211387]  v4l2_ioctl+0x46/0x50 [videodev]
Apr  3 15:23:03 ci5 kernel: [ 4003.211495]  __x64_sys_ioctl+0x88/0xc0
Apr  3 15:23:03 ci5 kernel: [ 4003.211608]  do_syscall_64+0x30/0x80
Apr  3 15:23:03 ci5 kernel: [ 4003.211716]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
Apr  3 15:23:03 ci5 kernel: [ 4003.211821] RIP: 0033:0x7ff6b03e0cc7
Apr  3 15:23:03 ci5 kernel: [ 4003.211925] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
Apr  3 15:23:03 ci5 kernel: [ 4003.212073] RSP: 002b:00007ff6ae541638 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Apr  3 15:23:03 ci5 kernel: [ 4003.212193] RAX: ffffffffffffffda RBX: 00007ff6b07a3b40 RCX: 00007ff6b03e0cc7
Apr  3 15:23:03 ci5 kernel: [ 4003.212308] RDX: 00007ff6ae541650 RSI: 00000000c0585611 RDI: 0000000000000006
Apr  3 15:23:03 ci5 kernel: [ 4003.212419] RBP: 00007ff6ae541930 R08: 0000000000000001 R09: 0000000000000000
Apr  3 15:23:03 ci5 kernel: [ 4003.212533] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Apr  3 15:23:03 ci5 kernel: [ 4003.212644] R13: 000055c4f376abe0 R14: 000055c4f3776fe0 R15: 00007ff6a4009030
Apr  3 15:23:03 ci5 kernel: [ 4003.212762] Modules linked in: v4l2loopback(OE) nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace nfs_ssc fscache nvidia_uvm(POE) xt_tcpudp xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 xt_nat xt_CT cls_bpf algif_hash af_alg sch_ingress vxlan ip6_udp_tunnel udp_tunnel veth xfrm_user xfrm_algo xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_filter ip6table_raw ip6table_mangle ip6_tables iptable_filter iptable_raw iptable_mangle iptable_nat nf_conntrack_netlink xt_addrtype xt_set ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport dummy ip_set xt_MASQUERADE nft_chain_nat nf_nat xt_conntrack xt_comment nft_counter xt_mark nft_compat nf_tables nfnetlink overlay dm_crypt videodev mc intel_rapl_msr intel_rapl_common ib_iser rdma_cm iw_cm ib_cm skx_edac nfit ib_core libnvdimm ipmi_ssif iscsi_tcp x86_pkg_temp_thermal intel_powerclamp libiscsi_tcp libiscsi coretemp scsi_transport_iscsi snd_hda_codec_hdmi nvidia_drm(POE) snd_hda_intel kvm_intel
Apr  3 15:23:03 ci5 kernel: [ 4003.212832]  snd_intel_dspcfg drm_kms_helper soundwire_intel kvm soundwire_generic_allocation cec irqbypass snd_soc_core nvidia_modeset(POE) rapl intel_cstate snd_compress soundwire_cadence snd_hda_codec snd_hda_core snd_hwdep soundwire_bus snd_pcm snd_timer snd mei_me iTCO_wdt acpi_ipmi intel_pmc_bxt iTCO_vendor_support intel_uncore soundcore joydev pcspkr evdev mei nvidia(POE) sg watchdog ipmi_si ioatdma ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter button br_netfilter bridge stp llc ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs drm sunrpc nf_conntrack nf_defrag_ipv6 fuse nf_defrag_ipv4 configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c crc32c_generic raid0 multipath linear raid1 md_mod sd_mod dm_mod crc32_pclmul crc32c_intel ghash_clmulni_intel xhci_pci ahci nvme xhci_hcd libahci libata aesni_intel atlantic igb usbcore libaes crypto_simd cryptd nvme_core
Apr  3 15:23:03 ci5 kernel: [ 4003.213256]  glue_helper scsi_mod macsec t10_pi i2c_algo_bit crc_t10dif i2c_i801 dca crct10dif_generic ptp crct10dif_pclmul lpc_ich i2c_smbus pps_core crct10dif_common usb_common wmi [last unloaded: v4l2loopback]
Apr  3 15:23:03 ci5 kernel: [ 4003.213812] ---[ end trace 0231da2beb57b572 ]---

Message from syslogd@ci5 at Apr  3 15:23:03 ...
 kernel:[ 4003.292382] Kernel panic - not syncing: Fatal exception
Apr  3 15:23:03 ci5 kernel: [ 4003.290789] RIP: 0010:__list_del_entry_valid.cold+0x1d/0x69
Apr  3 15:23:03 ci5 kernel: [ 4003.290901] Code: c7 c7 58 0d d2 99 e8 df 0f ff ff 0f 0b 48 89 fe 48 c7 c7 68 0e d2 99 e8 ce 0f ff ff 0f 0b 48 c7 c7 78 0f d2 99 e8 c0 0f ff ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 38 0f d2 99 e8 ac 0f ff ff 0f 0b
Apr  3 15:23:03 ci5 kernel: [ 4003.291041] RSP: 0018:ffffaa73ab103d38 EFLAGS: 00010246
Apr  3 15:23:03 ci5 kernel: [ 4003.291152] RAX: 0000000000000054 RBX: ffffaa73ab103e50 RCX: 0000000000000000
Apr  3 15:23:03 ci5 kernel: [ 4003.291261] RDX: 0000000000000000 RSI: ffff94e43fd60900 RDI: ffff94e43fd60900
Apr  3 15:23:03 ci5 kernel: [ 4003.291373] RBP: ffff94d52dddc000 R08: 0000000000000000 R09: ffffaa73ab103b60
Apr  3 15:23:03 ci5 kernel: [ 4003.291484] R10: ffffaa73ab103b58 R11: ffff94e47ff5b600 R12: ffff94d52dddc290
Apr  3 15:23:03 ci5 kernel: [ 4003.291585] R13: 0000000000000000 R14: ffff94d5014295e0 R15: ffff94d52dddcfd8
Apr  3 15:23:03 ci5 kernel: [ 4003.291699] FS:  00007ff6ae542700(0000) GS:ffff94e43fd40000(0000) knlGS:0000000000000000
Apr  3 15:23:03 ci5 kernel: [ 4003.291825] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  3 15:23:03 ci5 kernel: [ 4003.291937] CR2: 0000000000000000 CR3: 00000002be1ba003 CR4: 00000000007726e0
Apr  3 15:23:03 ci5 kernel: [ 4003.292040] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr  3 15:23:03 ci5 kernel: [ 4003.292152] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr  3 15:23:03 ci5 kernel: [ 4003.292263] PKRU: 55555554
Apr  3 15:23:03 ci5 kernel: [ 4003.292382] Kernel panic - not syncing: Fatal exception

Expected Results:

no kernel panic

Relevant Code:

ill try to narrow it down further, currently its a quite large project with a lot of moving parts (its an e2e test after all :wink:). in case i can create a minimal setup ill post it here.

mzihlmann commented 1 year ago

not sure, shouldn't all calls that meddle with the queue be locked?

buffer_written, called from vidioc_qbuf (enqueue), uses spin_lock_bh, as i understand this disables interrupts and acquires the mutex lock, s.t. it is safe to use in an interrupt context.

vidioc_dqbuf (dequeue) directly modifies the list without bothering for mutex lock.

is that intended behaviour?

mzihlmann commented 1 year ago

hmmm digging into history it looks like the lock was never intended to lock the list, instead it was intended to lock the read stats if multiple consumers are present. hence its only applied to the reading. there is no protection of the buffer list in the design, but linux's list is not threadsafe. that sounds like a possible root-cause for the behaviour above?

i'm leaning a bit out of a window here. after i applied a mutex lock to the dequeueing as well, so far i have not yet seen kernel panic :crossed_fingers: . but to positively test it i would need to fix all possible concurrency issues.... :grimacing:

mzihlmann commented 1 year ago

just fyi:

this is the change i applied here

    case V4L2_BUF_TYPE_VIDEO_OUTPUT:
+       spin_lock_bh(&dev->lock);
        b = list_entry(dev->outbufs_list.prev, struct v4l2l_buffer,
                   list_head);
        list_move_tail(&b->list_head, &dev->outbufs_list);
+       spin_unlock_bh(&dev->lock);
        dprintkrw("output DQBUF index: %d\n", b->buffer.index);
        unset_flags(b);
        *buf = b->buffer;
        buf->type = V4L2_BUF_TYPE_VIDEO_OUTPUT;

but i think we should maybe spend a second mutex lock to clearly distinguish locking the stats and locking the list.

mzihlmann commented 1 year ago

apparently this fixes the kernel panic by simply freezing the output :facepalm: ill try further to get it working properly

mzihlmann commented 1 year ago

nope, turns out this fix should be fine, output is down to a single frame not because of this change, but because of some change in v0.12.7...HEAD. i did not further investigate yet which change causes this other issue, instead i rebased the fix proposed above onto v0.12.7 instead. This now works fine for me.

umlaeute commented 1 year ago

thanks for investigating.

i agree that re-using the stats mutex just because it's already there, is not a good idea though...

umlaeute commented 1 year ago

fixed with d9277249064a04cee6e3784573cf36162e09bf3e & 6b9ac6ca106eed2ee01fb2b66960c1ea10a10f40