Closed: freddrueck closed this issue 1 year ago
I think you want to post this at https://github.com/canonical/lxd instead.
Looks like an Ubuntu kernel bug, and on the userspace side this is an LXD issue, not an Incus one. Incus is actually working on removing Ubuntu Fan support, per #7.
Sorry for the misreporting; while trying to figure out where to report the bug I ended up getting directed here. I'll report it over at the Canonical LXD project instead.
I think part of the problem is that https://github.com/lxc/lxd now redirects here instead of to Canonical's repo.
Required information
Issue description
When using lxd with an automatically created fan-overlay network, a kernel bug triggers which causes the system to almost completely lock up.
Steps to reproduce
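(I don't have an exact step-by-step recipe; the following is a rough sketch based on the description further down. The container image and name are placeholders, not necessarily what I used.)

# Ubuntu 22.04 host, lxd snap, kernel 6.2.0-26-generic
lxd init                      # answer as in the transcript below, accepting the Fan overlay network
lxc launch ubuntu:22.04 c1    # any container attached to the default fan bridge
# generate network traffic from the container (see the note further down)
# the RCU warning below then appears in the host's journal within a few minutes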
Information to attach
output from journalctl:
Aug 17 13:24:48 hv1 kernel: ------------[ cut here ]------------
Aug 17 13:24:48 hv1 kernel: Voluntary context switch within RCU read-side critical section!
Aug 17 13:24:48 hv1 kernel: WARNING: CPU: 5 PID: 9611 at kernel/rcu/tree_plugin.h:318 rcu_note_context_switch+0x2a7/0x2f0
Aug 17 13:24:48 hv1 kernel: Modules linked in: veth nft_masq nft_chain_nat vxlan ip6_udp_tunnel udp_tunnel dummy bridge stp llc ebtable_filter ebtables ip6table_raw ip6table_ma>
Aug 17 13:24:48 hv1 kernel: mei_me soundcore mei intel_pch_thermal mac_hid acpi_pad sch_fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon pstore_blk pstore_zone ef>
Aug 17 13:24:48 hv1 kernel: CPU: 5 PID: 9611 Comm: rsyslogd Tainted: P IO 6.2.0-26-generic #26~22.04.1-Ubuntu
Aug 17 13:24:48 hv1 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z170 Extreme6, BIOS P7.50 10/18/2018
Aug 17 13:24:48 hv1 kernel: RIP: 0010:rcu_note_context_switch+0x2a7/0x2f0
Aug 17 13:24:48 hv1 kernel: Code: 08 f0 83 44 24 fc 00 48 89 de 4c 89 f7 e8 61 c4 ff ff e9 1e fe ff ff 48 c7 c7 98 4e 53 9d c6 05 ee b7 3f 02 01 e8 09 1b f3 ff <0f> 0b e9 bd fd>
Aug 17 13:24:48 hv1 kernel: RSP: 0018:ffffae450d4df910 EFLAGS: 00010046
Aug 17 13:24:48 hv1 kernel: RAX: 0000000000000000 RBX: ffff9c9336172e40 RCX: 0000000000000000
Aug 17 13:24:48 hv1 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Aug 17 13:24:48 hv1 kernel: RBP: ffffae450d4df930 R08: 0000000000000000 R09: 0000000000000000
Aug 17 13:24:48 hv1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Aug 17 13:24:48 hv1 kernel: R13: ffff9c844e928000 R14: 0000000000000000 R15: 0000000000000000
Aug 17 13:24:48 hv1 kernel: FS: 00007f418098dc40(0000) GS:ffff9c9336140000(0000) knlGS:0000000000000000
Aug 17 13:24:48 hv1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 17 13:24:48 hv1 kernel: CR2: 00007f4180d42000 CR3: 000000017152c002 CR4: 00000000003706e0
Aug 17 13:24:48 hv1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 17 13:24:48 hv1 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 17 13:24:48 hv1 kernel: Call Trace:
Aug 17 13:24:48 hv1 kernel:
Aug 17 13:24:48 hv1 kernel: __schedule+0xbc/0x5f0
Aug 17 13:24:48 hv1 kernel: schedule+0x68/0x110
Aug 17 13:24:48 hv1 kernel: schedule_hrtimeout_range_clock+0x97/0x130
Aug 17 13:24:48 hv1 kernel: ? __pfx_hrtimer_wakeup+0x10/0x10
Aug 17 13:24:48 hv1 kernel: schedule_hrtimeout_range+0x13/0x30
Aug 17 13:24:48 hv1 kernel: do_poll.constprop.0+0x22a/0x3b0
Aug 17 13:24:48 hv1 kernel: do_sys_poll+0x166/0x260
Aug 17 13:24:48 hv1 kernel: ? sys_sendmsg+0x95/0xe0
Aug 17 13:24:48 hv1 kernel: ? __mod_lruvec_state+0x37/0x50
Aug 17 13:24:48 hv1 kernel: ? __mod_lruvec_page_state+0xa0/0x160
Aug 17 13:24:48 hv1 kernel: ? folio_memcg_unlock+0x38/0x80
Aug 17 13:24:48 hv1 kernel: ? unlock_page_memcg+0x18/0x60
Aug 17 13:24:48 hv1 kernel: ? page_add_file_rmap+0x89/0x2b0
Aug 17 13:24:48 hv1 kernel: ? __pfx_pollwake+0x10/0x10
Aug 17 13:24:48 hv1 kernel: ? __sys_sendmmsg+0x100/0x210
Aug 17 13:24:48 hv1 kernel: ? __secure_computing+0x9b/0x110
Aug 17 13:24:48 hv1 kernel: ? __seccomp_filter+0x3df/0x5e0
Aug 17 13:24:48 hv1 kernel: ? syscall_exit_to_user_mode+0x2a/0x50
Aug 17 13:24:48 hv1 kernel: ? ktime_get_ts64+0x52/0x110
Aug 17 13:24:48 hv1 kernel: __x64_sys_poll+0xb5/0x150
Aug 17 13:24:48 hv1 kernel: do_syscall_64+0x59/0x90
Aug 17 13:24:48 hv1 kernel: ? exc_page_fault+0x92/0x1b0
Aug 17 13:24:48 hv1 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
Aug 17 13:24:48 hv1 kernel: RIP: 0033:0x7f4180d32d47
Aug 17 13:24:48 hv1 kernel: Code: 00 00 00 5b 49 8b 45 10 5d 41 5c 41 5d 41 5e c3 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 07 00 00 00 0f 05 <48> 3d 00 f0 ff>
Aug 17 13:24:48 hv1 kernel: RSP: 002b:00007ffdc5692788 EFLAGS: 00000246 ORIG_RAX: 0000000000000007
Aug 17 13:24:48 hv1 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4180d32d47
Aug 17 13:24:48 hv1 kernel: RDX: 0000000000001388 RSI: 0000000000000001 RDI: 00007ffdc56928b8
Aug 17 13:24:48 hv1 kernel: RBP: 0000000034c6ac4a R08: 0000000000000005 R09: 0000000000000000
Aug 17 13:24:48 hv1 kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
Aug 17 13:24:48 hv1 kernel: R13: 00007ffdc5692880 R14: 00007ffdc56928b8 R15: 00007f4180e3c340
Aug 17 13:24:48 hv1 kernel:
Aug 17 13:24:48 hv1 kernel: ---[ end trace 0000000000000000 ]---
Here is how I ran lxd init:
root@hv1:/snap/lxd/25112# lxd init
Would you like to use LXD clustering? (yes/no) [default=no]: yes
What IP address or DNS name should be used to reach this server? [default=192.168.3.1]:
Are you joining an existing cluster? (yes/no) [default=no]:
What member name should be used to identify this server in the cluster? [default=hv1]:
Do you want to configure a new local storage pool? (yes/no) [default=yes]:
Name of the storage backend to use (btrfs, dir, lvm, zfs) [default=zfs]:
Would you like to create a new zfs dataset under rpool/lxd? (yes/no) [default=yes]:
Do you want to configure a new remote storage pool? (yes/no) [default=no]:
Would you like to connect to a MAAS server? (yes/no) [default=no]:
Would you like to configure LXD to use an existing bridge or host interface? (yes/no) [default=no]:
Would you like to create a new Fan overlay network? (yes/no) [default=yes]:
What subnet should be used as the Fan underlay? [default=auto]:
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]:
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]: yes
config:
  core.https_address: 192.168.3.1:8443
networks:
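After init, the fan network that LXD created can be inspected with the usual client commands (lxdfan0 is the name LXD normally gives the clustering fan bridge; I'm assuming that default here):

lxc network list
lxc network show lxdfan0    # shows the bridge.mode=fan configuration and the underlay/overlay subnets
ip -d link show             # the fan tunnel appears as a vxlan device on the host, matching the vxlan module in the trace above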
It's hard to be precise about exactly when the bug triggers. Just running lxd init as above will not trigger it if there are no containers running, and bringing up a container with no network connection does not trigger it either. However, once at least one container is running with an active network connection, the bug seems to trigger reliably. Within a few minutes the system is usually so unstable that it is barely usable, though at other times it remains somewhat usable for at least 20 minutes (if not more).
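For reference, something along these lines should be enough to exercise the fan/vxlan path from a container (the container name c1 and the exact commands are illustrative, not a record of what I ran):

lxc exec c1 -- ping -c 300 1.1.1.1    # sustained outbound traffic over the fan network
journalctl -k -f                      # watch the host kernel log for the RCU warning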
What does seem to happen reliably is that the system will not cleanly reboot; I can only get it to reboot using the SysRq magic key. Presumably this is related to this kernel message:
Aug 17 16:36:21 hv1 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P5846 } 243611 jiffies s: 801 root: 0x0/T
Aug 17 16:36:21 hv1 kernel: rcu: blocking rcu_node structures (internal RCU debug):
Presumably the stalled tasks cannot be terminated, and thus the system will not reboot.
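For anyone else stuck at the non-rebooting stage, the SysRq sequence I mean is the standard kernel facility (the sysctl and key combinations below are the usual kernel interfaces, nothing LXD-specific):

echo 1 | sudo tee /proc/sys/kernel/sysrq    # enable all SysRq functions (Ubuntu ships a restricted mask by default)
# then Alt+SysRq+s (sync), Alt+SysRq+u (remount read-only), Alt+SysRq+b (reboot),
# or trigger the reboot from a shell:
echo b | sudo tee /proc/sysrq-trigger       # immediate reboot without a clean shutdown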