vpsfreecz / vpsadminos

Host for Linux system containers based on NixOS, ZFS and LXC
https://vpsadminos.org
MIT License
155 stars 26 forks source link

rcu lock up: netfilter & CentOS 8 firewalld #35

Closed snajpa closed 3 years ago

snajpa commented 4 years ago

Symptoms: system gets stuck in rcu lockup after up-to-date firewalld in CentOS 8 is started.

[ 3623.424041] RIP: 0010:memcpy_erms+0x6/0x10
[ 3623.424044] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 3623.424046] RSP: 0018:ffffa455c67abab0 EFLAGS: 00010282 ORIG_RAX: ffffffffffffff13
[ 3623.424048] RAX: ffff8d193dd1f018 RBX: ffff8d193dd1f000 RCX: 0000000000000004
[ 3623.424050] RDX: 0000000000000004 RSI: ffffa455c67abac4 RDI: ffff8d193dd1f018
[ 3623.424051] RBP: ffff8d1956493c00 R08: 0000000000000eac R09: ffff8d193dd1f01c
[ 3623.424052] R10: ffff8d115f406840 R11: ffffa455c67abb68 R12: ffff8d1931340e70
[ 3623.424054] R13: ffff8d1931340000 R14: ffff8d192a7bce00 R15: ffff8d192a7bce00
[ 3623.424059]  nla_put+0x28/0x40
[ 3623.424068]  nf_tables_fill_gen_info.isra.0+0x9c/0x160 [nf_tables]
[ 3623.424075]  nf_tables_getgen+0x51/0xb0 [nf_tables]
[ 3623.424081]  nfnetlink_rcv_msg+0x152/0x260 [nfnetlink]
[ 3623.424086]  ? aa_label_sk_perm.part.0+0x101/0x150
[ 3623.424090]  ? _cond_resched+0x15/0x30
[ 3623.424093]  ? aa_capable+0x10e/0x390
[ 3623.424096]  ? nfnetlink_net_exit_batch+0x60/0x60 [nfnetlink]
[ 3623.424101]  netlink_rcv_skb+0x49/0x110
[ 3623.424104]  nfnetlink_rcv+0x69/0x149 [nfnetlink]
[ 3623.424107]  netlink_unicast+0x191/0x230
[ 3623.424110]  netlink_sendmsg+0x225/0x460
[ 3623.424114]  sock_sendmsg+0x4f/0x60
[ 3623.424116]  __sys_sendto+0xee/0x160
[ 3623.424121]  ? syscall_trace_enter+0x131/0x2c0
[ 3623.424123]  __x64_sys_sendto+0x25/0x30
[ 3623.424126]  do_syscall_64+0x4e/0x160
[ 3623.424129]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

https://paste.vpsfree.cz/FLriePDP/

snajpa commented 4 years ago

Firewalld gets stuck while trying to clean up after itself. strace for SIGHUP:

# strace -f -x -tt -s 1024 firewalld --nofork --debug
[....]
[pid   521] 18:09:54.118008 ioctl(10, FIONBIO, [1]) = 0
[pid   521] 18:09:54.118409 ioctl(10, FIOCLEX) = 0
[pid   521] 18:09:54.118707 ioctl(11, FIONBIO, [1]) = 0
[pid   521] 18:09:54.119021 ioctl(11, FIOCLEX) = 0
[pid   521] 18:09:54.119368 fstat(11, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
[pid   521] 18:09:54.119655 fcntl(11, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid   521] 18:09:54.120059 fstat(10, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
[pid   521] 18:09:54.120370 fcntl(10, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid   521] 18:09:54.120848 write(5, "\x01\x00\x00\x00\x00\x00\x00\x00", 8) = 8
[pid   521] 18:09:54.121260 poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=10, events=POLLIN}], 3, 0) = 1 ([{fd=5, revents=POLLIN}])
[pid   521] 18:09:54.121569 read(5, "\x05\x00\x00\x00\x00\x00\x00\x00", 16) = 8
[pid   521] 18:09:54.122047 poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=10, events=POLLIN}], 3, 0) = 0 (Timeout)
[pid   521] 18:09:54.122443 poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=10, events=POLLIN}], 3, 4991) = 0 (Timeout)
[pid   521] 18:09:59.119370 poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=10, events=POLLIN}], 3, -1
) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[pid   521] 18:10:53.842747 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=563, si_uid=0} ---
[pid   521] 18:10:53.843052 write(9, "\x01\x00\x00\x00\x00\x00\x00\x00", 8) = 8
[pid   552] 18:10:53.843600 <... poll resumed>) = 1 ([{fd=9, revents=POLLIN}])
[pid   521] 18:10:53.844593 rt_sigreturn({mask=[]} <unfinished ...>
[pid   552] 18:10:53.844926 read(9,  <unfinished ...>
[pid   521] 18:10:53.845235 <... rt_sigreturn resumed>) = -1 EINTR (Interrupted system call)
[pid   552] 18:10:53.845453 <... read resumed>"\x01\x00\x00\x00\x00\x00\x00\x00", 16) = 8
[pid   521] 18:10:53.845713 poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=10, events=POLLIN}], 3, -1 <unfinished ...>
[pid   552] 18:10:53.846034 write(5, "\x01\x00\x00\x00\x00\x00\x00\x00", 8 <unfinished ...>
[pid   521] 18:10:53.846327 <... poll resumed>) = 1 ([{fd=5, revents=POLLIN}])
[pid   552] 18:10:53.846553 <... write resumed>) = 8
[pid   521] 18:10:53.846812 read(5,  <unfinished ...>
[pid   552] 18:10:53.847080 poll([{fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 2, -1 <unfinished ...>
[pid   521] 18:10:53.847405 <... read resumed>"\x01\x00\x00\x00\x00\x00\x00\x00", 16) = 8
[pid   521] 18:10:53.847941 write(1, "2020-07-01 18:10:53 DEBUG1: reload()", 362020-07-01 18:10:53 DEBUG1: reload()) = 36
[pid   521] 18:10:53.848428 write(1, "\n", 1
) = 1
[pid   521] 18:10:53.848882 write(3, "2020-07-01 18:10:53 DEBUG1: reload()", 36) = 36
[pid   521] 18:10:53.849386 write(3, "\n", 1) = 1
[pid   521] 18:10:53.849873 write(1, "2020-07-01 18:10:53 DEBUG1: Setting policy to 'DROP'", 522020-07-01 18:10:53 DEBUG1: Setting policy to 'DROP') = 52
[pid   521] 18:10:53.850312 write(1, "\n", 1
) = 1
[pid   521] 18:10:53.850742 write(3, "2020-07-01 18:10:53 DEBUG1: Setting policy to 'DROP'", 52) = 52
[pid   521] 18:10:53.851118 write(3, "\n", 1) = 1
[pid   521] 18:10:53.852577 sendto(6, {{len=20, type=NFNL_SUBSYS_NFTABLES<<8|NFT_MSG_GETGEN, flags=NLM_F_REQUEST, seq=2, pid=0}, {nfgen_family=AF_UNSPEC, version=NFNETLINK_V0, res_id=htons(0)}, 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12
snajpa commented 4 years ago

This bug is also present in original CentOS 8 kernel version 4.18.0-193.6.3.el8_2.x86_64 - it's all due to wrong client config, which doesn't make sense - it shouldn't cause kernel lockups though.

snajpa commented 4 years ago

So in the end, a reproducer looks like this:

  1. need to be in user namespace (use LXC/docker/whatever)
  2. firewall-cmd won't allow you to do this, you need to manually edit your /etc/firewalld/zones/public.xml and add:
    <interface name="any_name_does_this" />
  3. restart firewalld
  4. observe your system get stuck
    Message from syslogd@localhost at Jul  8 13:40:26 ...
    kernel:watchdog: BUG: soft lockup - CPU#16 stuck for 22s! [firewalld:3731]
snajpa commented 3 years ago

fixed in 5.9+ kernels