vergoh / vnstat

vnStat - a network traffic monitor for Linux and BSD
GNU General Public License v2.0
1.36k stars 120 forks source link

vnstatd 2.9, kernel "NULL pointer dereference" #236

Closed abelbeck closed 1 year ago

abelbeck commented 1 year ago

vnstat is a standard part of our AstLinux Project, cross-compiled from source.

vnstat is a real gem, works perfectly ... except (twice now that I noticed) on a single core Linode (Custom ISO) instance vnstatd causes a kernel "NULL pointer dereference"

The only interface vnstat monitors is one virtio_net device, vnstat only returned errors after Jul 27 06:00:00 (below).

A system reboot was required to fix vnstat.

I have not ever seen this issue anywhere else, be it a VM or bare-metal box.

Basic debugging below:

linode ~ # vnstatd --version
vnStat daemon 2.9 by Teemu Toivola <tst at iki dot fi>

linode ~ # uname -a
Linux linode 4.19.248-astlinux #1 SMP PREEMPT Wed Jul 6 10:46:27 CDT 2022 x86_64 GNU/Linux

linode ~ # lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       40 bits physical, 48 bits virtual
CPU(s):              1
On-line CPU(s) list: 0
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7601 32-Core Processor
Stepping:            2
CPU MHz:             2199.998
BogoMIPS:            4399.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            16384K
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 virt_ssbd arat arch_capabilities

-- kernel snip --
Jul 27 06:00:00 linode user.alert kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
Jul 27 06:00:00 linode user.info kernel: PGD 24745067 P4D 24745067 PUD 246ca067 PMD 0 
Jul 27 06:00:00 linode user.warn kernel: Oops: 0000 [#1] PREEMPT SMP
Jul 27 06:00:00 linode user.warn kernel: CPU: 0 PID: 1380 Comm: vnstatd Not tainted 4.19.248-astlinux #1
Jul 27 06:00:00 linode user.warn kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
Jul 27 06:00:00 linode user.warn kernel: RIP: 0010:blk_rq_merge_ok+0xa8/0x128
Jul 27 06:00:00 linode user.warn kernel: Code: 98 00 00 00 0f 85 9d 00 00 00 55 53 48 89 f5 48 89 fb e8 9e dc ff ff 84 c0 74 07 31 c0 e9 82 00 00 00 48 8b 47 30 48 8b 76 68 <48> 39 70 68 75 eb f7 40 10 00 10 00 01 0f 95 c2 81 e1 00 10 00 01
Jul 27 06:00:00 linode user.warn kernel: RSP: 0018:ffffc90000277950 EFLAGS: 00010246
Jul 27 06:00:00 linode user.warn kernel: RAX: 0000000000000000 RBX: ffff88803c437500 RCX: 0000000000000801
Jul 27 06:00:00 linode user.warn kernel: RDX: ffff88803cb3b530 RSI: 0000000000000000 RDI: ffff88803c437500
Jul 27 06:00:00 linode user.warn kernel: RBP: ffff88803b8bd000 R08: 00000000ffffff00 R09: 0000000000000008
Jul 27 06:00:00 linode user.warn kernel: R10: ffffc90000277a78 R11: 0000000000001000 R12: ffff88803b8bd000
Jul 27 06:00:00 linode user.warn kernel: R13: ffffc900002779a8 R14: ffffc90000277d30 R15: ffffc900002779b0
Jul 27 06:00:00 linode user.warn kernel: FS:  00007fd314ea7500(0000) GS:ffff88803fa00000(0000) knlGS:0000000000000000
Jul 27 06:00:00 linode user.warn kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 27 06:00:00 linode user.warn kernel: CR2: 0000000000000068 CR3: 0000000024744000 CR4: 00000000003406f0
Jul 27 06:00:00 linode user.warn kernel: Call Trace:
Jul 27 06:00:00 linode user.warn kernel:  ? blk_attempt_plug_merge+0x7b/0xe5
Jul 27 06:00:00 linode user.warn kernel:  ? blk_mq_make_request+0xb7/0x3b6
Jul 27 06:00:00 linode user.warn kernel:  ? generic_make_request+0x210/0x319
Jul 27 06:00:00 linode user.warn kernel:  ? submit_bio+0xe3/0x11c
Jul 27 06:00:00 linode user.warn kernel:  ? submit_bio+0xe3/0x11c
Jul 27 06:00:00 linode user.warn kernel:  ? guard_bio_eod+0x4f/0x120
Jul 27 06:00:00 linode user.warn kernel:  ? mpage_bio_submit+0x21/0x25
Jul 27 06:00:00 linode user.warn kernel:  ? __mpage_writepage+0x3b0/0x5d5
Jul 27 06:00:00 linode user.warn kernel:  ? radix_tree_next_chunk+0xcf/0x24e
Jul 27 06:00:00 linode user.warn kernel:  ? page_mkclean+0x68/0xa9
Jul 27 06:00:00 linode user.warn kernel:  ? page_referenced_one+0xd1/0xd1
Jul 27 06:00:00 linode user.warn kernel:  ? clear_page_dirty_for_io+0x248/0x275
Jul 27 06:00:00 linode user.warn kernel:  ? write_cache_pages+0x214/0x2d6
Jul 27 06:00:00 linode user.warn kernel:  ? clean_buffers+0x58/0x58
Jul 27 06:00:00 linode user.warn kernel:  ? __generic_write_end+0x36/0x85
Jul 27 06:00:00 linode user.warn kernel:  ? ext2_get_blocks.constprop.3+0x7fd/0x7fd
Jul 27 06:00:00 linode user.warn kernel:  ? mpage_writepages+0x83/0xd7
Jul 27 06:00:00 linode user.warn kernel:  ? ext2_get_blocks.constprop.3+0x7fd/0x7fd
Jul 27 06:00:00 linode user.warn kernel:  ? do_writepages+0x25/0x48
Jul 27 06:00:00 linode user.warn kernel:  ? do_writepages+0x25/0x48
Jul 27 06:00:00 linode user.warn kernel:  ? __filemap_fdatawrite_range+0xcc/0xfa
Jul 27 06:00:00 linode user.warn kernel:  ? file_write_and_wait_range+0x2e/0x5b
Jul 27 06:00:00 linode user.warn kernel:  ? __generic_file_fsync+0x1d/0x86
Jul 27 06:00:00 linode user.warn kernel:  ? generic_file_fsync+0x10/0x2e
Jul 27 06:00:00 linode user.warn kernel:  ? ext2_fsync+0x16/0x3c
Jul 27 06:00:00 linode user.warn kernel:  ? do_fsync+0x29/0x42
Jul 27 06:00:00 linode user.warn kernel:  ? __x64_sys_fdatasync+0xe/0x11
Jul 27 06:00:00 linode user.warn kernel:  ? do_syscall_64+0x155/0x230
Jul 27 06:00:00 linode user.warn kernel:  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 27 06:00:00 linode user.warn kernel: Modules linked in: wireguard ip6_udp_tunnel udp_tunnel sch_cake xt_REDIRECT xt_NETMAP nf_log_ipv4 xt_CT xt_helper nf_nat_ftp nf_conntrack_ftp nf_log_ipv6 nf_log_common xt_tcpudp xt_hl ip6t_rt ip6table_nat nf_nat_ipv6 ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_TCPMSS xt_LOG ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip6table_raw iptable_raw ip6table_mangle iptable_mangle xt_multiport xt_state xt_limit xt_conntrack nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ip
Jul 27 06:00:00 linode user.warn kernel: CR2: 0000000000000068
Jul 27 06:00:00 linode user.warn kernel: ---[ end trace 656e6438ebcb302c ]---
Jul 27 06:00:00 linode user.warn kernel: RIP: 0010:blk_rq_merge_ok+0xa8/0x128
Jul 27 06:00:00 linode user.warn kernel: Code: 98 00 00 00 0f 85 9d 00 00 00 55 53 48 89 f5 48 89 fb e8 9e dc ff ff 84 c0 74 07 31 c0 e9 82 00 00 00 48 8b 47 30 48 8b 76 68 <48> 39 70 68 75 eb f7 40 10 00 10 00 01 0f 95 c2 81 e1 00 10 00 01
Jul 27 06:00:00 linode user.warn kernel: RSP: 0018:ffffc90000277950 EFLAGS: 00010246
Jul 27 06:00:00 linode user.warn kernel: RAX: 0000000000000000 RBX: ffff88803c437500 RCX: 0000000000000801
Jul 27 06:00:00 linode user.warn kernel: RDX: ffff88803cb3b530 RSI: 0000000000000000 RDI: ffff88803c437500
Jul 27 06:00:00 linode user.warn kernel: RBP: ffff88803b8bd000 R08: 00000000ffffff00 R09: 0000000000000008
Jul 27 06:00:00 linode user.warn kernel: R10: ffffc90000277a78 R11: 0000000000001000 R12: ffff88803b8bd000
Jul 27 06:00:00 linode user.warn kernel: R13: ffffc900002779a8 R14: ffffc90000277d30 R15: ffffc900002779b0
Jul 27 06:00:00 linode user.warn kernel: FS:  00007fd314ea7500(0000) GS:ffff88803fa00000(0000) knlGS:0000000000000000
Jul 27 06:00:00 linode user.warn kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 27 06:00:00 linode user.warn kernel: CR2: 0000000000000068 CR3: 0000000024744000 CR4: 00000000003406f0
--
vergoh commented 1 year ago

I haven't seen this happen anywhere previously either. Based on the output, I suspect this is some sort of issue in the kernel but I have no idea how to do kernel debugging in this sort of situation. Few questions that may help understand this issue further:

My only guess it that there could by some issue with the virtio_net device kernel module but in that case I would have expected this to be a more common issue. Not sure if the hypervisor Linode provides could cause this sort of fault.

abelbeck commented 1 year ago

Hi @vergoh

how reproducible is this?

Not reproducible, so far it has happened only 2 times, at these times:

Apr 18 23:00:00
- Linux 4.19.236 #1 SMP PREEMPT x86_64 GNU/Linux
- vnStat 2.9
- sqlite3 3.38.2

Jul 27 06:00:00
- Linux 4.19.248 #1 SMP PREEMPT x86_64 GNU/Linux
- vnStat 2.9
- sqlite3 3.39.2

did this start happening after a kernel upgrade?

A minor kernel bump occurred from 4.19.230 to 4.19.236 on Mar 26 2022. No immediate correlation.

what error did vnstat return or was there some error from vnstatd?

Sorry I did not note the error, something like "could not open database" or such.

if this happens again, see what vnstat -D -tr says as that would try to just access the interface statistics in the same way the daemon does, if that works then starting the daemon manually with vnstatd -D -n may provide further clues

OK, thanks, I will try that next time.

My only guess it that there could by some issue with the virtio_net device kernel module but in that case I would have expected this to be a more common issue. Not sure if the hypervisor Linode provides could cause this sort of fault.

FYI, here are the recent virtio_net 4.19.y kernel changes:

2022-03-02 4.19.232 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.19.y&id=e9ffbe63f6f32f526a461756309b61c395168d73 2022-07-02 4.19.250 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.19.y&id=9222672fa6370f0ec3d899662cb8680e9282fc4c 2022-07-07 4.19.251 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.19.y&id=eb12a6398ee3bd33f4400bed756e72c23cc3d7f7

I am now running Linux 4.19.254. I will update if/when I get more data.

BTW, I also test on a single core Vultr instance (KVM similar to Linode, using virtio_net) and have never seen this issue.

abelbeck commented 1 year ago

Testing with:

- Linux 4.19.254 #1 SMP PREEMPT x86_64 GNU/Linux
- vnStat 2.9
- sqlite3 3.39.3

This issue has not occurred again for the last two months.

Closing, presume a fix to the Linux kernel solved the problem.