opnsense / src

OPNsense operating system on top of FreeBSD
https://opnsense.org/

unexplained kernel panic #179

Closed lelemka0 closed 5 months ago

lelemka0 commented 1 year ago

Describe the bug

I get kernel panics 2-3 times a day, after which the system restarts automatically. I'm not sure what is causing them; they started after I upgraded OPNsense from 23.1.7_3 to 23.1.9. I did not modify any configuration.

To Reproduce

In my case it started after the upgrade; I'm not sure how to reproduce it.

Expected behavior

no kernel panic

Relevant log files

2023-06-16T00:23:11 | Notice | kernel | ---<<BOOT>>--- |  
-- | -- | -- | -- | --
2023-06-16T00:23:11 | Notice | kernel | KDB: enter: panic |  
2023-06-16T00:23:11 | Notice | kernel | mi_startup() at mi_startup+0xdf/frame 0x24dc000 |  
2023-06-16T00:23:11 | Notice | kernel | --- trap 0x80d09200, rip = 0xffffffff80c311df, rsp = 0, rbp = 0x24dc000 --- |  
2023-06-16T00:23:11 | Notice | kernel | fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0084252f30 |  
2023-06-16T00:23:11 | Notice | kernel | fork_exit() at fork_exit+0x7e/frame 0xfffffe0084252f30 |  
2023-06-16T00:23:11 | Notice | kernel | ithread_loop() at ithread_loop+0x25a/frame 0xfffffe0084252ef0 |  
2023-06-16T00:23:11 | Notice | kernel | vtnet_rx_vq_process() at vtnet_rx_vq_process+0xb7/frame 0xfffffe0084252e60 |  
2023-06-16T00:23:11 | Notice | kernel | vtnet_rxq_eof() at vtnet_rxq_eof+0x73e/frame 0xfffffe0084252e20 |  
2023-06-16T00:23:11 | Notice | kernel | ether_input() at ether_input+0x69/frame 0xfffffe0084252d60 |  
2023-06-16T00:23:11 | Notice | kernel | netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe0084252d00 |  
2023-06-16T00:23:11 | Notice | kernel | ether_nh_input() at ether_nh_input+0x35a/frame 0xfffffe0084252cb0 |  
2023-06-16T00:23:11 | Notice | kernel | ether_demux() at ether_demux+0x121/frame 0xfffffe0084252c50 |  
2023-06-16T00:23:11 | Notice | kernel | ether_input() at ether_input+0x69/frame 0xfffffe0084252c20 |
2023-06-16T00:23:11 | Notice | kernel | netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe0084252bc0 |  
2023-06-16T00:23:11 | Notice | kernel | ether_nh_input() at ether_nh_input+0x1f1/frame 0xfffffe0084252b70 |  
2023-06-16T00:23:11 | Notice | kernel | ng_ether_input() at ng_ether_input+0x4c/frame 0xfffffe0084252b10 |  
2023-06-16T00:23:11 | Notice | kernel | ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe0084252ae0 |  
2023-06-16T00:23:11 | Notice | kernel | ng_apply_item() at ng_apply_item+0x2bd/frame 0xfffffe0084252aa0 |  
2023-06-16T00:23:11 | Notice | kernel | ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe0084252a00 |  
2023-06-16T00:23:11 | Notice | kernel | ng_apply_item() at ng_apply_item+0x2bd/frame 0xfffffe00842529c0 |  
2023-06-16T00:23:11 | Notice | kernel | ng_ether_rcv_upper() at ng_ether_rcv_upper+0x8c/frame 0xfffffe0084252920 |  
2023-06-16T00:23:11 | Notice | kernel | ether_demux() at ether_demux+0x138/frame 0xfffffe0084252900 |  
2023-06-16T00:23:11 | Notice | kernel | netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00842528d0 |  
2023-06-16T00:23:11 | Notice | kernel | ip6_input() at ip6_input+0x60f/frame 0xfffffe0084252880 |  
2023-06-16T00:23:11 | Notice | kernel | ip6_tryforward() at ip6_tryforward+0x2ce/frame 0xfffffe00842527a0 |  
2023-06-16T00:23:11 | Notice | kernel | pfil_run_hooks() at pfil_run_hooks+0x97/frame 0xfffffe0084252720 |  
2023-06-16T00:23:11 | Notice | kernel | pf_check6_out() at pf_check6_out+0x40/frame 0xfffffe00842526e0 |  
2023-06-16T00:23:11 | Notice | kernel | pf_test6() at pf_test6+0xfdb/frame 0xfffffe00842526b0 |  
2023-06-16T00:23:11 | Notice | kernel | pf_refragment6() at pf_refragment6+0x14f/frame 0xfffffe0084252540 |  
2023-06-16T00:23:11 | Notice | kernel | ip6_forward() at ip6_forward+0x62d/frame 0xfffffe00842524f0 |  
2023-06-16T00:23:11 | Notice | kernel | --- trap 0xc, rip = 0xffffffff80eb8dcd, rsp = 0xfffffe00842523d0, rbp = 0xfffffe00842524f0 --- |  
2023-06-16T00:23:11 | Notice | kernel | calltrap() at calltrap+0x8/frame 0xfffffe0084252300 |  
2023-06-16T00:23:11 | Notice | kernel | trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0084252300 |  
2023-06-16T00:23:11 | Notice | kernel | trap_fatal() at trap_fatal+0x385/frame 0xfffffe00842522a0 |  
2023-06-16T00:23:11 | Notice | kernel | panic() at panic+0x43/frame 0xfffffe0084252240 |  
2023-06-16T00:23:11 | Notice | kernel | vpanic() at vpanic+0x17f/frame 0xfffffe00842521e0 |  
2023-06-16T00:23:11 | Notice | kernel | db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0084252190 |  
2023-06-16T00:23:11 | Notice | kernel | KDB: stack backtrace: |  
2023-06-16T00:23:11 | Notice | kernel | time = 1686874897 |  
2023-06-16T00:23:11 | Notice | kernel | cpuid = 2 |  
2023-06-16T00:23:11 | Notice | kernel | panic: page fault
2023-06-16T00:23:11 | Notice | kernel | trap number = 12 |  
2023-06-16T00:23:11 | Notice | kernel | current process = 12 (irq29: virtio_pci2) |  
2023-06-16T00:23:11 | Notice | kernel | processor eflags = interrupt enabled, resume, IOPL = 0 |  
2023-06-16T00:23:11 | Notice | kernel | = DPL 0, pres 1, long 1, def32 0, gran 1 |  
2023-06-16T00:23:11 | Notice | kernel | code segment = base 0x0, limit 0xfffff, type 0x1b |  
2023-06-16T00:23:11 | Notice | kernel | frame pointer = 0x28:0xfffffe00842524f0 |  
2023-06-16T00:23:11 | Notice | kernel | stack pointer = 0x28:0xfffffe00842523d0 |  
2023-06-16T00:23:11 | Notice | kernel | instruction pointer = 0x20:0xffffffff80eb8dcd |  
2023-06-16T00:23:11 | Notice | kernel | fault code = supervisor read data, page not present |  
2023-06-16T00:23:11 | Notice | kernel | fault virtual address = 0x10 |  
2023-06-16T00:23:11 | Notice | kernel | cpuid = 2; apic id = 02 |  
2023-06-16T00:23:11 | Notice | kernel | Fatal trap 12: page fault while in kernel mode |
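The log table above is listed newest-first, so the panic message and backtrace read bottom-up. As a reading aid, here is a minimal Python sketch that pulls out the message column and puts it back in chronological order. It assumes rows of the form `timestamp | level | process | message |`; the function name `chronological_messages` is hypothetical and not part of any OPNsense tooling.

```python
def chronological_messages(table_text: str) -> list[str]:
    """Return the message column of a newest-first pipe table, oldest first."""
    messages = []
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip blank lines and the separator row ("-- | -- | ...").
        if len(cells) < 4 or all(set(cell) <= {"-"} for cell in cells):
            continue
        messages.append(cells[3])
    messages.reverse()
    return messages


# A few rows copied from the table above, in their original (reversed) order.
sample = """\
2023-06-16T00:23:11 | Notice | kernel | ip6_forward() at ip6_forward+0x62d/frame 0xfffffe00842524f0 |
-- | -- | -- | -- | --
2023-06-16T00:23:11 | Notice | kernel | panic: page fault
2023-06-16T00:23:11 | Notice | kernel | fault virtual address = 0x10 |
2023-06-16T00:23:11 | Notice | kernel | Fatal trap 12: page fault while in kernel mode |
"""

for message in chronological_messages(sample):
    print(message)
```

Read in this order, the excerpt starts with the fatal trap line and ends with the `ip6_forward()` frame, matching the layout of the plain-text panic log posted later in the thread.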

Environment

OPNsense 23.1.9-amd64, OpenSSL, running on Proxmox VE (i440fx, cpu: host)
Intel(R) Xeon(R) CPU E3-1265L v3 @ 2.50GHz (4 cores, 4 threads)
Network: VirtIO

fichtner commented 1 year ago

Hi,

Looks like an issue in the vtnet(4) driver during packet receive. Not much else can be said here for now. Perhaps https://bugs.freebsd.org has some clues about this...

Cheers, Franco

lelemka0 commented 1 year ago

This problem keeps coming up, and I have been investigating and experimenting more recently. I found that it only occurs when two WAN ports are in use; with either one turned off, the system runs stably for a long time.

Before each panic, the log contains a large number of repeated, similar entries like the following:

<7>cannot forward src fe80:3::1, dst <my vtnet2 port's expired ipv6 address>, nxt 58, rcvif vtnet2, outif pppoe0

Here vtnet2 and pppoe0 are the two WAN ports.

I tried capturing packets on the vtnet2 interface, but I never found a packet with the source address fe80:3::1. Based on nxt 58 and the timing, I think this packet is an ICMPv6 type 135 Neighbor Solicitation from the ISP device with link-local address fe80::1, and the requested target address is the most recently expired IPv6 address on vtnet2. It appears that, since the address no longer exists on the router, the request is forwarded to the default gateway and fails.
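A note on why fe80:3::1 never shows up in a capture: FreeBSD's KAME-derived IPv6 stack keeps link-local addresses in an internal form with the interface index embedded in the second 16-bit group. A kernel message showing `fe80:3::1` is therefore consistent with `fe80::1` on the wire, received on the interface with index 3. The Python sketch below illustrates the conversion under that assumption; `split_embedded_scope` is a hypothetical helper, not an existing API.

```python
import ipaddress


def split_embedded_scope(kernel_addr: str) -> tuple[str, int]:
    """Split a KAME-style link-local address into (wire address, interface index).

    Assumes the second 16-bit group carries the embedded interface index,
    as in kernel log messages on FreeBSD.
    """
    addr = ipaddress.IPv6Address(kernel_addr)
    if not addr.is_link_local:
        raise ValueError(f"{kernel_addr} is not a link-local address")
    packed = bytearray(addr.packed)
    # Bytes 2-3 hold the embedded scope (interface index) in kernel form.
    scope_id = int.from_bytes(packed[2:4], "big")
    packed[2:4] = b"\x00\x00"
    return str(ipaddress.IPv6Address(bytes(packed))), scope_id


wire_addr, if_index = split_embedded_scope("fe80:3::1")
print(wire_addr, if_index)  # fe80::1 3
```

If index 3 corresponds to vtnet2 on this system (something `ifconfig` output would confirm), the source in the log line is the ISP device's fe80::1 as guessed above.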

I see that the firewall rules allow ICMPv6 types 135 and 136 from and to all interfaces by default, and I cannot change that. This may not be a driver issue. Could you take a look at it for me? Thanks very much.

lelemka0 commented 1 year ago

Latest log:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x10
fault code      = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80ea3b9c
stack pointer           = 0x28:0xfffffe008fa183f0
frame pointer           = 0x28:0xfffffe008fa18510
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process     = 12 (irq30: virtio_pci2)
trap number     = 12
panic: page fault
cpuid = 1
time = 1694263139
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008fa181b0
vpanic() at vpanic+0x151/frame 0xfffffe008fa18200
panic() at panic+0x43/frame 0xfffffe008fa18260
trap_fatal() at trap_fatal+0x387/frame 0xfffffe008fa182c0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe008fa18320
calltrap() at calltrap+0x8/frame 0xfffffe008fa18320
--- trap 0xc, rip = 0xffffffff80ea3b9c, rsp = 0xfffffe008fa183f0, rbp = 0xfffffe008fa18510 ---
ip6_forward() at ip6_forward+0x60c/frame 0xfffffe008fa18510
pf_refragment6() at pf_refragment6+0x14f/frame 0xfffffe008fa18560
pf_test6() at pf_test6+0xfdf/frame 0xfffffe008fa186d0
pf_check6_out() at pf_check6_out+0x40/frame 0xfffffe008fa18700
pfil_run_hooks() at pfil_run_hooks+0x97/frame 0xfffffe008fa18740
ip6_tryforward() at ip6_tryforward+0x2ce/frame 0xfffffe008fa187c0
ip6_input() at ip6_input+0x5e4/frame 0xfffffe008fa188a0
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe008fa188f0
ether_demux() at ether_demux+0x159/frame 0xfffffe008fa18920
ng_ether_rcv_upper() at ng_ether_rcv_upper+0x8c/frame 0xfffffe008fa18940
ng_apply_item() at ng_apply_item+0x2bf/frame 0xfffffe008fa189d0
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe008fa18a10
ng_apply_item() at ng_apply_item+0x2bf/frame 0xfffffe008fa18aa0
ng_snd_item() at ng_snd_item+0x28e/frame 0xfffffe008fa18ae0
ng_ether_input() at ng_ether_input+0x4c/frame 0xfffffe008fa18b10
ether_nh_input() at ether_nh_input+0x1f2/frame 0xfffffe008fa18b70
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe008fa18bc0
ether_input() at ether_input+0x69/frame 0xfffffe008fa18c20
ether_demux() at ether_demux+0xa0/frame 0xfffffe008fa18c50
ether_nh_input() at ether_nh_input+0x36b/frame 0xfffffe008fa18cb0
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe008fa18d00
ether_input() at ether_input+0x69/frame 0xfffffe008fa18d60
vtnet_rxq_eof() at vtnet_rxq_eof+0x80/frame 0xfffffe008fa18e20
vtnet_rx_vq_process() at vtnet_rx_vq_process+0xb7/frame 0xfffffe008fa18e60
ithread_loop() at ithread_loop+0x25a/frame 0xfffffe008fa18ef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe008fa18f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe008fa18f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
Uptime: 4h6m29s
---<<BOOT>>---
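For comparing panic reports like the one above, it can help to reduce each log to the fault address and the function that was executing when the trap fired. The following Python sketch does that for the plain-text format shown here; it assumes the frame following the `--- trap 0xc` line is the faulting one, and `summarize_panic` is a hypothetical name. A fault address as small as 0x10 is consistent with reading a field through a NULL pointer, though the log alone does not prove that.

```python
import re

FRAME_RE = re.compile(r"^(?P<func>\w+)\(\) at (?P=func)\+0x[0-9a-f]+/frame")
FAULT_RE = re.compile(r"^fault virtual address\s*=\s*(0x[0-9a-f]+)")


def summarize_panic(log_text: str) -> dict:
    """Extract fault address, backtrace function names, and the faulting function."""
    frames = []
    fault_address = None
    faulting = None
    after_trap = False
    for raw in log_text.splitlines():
        line = raw.strip()
        match = FAULT_RE.match(line)
        if match:
            fault_address = int(match.group(1), 16)
            continue
        if line.startswith("--- trap 0xc"):
            # The next frame is where the page fault (trap 12) happened.
            after_trap = True
            continue
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group("func"))
            if after_trap and faulting is None:
                faulting = match.group("func")
    return {"fault_address": fault_address, "frames": frames, "faulting": faulting}


# Excerpt copied from the log above.
excerpt = """\
fault virtual address   = 0x10
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008fa181b0
calltrap() at calltrap+0x8/frame 0xfffffe008fa18320
--- trap 0xc, rip = 0xffffffff80ea3b9c, rsp = 0xfffffe008fa183f0, rbp = 0xfffffe008fa18510 ---
ip6_forward() at ip6_forward+0x60c/frame 0xfffffe008fa18510
pf_refragment6() at pf_refragment6+0x14f/frame 0xfffffe008fa18560
"""

summary = summarize_panic(excerpt)
print(hex(summary["fault_address"]), summary["faulting"])
```

Both logs in this thread reduce to the same pair: address 0x10 in `ip6_forward()`, called from `pf_refragment6()`.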

Version:

OPNsense 23.7.3-amd64
FreeBSD 13.2-RELEASE-p2
OpenSSL 1.1.1v 1 Aug 2023

Maybe it’s the same problem as #184

fichtner commented 1 year ago

Different bug. I have the suspicion this is something inherently broken in FreeBSD 13 and nobody bothers to fix it there. We tried to apply a bandaid but it may not be working for everyone, see https://github.com/opnsense/src/commit/8bf1ae0b1e5987fc07743928cf3aa0d501439d37

fichtner commented 1 year ago

I meant to post https://github.com/opnsense/src/commit/fe901c3661ea71f6aa688098184e07fd3a0d85bd but, looking at it, for you ip6_forward fails, which is what was fixed, because the new path is ip6_output. Strange.

fichtner commented 5 months ago

No response after posting a probable fix...

lelemka0 commented 5 months ago

In the latest version (OPNsense 24.1.4-amd64, FreeBSD 13.2-RELEASE-p10, OpenSSL 3.0.13), this still occurs from time to time. The panics seem to occur at a regular interval of about 4 hours, and the schedule is not affected by manual restarts. They always appear around these times: 0:10, 4:10, 8:10, 12:10, 16:10, 20:10.
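One way to check a suspected schedule like this is to convert the `time = ` values printed in each panic log (Unix epoch seconds) and see where they fall inside a 4-hour window. The Python sketch below does this for the two values from the logs earlier in this thread; `describe_panic_time` is a hypothetical helper, and the window is aligned to UTC midnight, so a local time zone whose offset is not a multiple of 4 hours would shift the result.

```python
from datetime import datetime, timezone

PERIOD_SECONDS = 4 * 60 * 60  # the roughly 4-hour interval reported above


def describe_panic_time(epoch_seconds: int) -> tuple[str, int]:
    """Return (UTC timestamp, whole minutes elapsed within the 4-hour window)."""
    utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    offset_minutes = (epoch_seconds % PERIOD_SECONDS) // 60
    return utc.strftime("%Y-%m-%d %H:%M:%S"), offset_minutes


# "time = ..." values from the two panic logs posted earlier in this thread.
for epoch in (1686874897, 1694263139):
    print(describe_panic_time(epoch))
```

Those two older panics fall 21 and 38 minutes into their windows, so they are only loosely aligned; collecting the same value from several recent panics would show whether the 4-hour pattern holds.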

I'm not sure why this is, but as a temporary solution, I turned off ipv6 completely (configure ipv6 on all interfaces to None) then the panic no longer occurs.

fgtfv567 commented 1 month ago

> I'm not sure why this is, but as a temporary solution, I turned off ipv6 completely (configure ipv6 on all interfaces to None) then the panic no longer occurs.

Having a very similar problem to you, how specifically did you turn ipv6 off completely?

fichtner commented 1 month ago

I'm relatively sure the problem is gone from 24.7, if not 24.1.x too.

fgtfv567 commented 1 month ago

> I'm relatively sure the problem is gone from 24.7, if not 24.1.x too.

I updated my OPNsense installation last night and it crashed on me an hour ago, so for me the problem is definitely not fixed. If you want to look through my crash reports, I have been submitting them as soon as I notice each crash. Email is Fgtfv567@gmail.com. Any insight into my problem would be welcome.

lelemka0 commented 1 month ago

> how specifically did you turn ipv6 off completely?

Set the IPv6 configuration type of all interfaces to None. Actually, in my case, the regular OPNsense crashes were caused by dnscrypt-proxy running latency tests against IPv6 upstream servers. I realized this by chance. Since I disabled the IPv6 DNS servers in dnscrypt-proxy, the kernel panic has never happened again, so I believe the root cause lies in large volumes of frequent IPv6 ICMP traffic. The problem does exist in 24.1, but I have not tried 24.7.

TheOfficialMrBlah commented 4 days ago

Same problem for me. In my case, the kernel panic almost always occurs after a CronJob "Periodic interface reset".

OPNSense version: OPNsense 24.7.4_1-amd64 FreeBSD 14.1-RELEASE-p4

But I also had this error with the previous versions.