opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

Page faults, panic, (kernel) and system instability #7042

Closed ghost closed 5 months ago

ghost commented 11 months ago

Bug Description

The bug shows up as several stability problems. I work from home, so I am usually aware of networking issues immediately. They happen frequently enough that I started grabbing screenshots of the logs.

  1. Unexpected reboots
  2. VLANs randomly have no connectivity. I have management, IoT, trusted, and untrusted VLANs, and hosts become unreachable on some of these VLANs but not others.
  3. Sometimes I find the router in a state where no traffic gets routed (all VLANs are unresponsive) and I cannot access the management GUI for the box. A reboot typically resolves this until issue 1 or 2 crops up.

The last release on which I had a stable system was 23.1 Quintessential Quail. Even then I would sometimes hit issue 3 (above) and have to reboot, but that happened a few times a year, not a few times a week like it does now.

To Reproduce

I only really use VLANs, firewall rules between them, and a DHCP server for all the VLANs. These issues persist on the current release, OPNsense 23.7.9-amd64.

My upgrade change log was as follows:

23.1.7 (?) -> 23.7.8 (via GUI) -> 23.7.8 (fresh install via bootable media, carried over config) -> 23.7.9 (current, via GUI)

Expected behavior

Using only basic firewall and routing functionality, I expect stability that does not warrant several weekly reboots.

Alternatives I considered

I have performed a clean install from bootable media to the latest release at the time, then updated to the current 23.7.9 release. These issues started when I went from 23.1.x -> 23.7.x, so I am comfortable ruling out a hardware issue. Also, all of my managed switches appear stable when I probe them, and the network issues resolve after a reboot of OPNsense, so I am comfortable ruling out the rest of the network.

Screenshots

Here are 2 screenshots of the logs from when the router randomly rebooted. Read them from bottom (oldest) to top (most recent). The following severity levels were selected: emergency, alert, critical, error, warning, notice, debug.

[Screenshot: unexpected_reboot_full]

Here are 2 screenshots of the logs from when the system was not routing traffic. Read them from bottom (oldest) to top (most recent). The following severity levels were selected: emergency, alert, critical, error, warning, notice, debug.

[Screenshot: fault_and_panic_full]

Relevant log files

Screenshots of the logs are provided above. I can provide more comprehensive logs if told which ones are needed and how to collect them.

Environment, software, and hardware

OPNsense version:

  OPNsense 23.7.9-amd64
  FreeBSD 13.2-RELEASE-p5
  OpenSSL 1.1.1w

I run OPNsense on a 2-year-old Protectli Vault:

  Intel Celeron® J4125 Quad Core at 2 GHz (Burst up to 2.7 GHz)
  4 Intel® Gigabit Ethernet NIC ports
  M.2 SATA SSD
  8GB eMMC module on board
  Intel® AES-NI support

ghost commented 11 months ago

I am concerned about the log entries that read Fatal trap 12: page fault while in kernel mode or Fatal trap 9: general protection fault while in kernel mode, with a stack trace that includes vpanic() ... and panic() ..., and ends in KDB: enter: panic.
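For anyone who wants to check their own logs for the same messages, here is a minimal Python sketch. It assumes the system log has been exported to plain text lines; the function name, the sample lines, and the log layout are hypothetical, and only the message text quoted above is matched.

```python
import re

# Patterns for the kernel messages quoted above. The surrounding log
# layout (timestamps, "kernel:" prefix) is an assumption, so only the
# message text itself is matched.
PANIC_PATTERNS = [
    re.compile(r"Fatal trap \d+: .+ while in kernel mode"),
    re.compile(r"KDB: enter: panic"),
    re.compile(r"\bv?panic\(\)"),  # matches both panic() and vpanic()
]


def find_panic_lines(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if any(pattern.search(text) for pattern in PANIC_PATTERNS):
            hits.append((number, text.rstrip("\n")))
    return hits


# Hypothetical sample lines, for illustration only (not taken from the
# screenshots in this issue).
SAMPLE_LINES = [
    "kernel: Fatal trap 12: page fault while in kernel mode",
    "kernel: igb0: link state changed to UP",
    "kernel: vpanic() at vpanic+0x151",
    "kernel: KDB: enter: panic",
]

if __name__ == "__main__":
    for number, text in find_panic_lines(SAMPLE_LINES):
        print(f"{number}: {text}")
```

Running the same function over an exported log file (for example `find_panic_lines(open(path))`, with `path` pointing at wherever the export was saved) would list every line worth including in a report.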

Unless the support tag is just preliminary triage, the logs do seem to indicate that the issue was not caused by user misconfiguration through the GUI, especially on a clean install.

If you believe there is no concern on the software side, I can follow up with Protectli, who sell their hardware with your software. Hardware does go bad sometimes; maybe that is what is happening.

AdSchellevis commented 11 months ago

It's most likely hardware or driver related indeed; our forum is usually a better place to discuss these kinds of issues.

ghost commented 11 months ago

Okay, thanks for the pointers. I will chase this issue and see if I can resolve it on the hardware side of things.

I suppose this issue can be closed now. However, if you leave it open a bit longer, I can confirm whether it was indeed a driver or hardware issue once I am able to resolve it.

AdSchellevis commented 11 months ago

I don't mind leaving it open. If someone else walks in with similar issues on a similar device, it might help you both. When it's concluded on your end, that's also a good moment to close it. Issues without an owner automatically close in 6 months anyway.

OPNsense-bot commented 5 months ago

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.