morey-tech / homelab

Quasar Unreachable #45

Open morey-tech opened 3 weeks ago

morey-tech commented 3 weeks ago

While streaming episode v0.5.0 of StruggleOps, I found that host quasar (192.168.1.30:8006) was unreachable.

morey-tech commented 3 weeks ago

I could not access the host terminal, and the display would not turn on. Pressing the power button momentarily did not shut down the host. I was able to shut it down by pressing and holding the power button. I then rebooted it, and it was available on the network again.

morey-tech commented 3 weeks ago

The last journal entry was at Aug 23 14:01:16, almost 7 hours before the reboot.

Aug 23 14:01:16 quasar pvestatd[1247]: status update time (7.190 seconds)
-- Boot 670a31232df344d8b371cc771df5ac38 --
Aug 23 20:46:06 quasar kernel: Linux version 6.8.4-2-pve (build@proxmox)
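
The gap between the last journal entry and the next boot can be worked out from the two timestamps above. A minimal sketch, assuming GNU `date` (for `-d`) and the year 2024, which the journal lines do not show; `gap_seconds` is a hypothetical helper name:

```shell
# Timestamps copied from the journal excerpt; the year is an assumption.
last_entry="2024-08-23 14:01:16"
next_boot="2024-08-23 20:46:06"

# Seconds between two timestamps, both interpreted as UTC so that
# local DST rules cannot skew the difference.
gap_seconds() {
  echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}

gap=$(gap_seconds "$last_entry" "$next_boot")
printf '%dh %dm\n' $(( gap / 3600 )) $(( gap % 3600 / 60 ))
```

On the host, `journalctl --list-boots` lists the recorded boots, and `journalctl -b -1 -n 5` shows the last five entries of the previous boot.
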
morey-tech commented 3 weeks ago

The system stats in Proxmox stop around the same time.

(Screenshot: 2024-08-23-20-56-22)

morey-tech commented 3 weeks ago

Here are all the journalctl error entries around the incident. Nothing notable happens around the last entry at 14:01.

Aug 19 04:39:38 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 19 04:39:45 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 19 04:39:55 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 19 04:40:12 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:30:03 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:39:31 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:39:43 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:39:51 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:40:01 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:40:17 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:30:00 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:39:32 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:39:45 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:39:52 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:40:02 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:40:18 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:30:01 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:39:35 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:39:48 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:39:55 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:40:05 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:40:22 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:30:03 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:39:38 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:39:51 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:39:58 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:40:08 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:40:25 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 12:00:30 quasar pveproxy[4183702]: problem with client ::ffff:172.16.0.2; Connection reset by peer
Aug 23 12:33:38 quasar pveproxy[2150496]: problem with client ::ffff:172.16.0.2; Connection reset by peer
Aug 23 13:44:07 quasar pveproxy[2169430]: problem with client ::ffff:172.16.0.2; Connection reset by peer
-- Boot 670a31232df344d8b371cc771df5ac38 --
Aug 23 20:46:06 quasar kernel: pci 0000:00:07.2: DPC: RP PIO log size 0 is invalid
Aug 23 20:46:06 quasar kernel: Bluetooth: hci0: Failed to load firmware file (-2)
Aug 23 20:46:06 quasar kernel: Bluetooth: hci0: Failed to set up firmware (-2)
morey-tech commented 3 weeks ago

The timing of the EXT4-fs errors corresponds with the backup schedule for the guests on Proxmox (04:30 daily), so they are likely caused by the backups and unrelated to this issue.
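
To make the daily clustering easier to see, the journal lines can be counted per day and hour. This is only a sketch: `count_by_hour` is a hypothetical helper, it assumes the default short journal format (month, day, time as the first three fields), and the sort only orders days within a single month.

```shell
# Count journal lines per day and hour. Reads journal text on stdin and
# skips lines whose third field is not HH:MM:SS (e.g. "-- Boot ... --").
count_by_hour() {
  awk '$3 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
         split($3, t, ":")
         n[$1 " " $2 " " t[1] ":00"]++
       }
       END { for (k in n) print n[k], k }' | sort -k2,2 -k3,3n -k4,4
}

# Assumed usage on the host (kernel messages of the previous boot):
# journalctl -k -b -1 --grep 'EXT4-fs' --no-pager | count_by_hour
```

Each output line is a count followed by the day and hour bucket, so a backup-related pattern should show up as the same hour repeating every day.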

morey-tech commented 2 weeks ago

It happened again. I found Plex unavailable today at 16:00. The last log entry was at Sep 02 01:20:50.

Sep 02 01:04:50 quasar postfix/qmgr[1205]: E8BAC201162: from=<root@quasar.home.morey.tech>, size=69186, nrcpt=1 (queue active)
Sep 02 01:05:20 quasar postfix/smtp[265130]: connect to smtp.google.com[142.250.31.27]:25: Connection timed out
Sep 02 01:05:50 quasar postfix/smtp[265130]: connect to smtp.google.com[142.250.31.26]:25: Connection timed out
Sep 02 01:06:20 quasar postfix/smtp[265130]: connect to smtp.google.com[142.251.111.27]:25: Connection timed out
Sep 02 01:06:20 quasar postfix/smtp[265130]: connect to smtp.google.com[2607:f8b0:4004:c19::1b]:25: Network is unreachable
Sep 02 01:06:20 quasar postfix/smtp[265130]: connect to smtp.google.com[2607:f8b0:4004:c19::1a]:25: Network is unreachable
Sep 02 01:06:20 quasar postfix/smtp[265130]: E8BAC201162: to=<info+quasar.home@morey.tech>, relay=none, delay=246342, delays=246252/0.01/90/0, dsn=4.4.1, status=deferred (connect to smtp.google.com[26>
Sep 02 01:09:50 quasar postfix/qmgr[1205]: 185D320111D: from=<root@quasar.home.morey.tech>, size=69180, nrcpt=1 (queue active)
Sep 02 01:10:20 quasar postfix/smtp[266934]: connect to smtp.google.com[142.250.31.26]:25: Connection timed out
Sep 02 01:10:50 quasar postfix/smtp[266934]: connect to smtp.google.com[142.251.111.27]:25: Connection timed out
Sep 02 01:11:20 quasar postfix/smtp[266934]: connect to smtp.google.com[142.251.111.26]:25: Connection timed out
Sep 02 01:11:20 quasar postfix/smtp[266934]: connect to smtp.google.com[2607:f8b0:4004:c19::1a]:25: Network is unreachable
Sep 02 01:11:20 quasar postfix/smtp[266934]: connect to smtp.google.com[2607:f8b0:4004:c0b::1a]:25: Network is unreachable
Sep 02 01:11:20 quasar postfix/smtp[266934]: 185D320111D: to=<info+quasar.home@morey.tech>, relay=none, delay=73828, delays=73738/0.01/90/0, dsn=4.4.1, status=deferred (connect to smtp.google.com[2607>
Sep 02 01:17:01 quasar CRON[269595]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 02 01:17:01 quasar CRON[269597]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 02 01:17:01 quasar CRON[269595]: pam_unix(cron:session): session closed for user root
Sep 02 01:19:50 quasar postfix/qmgr[1205]: 63BE22010C3: from=<root@quasar.home.morey.tech>, size=2818, nrcpt=1 (queue active)
Sep 02 01:19:50 quasar postfix/qmgr[1205]: 661372010CF: from=<root@quasar.home.morey.tech>, size=2838, nrcpt=1 (queue active)
Sep 02 01:19:50 quasar postfix/smtp[270612]: connect to smtp.google.com[2607:f8b0:4004:c19::1b]:25: Network is unreachable
Sep 02 01:19:50 quasar postfix/smtp[270613]: connect to smtp.google.com[2607:f8b0:4004:c19::1a]:25: Network is unreachable
Sep 02 01:19:50 quasar postfix/smtp[270613]: connect to smtp.google.com[2607:f8b0:4004:c0b::1b]:25: Network is unreachable
Sep 02 01:19:50 quasar postfix/smtp[270613]: connect to smtp.google.com[2607:f8b0:4004:c19::1b]:25: Network is unreachable
Sep 02 01:20:20 quasar postfix/smtp[270612]: connect to smtp.google.com[142.250.31.26]:25: Connection timed out
Sep 02 01:20:20 quasar postfix/smtp[270613]: connect to smtp.google.com[142.251.111.27]:25: Connection timed out
Sep 02 01:20:50 quasar postfix/smtp[270612]: connect to smtp.google.com[142.250.31.27]:25: Connection timed out
Sep 02 01:20:50 quasar postfix/smtp[270613]: connect to smtp.google.com[142.250.31.27]:25: Connection timed out
Sep 02 01:20:50 quasar postfix/smtp[270612]: connect to smtp.google.com[2607:f8b0:4004:c0b::1b]:25: Network is unreachable
Sep 02 01:20:50 quasar postfix/smtp[270612]: connect to smtp.google.com[2607:f8b0:4004:c0b::1a]:25: Network is unreachable
Sep 02 01:20:50 quasar postfix/smtp[270613]: 661372010CF: to=<info+quasar.home@morey.tech>, relay=none, delay=330289, delays=330229/0.01/60/0, dsn=4.4.1, status=deferred (connect to smtp.google.com[14>
Sep 02 01:20:50 quasar postfix/smtp[270612]: 63BE22010C3: to=<info+quasar.home@morey.tech>, relay=none, delay=418494, delays=418434/0.01/60/0, dsn=4.4.1, status=deferred (connect to smtp.google.com[26>
-- Boot cbc1ece1b7034071bb3e8a33f88f6034 --
Sep 02 15:55:57 quasar kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) ()
Sep 02 15:55:57 quasar kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=/dev/mapper/pve-root ro quiet
Sep 02 15:55:57 quasar kernel: KERNEL supported cpus:
Sep 02 15:55:57 quasar kernel:   Intel GenuineIntel
Sep 02 15:55:57 quasar kernel:   AMD AuthenticAMD
Sep 02 15:55:57 quasar kernel:   Hygon HygonGenuine
Sep 02 15:55:57 quasar kernel:   Centaur CentaurHauls
Sep 02 15:55:57 quasar kernel:   zhaoxin   Shanghai  
Sep 02 15:55:57 quasar kernel: x86/tme: not enabled by BIOS
Sep 02 15:55:57 quasar kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Sep 02 15:55:57 quasar kernel: BIOS-provided physical RAM map:

(Screenshot: Xnapper-2024-09-02-16 00 27)

morey-tech commented 2 weeks ago

(Image: IMG_4376)

The Call Trace on the screen indicates a kernel panic.

https://pve.proxmox.com/wiki/Kernel_Crash_Trace_Log

morey-tech commented 1 week ago

One of the LXCs was getting OOM-killed due to insufficient memory allocation. This may be the cause of the system halting, as another user reported this behaviour on Reddit. I've increased the memory on the container and will monitor for another OOM kill followed by a system halt.
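
A rough way to watch for further OOM kills is to filter the kernel log for the usual OOM-killer phrasings. This is a sketch only: `oom_events` is a hypothetical helper, and the patterns are common kernel wordings, not lines copied from this host.

```shell
# Print OOM-killer related lines from journal text on stdin.
# The patterns are typical kernel phrasings (an assumption, not taken
# from quasar's logs).
oom_events() {
  grep -Ei 'out of memory|oom-kill|oom_reaper'
}

# Assumed usage on the host (kernel messages of the previous boot):
# journalctl -k -b -1 --no-pager | oom_events
```

One way to raise a container's limit is `pct set <vmid> --memory <MiB>`; the comment above does not say which method or values were actually used.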

morey-tech commented 1 week ago

Since fixing the OOM issues with the LXC container, the host has run into another kernel panic around Sep 07 03:38:03 while still running the 2x 48GB DIMMs.

root@quasar:~# journalctl -p err
Sep 07 03:38:03 quasar kernel: BUG: unable to handle page fault for address: 00000000000359e0
Sep 07 03:38:03 quasar kernel: #PF: supervisor write access in kernel mode
Sep 07 03:38:03 quasar kernel: #PF: error_code(0x0002) - not-present page
Detailed Logs

```
root@quasar:~# journalctl
Sep 07 03:38:03 quasar kernel: BUG: unable to handle page fault for address: 00000000000359e0
Sep 07 03:38:03 quasar kernel: #PF: supervisor write access in kernel mode
Sep 07 03:38:03 quasar kernel: #PF: error_code(0x0002) - not-present page
Sep 07 03:38:03 quasar kernel: PGD 0 P4D 0
Sep 07 03:38:03 quasar kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Sep 07 03:38:03 quasar kernel: CPU: 13 PID: 1984 Comm: kvm Tainted: P O 6.8.4-2-pve #1
Sep 07 03:38:03 quasar kernel: Hardware name: Micro Computer (HK) Tech Limited Venus Series/AHWSA, BIOS AHWSA.1.22 03/12/2024
Sep 07 03:38:03 quasar kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x276/0x2d0
Sep 07 03:38:03 quasar kernel: Code: 90 49 8b 14 24 48 85 d2 74 f5 eb e7 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 c0 59 03 00>
Sep 07 03:38:03 quasar kernel: RSP: 0018:ffffb52fc33db998 EFLAGS: 00010002
Sep 07 03:38:03 quasar kernel: RAX: 00000000000359e0 RBX: ffff8d2d9719c1c8 RCX: 0000000000380000
Sep 07 03:38:03 quasar kernel: RDX: 00000000000020c5 RSI: 0000000083198319 RDI: ffff8d2d9719c1c8
Sep 07 03:38:03 quasar kernel: RBP: ffffb52fc33db9b8 R08: 0000000000000000 R09: 0000000000000000
Sep 07 03:38:03 quasar kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d44cf8b59c0
Sep 07 03:38:03 quasar kernel: R13: 0000000000000000 R14: 000000000000000d R15: 0000000000000002
Sep 07 03:38:03 quasar kernel: FS: 00007ed9470ee4c0(0000) GS:ffff8d44cf880000(0000) knlGS:0000000000000000
Sep 07 03:38:03 quasar kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 07 03:38:03 quasar kernel: CR2: 00000000000359e0 CR3: 00000001288d6000 CR4: 0000000000f52ef0
Sep 07 03:38:03 quasar kernel: PKRU: 55555554
Sep 07 03:38:03 quasar kernel: Call Trace:
Sep 07 03:38:03 quasar kernel:
Sep 07 03:38:03 quasar kernel: ? show_regs+0x6d/0x80
Sep 07 03:38:03 quasar kernel: ? __die+0x24/0x80
Sep 07 03:38:03 quasar kernel: ? page_fault_oops+0x176/0x500
Sep 07 03:38:03 quasar kernel: ? update_cfs_group+0xcf/0xf0
Sep 07 03:38:03 quasar kernel: ? psi_group_change+0x1fb/0x460
Sep 07 03:38:03 quasar kernel: ? do_user_addr_fault+0x2f9/0x6b0
Sep 07 03:38:03 quasar kernel: ? exc_page_fault+0x83/0x1b0
Sep 07 03:38:03 quasar kernel: ? asm_exc_page_fault+0x27/0x30
Sep 07 03:38:03 quasar kernel: ? native_queued_spin_lock_slowpath+0x276/0x2d0
Sep 07 03:38:03 quasar kernel: _raw_spin_lock_irqsave+0x5c/0x80
Sep 07 03:38:03 quasar kernel: remove_wait_queue+0x17/0x60
Sep 07 03:38:03 quasar kernel: poll_freewait+0x42/0xb0
Sep 07 03:38:03 quasar kernel: do_sys_poll+0x3a9/0x610
Sep 07 03:38:03 quasar kernel: ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel: ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel: ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel: ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel: ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel: ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel: ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel: ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel: ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel: __x64_sys_ppoll+0xde/0x170
Sep 07 03:38:03 quasar kernel: do_syscall_64+0x84/0x180
Sep 07 03:38:03 quasar kernel: ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel: ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel: ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel: ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel: ? irqentry_exit+0x43/0x50
Sep 07 03:38:03 quasar kernel: entry_SYSCALL_64_after_hwframe+0x73/0x7b
Sep 07 03:38:03 quasar kernel: RIP: 0033:0x7ed949b55256
Sep 07 03:38:03 quasar kernel: Code: 7c 24 08 e8 6c 95 f8 ff 4c 8b 54 24 18 48 8b 74 24 10 41 b8 08 00 00 00 41 89 c1 48 8b 7c 24 08 4c 89>
Sep 07 03:38:03 quasar kernel: RSP: 002b:00007ffc0fd32c90 EFLAGS: 00000293 ORIG_RAX: 000000000000010f
Sep 07 03:38:03 quasar kernel: RAX: ffffffffffffffda RBX: 000063d00f351ce0 RCX: 00007ed949b55256
Sep 07 03:38:03 quasar kernel: RDX: 00007ffc0fd32cb0 RSI: 0000000000000010 RDI: 000063d0105775c0
Sep 07 03:38:03 quasar kernel: RBP: 00007ffc0fd32d1c R08: 0000000000000008 R09: 0000000000000000
Sep 07 03:38:03 quasar kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffc0fd32cb0
Sep 07 03:38:03 quasar kernel: R13: 000063d00f351ce0 R14: 000063d00e3cdee8 R15: 00007ffc0fd32d20
Sep 07 03:38:03 quasar kernel:
Sep 07 03:38:03 quasar kernel: Modules linked in: dm_snapshot tcp_diag inet_diag 8021q garp mrp veth ebtable_filter ebtables ip_set ip6tab>
Sep 07 03:38:03 quasar kernel: mt792x_lib i915 snd_intel_sdw_acpi polyval_generic mt76_connac_lib ghash_clmulni_intel snd_hda_codec mt76 >
Sep 07 03:38:03 quasar kernel: CR2: 00000000000359e0
Sep 07 03:38:03 quasar kernel: ---[ end trace 0000000000000000 ]---
Sep 07 03:38:03 quasar kernel: general protection fault, maybe for address 0x0: 0000 [#2] PREEMPT SMP NOPTI
Sep 07 03:38:03 quasar kernel: CPU: 18 PID: 1558222 Comm: .NET ThreadPool Tainted: P D O 6.8.4-2-pve #1
Sep 07 03:38:03 quasar kernel: Hardware name: Micro Computer (HK) Tech Limited Venus Series/AHWSA, BIOS AHWSA.1.22 03/12/2024
Sep 07 03:38:03 quasar kernel: RIP: 0010:futex_wait+0xc1/0x120
Sep 07 03:38:03 quasar kernel: Code: 48 8b 45 d0 65 48 2b 04 25 28 00 00 00 75 6d 48 83 c4 60 89 d0 5b 41 5c 41 5d 41 5e 41 5f 5d 31 d2 31>
Sep 07 03:38:03 quasar kernel: RSP: 0018:ffffb52fee48fe50 EFLAGS: 00010246
Sep 07 03:38:03 quasar kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Sep 07 03:38:03 quasar kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Sep 07 03:38:03 quasar kernel: RBP: 70cd60e5b6d5b6d5 R08: 0000000000000000 R09: 0000000000000000
Sep 07 03:38:03 quasar kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000089
Sep 07 03:38:03 quasar kernel: R13: 0000000000000009 R14: 0000000000000089 R15: e68ae68a19d919d9
Sep 07 03:38:03 quasar kernel: FS: 00007278e7400700(0000) GS:ffff8d44cfb00000(0000) knlGS:0000000000000000
Sep 07 03:38:03 quasar kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 07 03:38:03 quasar kernel: CR2: 000076e62b3e0380 CR3: 00000002a6874000 CR4: 0000000000f52ef0
Sep 07 03:38:03 quasar kernel: PKRU: 55555554
Sep 07 03:38:03 quasar kernel: Call Trace:
Sep 07 03:38:03 quasar kernel:
Sep 07 03:38:03 quasar kernel: ? show_regs+0x6d/0x80
Sep 07 03:38:03 quasar kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x276/0x2d0
Sep 07 03:38:03 quasar kernel: ? die_addr+0x37/0xa0
Sep 07 03:38:03 quasar kernel: Code: 90 49 8b 14 24 48 85 d2 74 f5 eb e7 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 c0 59 03 00>
Sep 07 03:38:03 quasar kernel: ? exc_general_protection+0x1db/0x480
Sep 07 03:38:03 quasar kernel: RSP: 0018:ffffb52fc33db998 EFLAGS: 00010002
Sep 07 03:38:03 quasar kernel: ? asm_exc_general_protection+0x27/0x30
Sep 07 03:38:03 quasar kernel: RAX: 00000000000359e0 RBX: ffff8d2d9719c1c8 RCX: 0000000000380000
Sep 07 03:38:03 quasar kernel: ? futex_wait+0xc1/0x120
Sep 07 03:38:03 quasar kernel: RDX: 00000000000020c5 RSI: 0000000083198319 RDI: ffff8d2d9719c1c8
Sep 07 03:38:03 quasar kernel: ? do_syscall_64+0x84/0x180
Sep 07 03:38:03 quasar kernel: RBP: ffffb52fc33db9b8 R08: 0000000000000000 R09: 0000000000000000
Sep 07 03:38:03 quasar kernel: ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d44cf8b59c0
Sep 07 03:38:03 quasar kernel: ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel: R13: 0000000000000000 R14: 000000000000000d R15: 0000000000000002
Sep 07 03:38:03 quasar kernel: ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel: FS: 00007ed9470ee4c0(0000) GS:ffff8d44cf880000(0000) knlGS:0000000000000000
Sep 07 03:38:03 quasar kernel: ? entry_SYSCALL_64_after_hwframe+0x73/0x7b
Sep 07 03:38:03 quasar kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 07 03:38:03 quasar kernel:
Sep 07 03:38:03 quasar kernel: CR2: 00000000000359e0 CR3: 00000001288d6000 CR4: 0000000000f52ef0
Sep 07 03:38:03 quasar kernel: Modules linked in:
Sep 07 03:38:03 quasar kernel: PKRU: 55555554
Sep 07 03:38:03 quasar kernel: dm_snapshot tcp_diag
Sep 07 03:38:03 quasar kernel: note: kvm[1984] exited with irqs disabled
Sep 07 03:38:03 quasar kernel: inet_diag
Sep 07 03:38:03 quasar kernel: note: kvm[1984] exited with preempt_count 1
Sep 07 03:38:03 quasar kernel: 8021q garp mrp veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables ipt>
Sep 07 03:38:03 quasar kernel: mt76_connac_lib ghash_clmulni_intel snd_hda_codec mt76 btusb sha256_ssse3 btrtl snd_hda_core sha1_ssse3 dr>
Sep 07 03:38:03 quasar kernel: ---[ end trace 0000000000000000 ]---
Sep 07 03:38:03 quasar kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x276/0x2d0
Sep 07 03:38:03 quasar kernel: Code: 90 49 8b 14 24 48 85 d2 74 f5 eb e7 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 c0 59 03 00>
Sep 07 03:38:03 quasar kernel: RSP: 0018:ffffb52fc33db998 EFLAGS: 00010002
Sep 07 03:38:03 quasar kernel: RAX: 00000000000359e0 RBX: ffff8d2d9719c1c8 RCX: 0000000000380000
Sep 07 03:38:03 quasar kernel: RDX: 00000000000020c5 RSI: 0000000083198319 RDI: ffff8d2d9719c1c8
Sep 07 03:38:03 quasar kernel: RBP: ffffb52fc33db9b8 R08: 0000000000000000 R09: 0000000000000000
Sep 07 03:38:03 quasar kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d44cf8b59c0
Sep 07 03:38:03 quasar kernel: R13: 0000000000000000 R14: 000000000000000d R15: 0000000000000002
Sep 07 03:38:03 quasar kernel: FS: 00007278e7400700(0000) GS:ffff8d44cfb00000(0000) knlGS:0000000000000000
Sep 07 03:38:03 quasar kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 07 03:38:03 quasar kernel: CR2: 000076e62b3e0380 CR3: 00000002a6874000 CR4: 0000000000f52ef0
Sep 07 03:38:03 quasar kernel: PKRU: 55555554
-- Boot c8ebc2afcd4343c697dad254aaad978d --
Sep 07 12:05:23 quasar kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) >
Sep 07 12:05:23 quasar kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=/dev/mapper/pve-root ro quiet
```

morey-tech commented 1 week ago

I've swapped the supported 2x 32GB DIMMs back in and ran upgrades on the host, which included an upgraded proxmox-kernel.

Upgrade Details

```
The following NEW packages will be installed:
  proxmox-kernel-6.8.12-1-pve-signed
The following packages will be upgraded:
  base-files bash bind9-dnsutils bind9-host bind9-libs ceph-common ceph-fuse
  curl distro-info-data gnutls-bin ifupdown2 initramfs-tools
  initramfs-tools-core krb5-locales less libarchive13 libc-bin libc-l10n
  libc6 libcephfs2 libcurl3-gnutls libcurl4 libfreetype6 libglib2.0-0
  libgnutls-dane0 libgnutls30 libgnutlsxx30 libgssapi-krb5-2
  libgstreamer-plugins-base1.0-0 libk5crypto3 libkrb5-3 libkrb5support0
  libnss-systemd libnvpair3linux libopeniscsiusr libpam-systemd
  libproxmox-acme-perl libproxmox-acme-plugins libpve-cluster-api-perl
  libpve-cluster-perl libpve-common-perl libpve-guest-common-perl
  libpve-notify-perl libpve-rs-perl libpve-storage-perl
  libpython3.11-minimal libpython3.11-stdlib libqt5core5a libqt5dbus5
  libqt5network5 librados2 libradosstriper1 librbd1 librgw2 libseccomp2
  libssl3 libsystemd-shared libsystemd0 libudev1 libuutil3linux
  libzfs4linux libzpool5linux locales nano open-iscsi openssh-client
  openssh-server openssh-sftp-server openssl postfix proxmox-backup-client
  proxmox-backup-file-restore proxmox-firewall proxmox-kernel-6.8
  proxmox-secure-boot-support proxmox-termproxy proxmox-widget-toolkit
  pve-cluster pve-container pve-docs pve-esxi-import-tools pve-firewall
  pve-firmware pve-ha-manager pve-manager pve-qemu-kvm
  python3-ceph-argparse python3-ceph-common python3-cephfs python3-idna
  python3-rados python3-rbd python3-rgw python3.11 python3.11-minimal
  qemu-server shim-helpers-amd64-signed shim-signed shim-signed-common
  shim-unsigned spl ssh systemd systemd-boot systemd-boot-efi systemd-sysv
  udev zfs-initramfs zfs-zed zfsutils-linux
```

morey-tech commented 1 week ago

Set up remote syslog to catch the kernel panic next time.

https://pve.proxmox.com/wiki/Kernel_Crash_Trace_Log

root@quasar:$ cat /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet netconsole=5555@192.168.1.30/enp2s0f0np0,5555@192.168.1.31/dc:a6:32:01:cf:0d loglevel=7"
morey-tech@raspberrypi:~ $ cat /etc/rsyslog.d/01-netconsole-collector.conf 
# Start UDP server on port 5555
$ModLoad imudp
$UDPServerRun 5555

# Define templates
$template NetconsoleFile,"/var/log/netconsole/%fromhost-ip%.log"
$template NetconsoleFormat,"%rawmsg%"

# Accept endline characters (unfortunately these options are global)
$EscapeControlCharactersOnReceive off
$DropTrailingLFOnReception off

# Store collected logs using templates without local ones
:fromhost-ip, !isequal, "127.0.0.1"     ?NetconsoleFile;NetconsoleFormat

# Discard logs match the rule above
& ~
morey-tech commented 1 week ago

With vmbr0, netconsole can't use the interface because the bridge doesn't exist yet when netconsole starts.

Sep 07 22:56:55 quasar kernel: i40e 0000:02:00.1 enp2s0f1np1: renamed from eth1
Sep 07 22:56:55 quasar kernel: i40e 0000:02:00.0 enp2s0f0np0: renamed from eth0
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: local port 5555
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: local IPv4 address 192.168.1.30
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: interface 'vmbr0'
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: remote port 5555
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: remote IPv4 address 192.168.1.31
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: remote ethernet address dc:a6:32:01:cf:0d
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: vmbr0 doesn't exist, aborting
Sep 07 22:56:55 quasar kernel: netconsole: cleaning up

Using enp2s0f0np0, it works until vmbr0 is set up; netconsole then stops logging because the interface joins the bridge.

Sep 07 22:59:07 quasar kernel: netpoll: netconsole: local port 5555
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: local IPv4 address 192.168.1.30
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: interface 'enp2s0f0np0'
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: remote port 5555
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: remote IPv4 address 192.168.1.31
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: remote ethernet address dc:a6:32:01:cf:0d
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: device enp2s0f0np0 not up yet, forcing it
Sep 07 22:59:07 quasar kernel: printk: legacy console [netcon0] enabled
Sep 07 22:59:07 quasar kernel: netconsole: network logging started
...
Sep 07 22:59:09 quasar kernel: vmbr0: port 1(enp2s0f0np0) entered blocking state
Sep 07 22:59:09 quasar kernel: vmbr0: port 1(enp2s0f0np0) entered disabled state
Sep 07 22:59:09 quasar kernel: netconsole: network logging stopped on interface enp2s0f0np0 as it is joining a master device
Sep 07 22:59:09 quasar kernel: i40e 0000:02:00.0 enp2s0f0np0: entered allmulticast mode
Sep 07 22:59:09 quasar kernel: i40e 0000:02:00.0 enp2s0f0np0: entered promiscuous mode
Sep 07 22:59:09 quasar kernel: vmbr0: port 1(enp2s0f0np0) entered blocking state
Sep 07 22:59:09 quasar kernel: i40e 0000:02:00.0: entering allmulti mode.
Sep 07 22:59:09 quasar kernel: vmbr0: port 1(enp2s0f0np0) entered forwarding state
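
Since netconsole stops as soon as its interface joins a master device, it may help to check whether a candidate interface is enslaved before pointing netconsole at it. A minimal sketch, assuming the sysfs layout in which enslaved interfaces have a `master` symlink; `has_master` is a hypothetical helper name:

```shell
# Succeeds if the interface is enslaved to a bridge or bond, based on
# the "master" symlink the kernel exposes under sysfs.
has_master() {
  [ -e "/sys/class/net/$1/master" ]
}

# Interface name taken from the log above.
dev=enp2s0f0np0
if has_master "$dev"; then
  echo "$dev has a master device; netconsole will stop logging on it"
else
  echo "$dev has no master device (or does not exist on this machine)"
fi
```
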
morey-tech commented 1 week ago

I can set it to vmbr0 after the system is running:

modprobe netconsole netconsole=5555@192.168.1.30/vmbr0,5555@192.168.1.31/dc:a6:32:01:cf:0d

FYI: 5555@192.168.1.30/ (excluding the interface name) defaults to eth0.
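
For reference, the parameter follows the pattern `<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>`. A small sketch that assembles it from its parts and prints the resulting command instead of running it; `netconsole_param` is a hypothetical helper, and the values are the ones used in this thread:

```shell
# Build a netconsole parameter string from its six parts:
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
netconsole_param() {
  printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

param=$(netconsole_param 5555 192.168.1.30 vmbr0 5555 192.168.1.31 dc:a6:32:01:cf:0d)

# Print the command only; loading the module needs root on the host.
echo "modprobe netconsole netconsole=$param"
```

Keeping the parts separate makes it harder to drop the interface name by accident, which would fall back to eth0 as noted above.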

To update the settings, first unload the module:

rmmod netconsole

If modprobe then fails with this error:

modprobe: ERROR: could not insert 'netconsole': No such device

Then load the module once without any parameters first:

modprobe netconsole

https://www.apalrd.net/posts/2024/pve_netconsole/

Logs are successfully being sent to the raspberrypi.

(Screenshot from 2024-09-08 08-53-19)

morey-tech commented 1 week ago

Set up a BMC interface with a static IP on enp89s0.

modprobe netconsole netconsole=5555@192.168.1.32/enp89s0,5555@192.168.1.31/dc:a6:32:01:cf:0d

Confirmed working (receiving logs on raspberrypi).

(Screenshot from 2024-09-08 09-00-53)

Updated grub defaults:

root@quasar:~# cat /etc/default/grub
# ...
GRUB_CMDLINE_LINUX_DEFAULT="quiet netconsole=5555@192.168.1.32/enp89s0,5555@192.168.1.31/dc:a6:32:01:cf:0d loglevel=7"

root@quasar:~# update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.12-1-pve
Found initrd image: /boot/initrd.img-6.8.12-1-pve
Found linux image: /boot/vmlinuz-6.8.4-2-pve
Found initrd image: /boot/initrd.img-6.8.4-2-pve
Found memtest86+ 64bit EFI image: /boot/memtest86+x64.efi
Adding boot menu entry for UEFI Firmware Settings ...
done

Shut down all the guests on the host, then triggered a kernel panic:

root@quasar:~# echo c > /proc/sysrq-trigger

The panic was received on raspberrypi:

morey-tech@raspberrypi:~ $ cat /var/log/netconsole/192.168.1.32.log
# ...
[36313.874812] sysrq: Trigger a crash
[36313.875632] Kernel panic - not syncing: sysrq triggered crash
[36313.876420] CPU: 8 PID: 283605 Comm: bash Tainted: P           O       6.8.12-1-pve #1
[36313.877204] Hardware name: Micro Computer (HK) Tech Limited Venus Series/AHWSA, BIOS AHWSA.1.22 03/12/2024
[36313.877903] Call Trace:

(Screenshot from 2024-09-08 09-04-57)

On boot, netconsole was configured correctly:

Sep 08 09:08:36 quasar kernel: netpoll: netconsole: local port 5555
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: local IPv4 address 192.168.1.32
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: interface 'enp89s0'
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: remote port 5555
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: remote IPv4 address 192.168.1.31
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: remote ethernet address dc:a6:32:01:cf:0d
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: device enp89s0 not up yet, forcing it
Sep 08 09:08:36 quasar kernel: igc 0000:59:00.0 enp89s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Sep 08 09:08:36 quasar kernel: printk: legacy console [netcon0] enabled
Sep 08 09:08:36 quasar kernel: netconsole: network logging started

After the reboot, logs start arriving on raspberrypi about 9 seconds into the kernel boot (first received timestamp 9.095761):

[36313.957987] ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
[    9.095761] vmbr0: port 1(enp2s0f0np0) entered blocking state
[    9.096704] vmbr0: port 1(enp2s0f0np0) entered forwarding state
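
The reboot shows up in the collector log as the kernel timestamp dropping from 36313.x back to 9.x. A sketch for finding such boundaries in a netconsole log, assuming every relevant line starts with a bracketed `[seconds.micros]` timestamp as in the excerpt above; `boot_boundaries` is a hypothetical helper name:

```shell
# Print each line whose leading [seconds] timestamp is lower than the
# previous timestamped line's, i.e. the first line received after a
# reboot. Lines without a leading timestamp are ignored.
boot_boundaries() {
  awk 'match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
         ts = substr($0, 2, RLENGTH - 2) + 0
         if (seen && ts < prev) print
         prev = ts
         seen = 1
       }'
}

# Assumed usage on the collector:
# boot_boundaries < /var/log/netconsole/192.168.1.32.log
```

The line just before each printed boundary is the last message that reached the collector before the host went down, which is where a real panic trace would end.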