chriexpe opened this issue 1 month ago
I have a similar issue: an all-ZFS Proxmox server that crashes while backing up a VM at least once a week. Backups are scheduled daily. I also ran memtest, which did not report any errors.
Journalctl logs from right before the hard crash:
root@gtr7pro:~# journalctl --since "2024-06-04 15:55" --until "2024-06-04 16:11"
Jun 04 15:58:54 gtr7pro pmxcfs[1295]: [status] notice: received log
Jun 04 15:58:54 gtr7pro pvedaemon[1438]: <hiveadmin@pam> starting task UPID:gtr7pro:00004D14:000AE00F:665F71FE:vzdump::hiveadmin@pam:
Jun 04 15:58:54 gtr7pro pvedaemon[19732]: INFO: starting new backup job: vzdump --mailnotification failure --mailto systems@example.com --compress zstd --prune-backups 'keep-last=3' --notes-template '{{guestname}}' --storage local --mode snapshot --all 1 --fleecing 0 --node gtr7pro
Jun 04 15:58:54 gtr7pro pvedaemon[19732]: INFO: Starting Backup of VM 102 (qemu)
Jun 04 15:58:57 gtr7pro pvedaemon[19732]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout
Jun 04 15:59:00 gtr7pro kernel: hrtimer: interrupt took 5490 ns
Jun 04 15:59:08 gtr7pro kernel: BUG: unable to handle page fault for address: 0000040000000430
Jun 04 15:59:08 gtr7pro kernel: #PF: supervisor read access in kernel mode
Jun 04 15:59:08 gtr7pro kernel: #PF: error_code(0x0000) - not-present page
root@gtr7pro:~#
zpool status reported errors for a file backing one of the VM's disks. I was able to reboot, keep the VM powered off, and have ZFS repair the pool.
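For anyone hitting the same thing, the repair sequence above looks roughly like this; the pool name `tank` is a placeholder for your actual pool, and the affected VM should stay powered off while it runs:

```shell
# Show pool health plus the specific files affected by data errors:
zpool status -v tank

# Re-read and verify all data, repairing from redundancy where possible:
zpool scrub tank

# After the scrub completes cleanly, reset the error counters:
zpool clear tank
```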
root@gtr7pro:~# zfs version
zfs-2.2.3-pve2
zfs-kmod-2.2.3-pve2
root@gtr7pro:~# uname -a
Linux gtr7pro 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 GNU/Linux
root@gtr7pro:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.2
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.3-1
proxmox-backup-file-restore: 3.2.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2
@chriexpe Review your ZFS ARC max memory configuration. By default it is set to half of system memory and competes with Linux's own page cache. Check out the comments on https://github.com/openzfs/zfs/issues/10255
I think this Proxmox article would also apply to Unraid: sysadmin_zfs_limit_memory_usage
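On Proxmox the ARC cap described above can be applied roughly like this; the 8 GiB figure is only an example value, so size it to your own workload and RAM:

```shell
# Cap the ZFS ARC at 8 GiB (example value; pick one that leaves room
# for your VMs). Takes effect immediately on a running system:
echo "$((8 * 1024 * 1024 * 1024))" > /sys/module/zfs/parameters/zfs_arc_max

# Persist the limit across reboots via a module option, then rebuild
# the initramfs so the setting is honored at early boot:
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u
```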
System information
Over the last few weeks I've been getting a lot of kernel panics from what appears to be a ZFS pool I created a year ago, and honestly I don't know what to do aside from destroying it and starting fresh (and losing a few TBs of data).
The pool in question is my server's main one, made up of 3x 8 TB Seagate Exos 7E8 HDDs connected to a RAID card in passthrough mode. The pool is constantly written to by an NVR, and the crashes appear to be random (or they coincidentally happen when I read/write a file after the pool has been running for quite a while). This is the error:
And it keeps repeating this same error.
If I check it with zpool status after the kernel panic, everything looks normal. Note that I already scrubbed the pool and it found no errors.
I don't remember exactly when it started, but it was probably after upgrading to Unraid 6.12.9, though I rolled back to 6.12.8 and it kept crashing.
I ran memtest too, but it didn't report any errors.