layer7gmbh opened this issue 4 years ago
could you describe your environment a little bit?
kvm / vm settings/config for example ?
what kvm version/management solution?
no corruption/data loss besides the partition-table loss?
what storage are the zvols located on?
what is meant by "when we allowed aggressive swapping"? mind that zfs needs some amount of memory and kvm is also very picky about available memory. running without swap / additional virtual memory may result in OOM when starting virtual guests, so keep an eye on not overcommitting memory too much on KVM/ZFS hosts.
oh - and you don't run swap on a ZVOL, do you? ( https://github.com/openzfs/zfs/wiki/FAQ#using-a-zvol-for-a-swap-device )
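( for reference, a quick generic way to double-check that no swap lives on a zvol -- just a sketch, nothing specific to this setup:

swapon --show
zfs list -t volume

i.e. none of the active swap devices should show up as a /dev/zd* or /dev/zvol/... device )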
Hi devZer0,
sorry, i somehow missed your message.
example libvirt definition:
<domain type='kvm'>
<name>test123</name>
<uuid>test123</uuid>
<memory unit='KiB'>4192000</memory>
<currentMemory unit='KiB'>4192000</currentMemory>
<blkiotune>
<device>
<path>/dev/zvol/kvm-storage/test123</path>
<read_iops_sec>500</read_iops_sec>
<write_iops_sec>500</write_iops_sec>
<read_bytes_sec>51200000</read_bytes_sec>
<write_bytes_sec>51200000</write_bytes_sec>
</device>
</blkiotune>
<vcpu placement='static'>2</vcpu>
<cputune>
<period>100000</period>
<quota>100000</quota>
</cputune>
<os>
<type arch='x86_64' machine='pc-q35-4.2'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
</features>
<cpu mode='host-model' check='partial'/>
<clock offset='utc'>
<timer name='rtc' tickpolicy='catchup'/>
<timer name='pit' tickpolicy='delay'/>
<timer name='hpet' present='no'/>
</clock>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<pm>
<suspend-to-mem enabled='no'/>
<suspend-to-disk enabled='no'/>
</pm>
<devices>
<emulator>/usr/bin/qemu-system-x86_64</emulator>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/zvol/kvm-storage/test123'/>
<target dev='sda' bus='sata'/>
<address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
<controller type='usb' index='0' model='qemu-xhci' ports='15'>
<address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
</controller>
<controller type='sata' index='0'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
</controller>
<controller type='pci' index='0' model='pcie-root'/>
<controller type='pci' index='1' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='1' port='0x10'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
</controller>
<controller type='pci' index='2' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='2' port='0x11'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
</controller>
<controller type='pci' index='3' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='3' port='0x12'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
</controller>
<controller type='pci' index='4' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='4' port='0x13'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
</controller>
<controller type='pci' index='5' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='5' port='0x14'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
</controller>
<controller type='pci' index='6' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='6' port='0x15'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
</controller>
<controller type='pci' index='7' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='7' port='0x16'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
</controller>
<controller type='pci' index='8' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='8' port='0x17'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
</controller>
<controller type='pci' index='9' model='pcie-to-pci-bridge'>
<model name='pcie-pci-bridge'/>
<address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</controller>
<controller type='virtio-serial' index='0'>
<address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
</controller>
<interface type='network'>
<mac address='00:50:56:ad:32:79'/>
<source network='Public Network'/>
<target dev='test123'/>
<model type='e1000'/>
<address type='pci' domain='0x0000' bus='0x09' slot='0x01' function='0x0'/>
</interface>
<serial type='pty'>
<target type='isa-serial' port='0'>
<model name='isa-serial'/>
</target>
</serial>
<console type='pty'>
<target type='serial' port='0'/>
</console>
<channel type='unix'>
<target type='virtio' name='org.qemu.guest_agent.0'/>
<address type='virtio-serial' controller='0' bus='0' port='1'/>
</channel>
<input type='tablet' bus='usb'>
<address type='usb' bus='0' port='1'/>
</input>
<input type='mouse' bus='ps2'/>
<input type='keyboard' bus='ps2'/>
<graphics type='vnc' port='-1' autoport='yes' listen='0.0.0.0' keymap='en-us' passwd='test123'>
<listen type='address' address='0.0.0.0'/>
</graphics>
<video>
<model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
</video>
<memballoon model='virtio'>
<address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
</memballoon>
<rng model='virtio'>
<backend model='random'>/dev/urandom</backend>
<address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</rng>
</devices>
</domain>
virsh -V
Virsh command line tool of libvirt 5.10.0
virsh -V
Virsh command line tool of libvirt 6.1.0
The libvirt version does not matter, and neither does the kernel ( tested 4.x LTS, 5.x LTS, latest 5.x ).
The distribution does not matter either ( happens on Arch, Fedora, Debian ).
There is no data corruption; just the partition table gets wiped:
parted /dev/zvol/kvm-storage/test123
GNU Parted 3.3
Using /dev/zd32
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Error: /dev/zd32: unrecognised disk label
Model: Unknown (unknown)
Disk /dev/zd32: 130GB
Sector size (logical/physical): 512B/8192B
Partition Table: unknown
Disk Flags:
(parted) quit
The zpool has cache and log devices ( but that does not matter either; it also happens on a single-disk-backed zpool with or without cache/log ):
# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
kvm-storage 10.9T 1.25T 9.62T - - 0% 11% 1.00x ONLINE -
raidz1 5.44T 783G 4.67T - - 0% 14.1% - ONLINE
pci-0000:82:00.0-scsi-0:1:23:0 - - - - - - - - ONLINE
pci-0000:82:00.0-scsi-0:1:24:0 - - - - - - - - ONLINE
pci-0000:82:00.0-scsi-0:1:26:0 - - - - - - - - ONLINE
raidz1 5.44T 500G 4.95T - - 0% 8.97% - ONLINE
pci-0000:82:00.0-scsi-0:1:27:0 - - - - - - - - ONLINE
pci-0000:82:00.0-scsi-0:1:28:0 - - - - - - - - ONLINE
pci-0000:82:00.0-scsi-0:1:29:0 - - - - - - - - ONLINE
logs - - - - - - - - -
mirror 74.5G 1.63M 74.5G - - 0% 0.00% - ONLINE
pci-0000:82:00.0-scsi-0:1:30:0-part1 - - - - - - - - ONLINE
pci-0000:82:00.0-scsi-0:1:31:0-part1 - - - - - - - - ONLINE
cache - - - - - - - - -
pci-0000:82:00.0-scsi-0:1:30:0-part2 372G 177G 196G - - 0% 47.5% - ONLINE
pci-0000:82:00.0-scsi-0:1:31:0-part2 372G 178G 195G - - 0% 47.7% - ONLINE
And no, we don't run swap on ZFS volumes; it's not that bad.
qemu-img convert loves to be the number one way to run into trouble. Here is a crash from today:
[19906.081791] INFO: task qemu-img:2206615 blocked for more than 122 seconds.
[19906.081795] Tainted: P OE 5.4.31-1-lts #1
[19906.081796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19906.081799] qemu-img D 0 2206615 3613123 0x00000080
[19906.081805] Call Trace:
[19906.081828] ? __schedule+0x2f3/0x750
[19906.081834] ? bit_wait_timeout+0x90/0x90
[19906.081838] schedule+0x39/0xa0
[19906.081844] io_schedule+0x12/0x40
[19906.081849] bit_wait_io+0xd/0x50
[19906.081854] __wait_on_bit+0x2a/0x90
[19906.081860] out_of_line_wait_on_bit+0x92/0xb0
[19906.081868] ? var_wake_function+0x20/0x20
[19906.081876] __block_write_begin_int+0x258/0x660
[19906.081885] ? blkdev_direct_IO+0x4d0/0x4d0
[19906.081890] ? blkdev_direct_IO+0x4d0/0x4d0
[19906.081894] block_write_begin+0x48/0xf0
[19906.081904] generic_perform_write+0xd1/0x1d0
[19906.081911] __generic_file_write_iter+0xf9/0x1d0
[19906.081917] blkdev_write_iter+0xd9/0x160
[19906.081926] new_sync_write+0x16a/0x210
[19906.081932] vfs_write+0xb6/0x1a0
[19906.081937] ksys_pwrite64+0x65/0xa0
[19906.081946] do_syscall_64+0x4e/0x140
[19906.081953] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[19906.081958] RIP: 0033:0x7f7add9d807f
[19906.081970] Code: Bad RIP value.
[19906.081971] RSP: 002b:00007f7adbffc100 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[19906.081975] RAX: ffffffffffffffda RBX: 00007f7ad6ebe000 RCX: 00007f7add9d807f
[19906.081977] RDX: 0000000000200000 RSI: 00007f7ad6ebe000 RDI: 000000000000000a
[19906.081978] RBP: 00007f7ad91d7790 R08: 0000000000000000 R09: 00000000ffffffff
[19906.081980] R10: 00000004c181ee00 R11: 0000000000000293 R12: 0000000000000000
[19906.081982] R13: 00007f7adcc52078 R14: 00007f7adcc8baf0 R15: 00007f7adcc52000
followed by more fun:
[20397.591743] INFO: task systemd-udevd:548 blocked for more than 122 seconds.
[20397.591748] Tainted: P OE 5.4.31-1-lts #1
[20397.591749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20397.591752] systemd-udevd D 0 548 1 0x00000180
[20397.591758] Call Trace:
[20397.591780] ? __schedule+0x2f3/0x750
[20397.591785] schedule+0x39/0xa0
[20397.591798] schedule_preempt_disabled+0xa/0x10
[20397.591802] __mutex_lock.isra.0+0x19b/0x500
[20397.591813] __blkdev_get+0x7f/0x550
[20397.591817] ? blkdev_get_by_dev+0x50/0x50
[20397.591824] do_dentry_open+0x13d/0x3a0
[20397.591829] path_openat+0x5d0/0x1650
[20397.591835] do_filp_open+0xab/0x120
[20397.591841] do_sys_open+0x19e/0x240
[20397.591851] do_syscall_64+0x4e/0x140
[20397.591857] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20397.591863] RIP: 0033:0x7f1476f4222b
[20397.591875] Code: Bad RIP value.
[20397.591877] RSP: 002b:00007fffd0a37da0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[20397.591880] RAX: ffffffffffffffda RBX: 00007fffd0a37eb0 RCX: 00007f1476f4222b
[20397.591882] RDX: 00000000000a0800 RSI: 000055eeaf6bec30 RDI: 00000000ffffff9c
[20397.591884] RBP: 000055eeaf6bec30 R08: 00007f1476dc35b0 R09: 0000000000000000
[20397.591885] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000a0800
[20397.591887] R13: 000055eeaf39c000 R14: 000055eeaf79e740 R15: 00007fffd0a37e78
[20520.471723] INFO: task systemd-udevd:548 blocked for more than 245 seconds.
[20520.471728] Tainted: P OE 5.4.31-1-lts #1
[20520.471729] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20520.471732] systemd-udevd D 0 548 1 0x00000184
[20520.471738] Call Trace:
[20520.471759] ? __schedule+0x2f3/0x750
[20520.471764] schedule+0x39/0xa0
[20520.471768] schedule_preempt_disabled+0xa/0x10
[20520.471772] __mutex_lock.isra.0+0x19b/0x500
[20520.471780] __blkdev_get+0x7f/0x550
[20520.471784] ? blkdev_get_by_dev+0x50/0x50
[20520.471790] do_dentry_open+0x13d/0x3a0
[20520.471795] path_openat+0x5d0/0x1650
[20520.471801] do_filp_open+0xab/0x120
[20520.471807] do_sys_open+0x19e/0x240
[20520.471815] do_syscall_64+0x4e/0x140
[20520.471820] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20520.471825] RIP: 0033:0x7f1476f4222b
[20520.471837] Code: Bad RIP value.
[20520.471839] RSP: 002b:00007fffd0a37da0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[20520.471843] RAX: ffffffffffffffda RBX: 00007fffd0a37eb0 RCX: 00007f1476f4222b
[20520.471845] RDX: 00000000000a0800 RSI: 000055eeaf6bec30 RDI: 00000000ffffff9c
[20520.471847] RBP: 000055eeaf6bec30 R08: 00007f1476dc35b0 R09: 0000000000000000
[20520.471848] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000a0800
[20520.471850] R13: 000055eeaf39c000 R14: 000055eeaf79e740 R15: 00007fffd0a37e78
and another qemu-img process, with another job:
[20643.361798] INFO: task qemu-img:226645 blocked for more than 122 seconds.
[20643.361803] Tainted: P OE 5.4.31-1-lts #1
[20643.361804] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20643.361807] qemu-img D 0 226645 3613123 0x00000080
[20643.361813] Call Trace:
[20643.361834] ? __schedule+0x2f3/0x750
[20643.361839] ? bit_wait_timeout+0x90/0x90
[20643.361842] schedule+0x39/0xa0
[20643.361846] io_schedule+0x12/0x40
[20643.361850] bit_wait_io+0xd/0x50
[20643.361853] __wait_on_bit+0x2a/0x90
[20643.361857] out_of_line_wait_on_bit+0x92/0xb0
[20643.361865] ? var_wake_function+0x20/0x20
[20643.361872] __block_write_begin_int+0x258/0x660
[20643.361878] ? blkdev_direct_IO+0x4d0/0x4d0
[20643.361882] ? blkdev_direct_IO+0x4d0/0x4d0
[20643.361885] block_write_begin+0x48/0xf0
[20643.361893] generic_perform_write+0xd1/0x1d0
[20643.361898] __generic_file_write_iter+0xf9/0x1d0
[20643.361902] blkdev_write_iter+0xd9/0x160
[20643.361911] new_sync_write+0x16a/0x210
[20643.361916] vfs_write+0xb6/0x1a0
[20643.361922] ksys_pwrite64+0x65/0xa0
[20643.361930] do_syscall_64+0x4e/0x140
[20643.361936] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20643.361942] RIP: 0033:0x7f7add9d807f
[20643.361954] Code: Bad RIP value.
[20643.361956] RSP: 002b:00007f7ad5fbb100 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[20643.361959] RAX: ffffffffffffffda RBX: 00007f7ad6ebe000 RCX: 00007f7add9d807f
[20643.361961] RDX: 0000000000200000 RSI: 00007f7ad6ebe000 RDI: 000000000000000a
[20643.361963] RBP: 00007f7ad91d7790 R08: 0000000000000000 R09: 00000000ffffffff
[20643.361964] R10: 0000000bab81d200 R11: 0000000000000293 R12: 0000000000000000
[20643.361966] R13: 00007f7adcc52078 R14: 00007f7adcc8b7e0 R15: 00007f7adcc52000
And here even qemu exploded with segfaults:
[31253.939829] GPT:Primary header thinks Alt. header is not at the end of the disk.
[31253.939833] GPT:104857599 != 106954751
[31253.939835] GPT:Alternate GPT header not at the end of the disk.
[31253.939836] GPT:104857599 != 106954751
[31253.939838] GPT: Use GNU Parted to correct GPT errors.
[31253.939862] zd160: p1 p2 p3 p4
[31552.474344] CPU 3/KVM[1202750]: segfault at 2 ip 0000559d22be5e50 sp 00007fd7b6875dd0 error 4 in qemu-system-x86_64[559d2299e000+524000]
[31552.474359] Code: 6b 00 5b c3 66 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 48 89 cd b9 a2 03 00 00 53 48 89 f3 48 8d 35 d3 a6 34 00 48 83 ec 78 <44> 0f b7 62 02 4c 8b 7a 08 4c 89 44 24 10 4c 8d 05 eb 75 34 00 45
[31552.474430] audit: type=1701 audit(1586633184.304:172): auid=4294967295 uid=65534 gid=992 ses=4294967295 pid=1202723 comm=43505520332F4B564D exe="/usr/bin/qemu-system-x86_64" sig=11 res=1
[31552.531843] audit: type=1130 audit(1586633184.364:173): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-3385535-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[31552.875387] audit: type=1131 audit(1586633184.704:174): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-3385535-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[31553.079592] device test123 left promiscuous mode
[31553.079637] audit: type=1700 audit(1586633184.904:175): dev=test123 prom=0 old_prom=256 auid=4294967295 uid=65534 gid=992 ses=4294967295
[31579.671578] INFO: task systemd-udevd:584669 blocked for more than 122 seconds.
[31579.671583] Tainted: P OE 5.4.31-1-lts #1
[31579.671584] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[31579.671587] systemd-udevd D 0 584669 1 0x00000184
[31579.671593] Call Trace:
[31579.671614] ? __schedule+0x2f3/0x750
[31579.671619] schedule+0x39/0xa0
[31579.671623] schedule_preempt_disabled+0xa/0x10
[31579.671628] __mutex_lock.isra.0+0x19b/0x500
[31579.671638] __blkdev_get+0x7f/0x550
[31579.671643] ? blkdev_get_by_dev+0x50/0x50
[31579.671649] do_dentry_open+0x13d/0x3a0
[31579.671654] path_openat+0x5d0/0x1650
[31579.671666] ? do_user_addr_fault+0x1e6/0x440
[31579.671669] do_filp_open+0xab/0x120
[31579.671676] do_sys_open+0x19e/0x240
[31579.671685] do_syscall_64+0x4e/0x140
[31579.671692] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[31579.671697] RIP: 0033:0x7fa20862b22b
[31579.671715] Code: Bad RIP value.
[31579.671717] RSP: 002b:00007ffd1aa14f60 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[31579.671720] RAX: ffffffffffffffda RBX: 00007ffd1aa15070 RCX: 00007fa20862b22b
[31579.671722] RDX: 00000000000a0800 RSI: 000055b7cc8bd500 RDI: 00000000ffffff9c
[31579.671724] RBP: 000055b7cc8bd500 R08: 00007fa2084ac5b0 R09: 0000000000000000
[31579.671726] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000a0800
[31579.671727] R13: 000055b7cc5c6000 R14: 000055b7cc8d0360 R15: 00007ffd1aa15038
the qemu-img command is:
qemu-img convert -p -O raw $qcow2_or_$vmdk /dev/zvol/$zpool/$zfsvolume
The volumes are created via:
zfs create -V $sizeG $zpool/$zfsvolume
Compression is on.
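For completeness, a minimal sketch of that provisioning/convert sequence ( pool, volume and image names here are placeholders, not the actual ones ):

# create the target zvol with compression enabled (size and names are examples)
zfs create -o compression=on -V 100G kvm-storage/test123
zfs get compression,volsize kvm-storage/test123
# write the source image into the zvol
qemu-img convert -p -O raw source-image.qcow2 /dev/zvol/kvm-storage/test123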
Inside of the guests, the party looks like this:
It would be really nice if someone could give a hint as to why this happens.
It's effectively reproducible. The probability is proportional to the I/O happening on the server.
Running 2x qemu-img at once is basically a 100% guarantee that the zvols will explode and the KVM/QEMU processes will die with a segfault or similar.
A simple:
will take
79 seconds
while running:
while nothing else is running on this server.
Output iotop:
Total DISK READ : 112.22 M/s | Total DISK WRITE : 151.38 M/s
Actual DISK READ: 112.22 M/s | Actual DISK WRITE: 67.04 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1466 be/4 root 602.00 K 0.00 B 0.00 % 81.16 % [txg_sync]
26982 be/4 root 14.46 G 14.65 G 0.00 % 59.11 % qemu-img convert -p -O raw test1-flat.vmdk /dev/zvol/kvm-storage/test1
453 be/4 root 0.00 B 0.00 B 0.00 % 21.14 % [kworker/u65:8-events_freezable_power_]
442 be/4 root 0.00 B 0.00 B 0.00 % 18.18 % [kworker/u65:4-events_power_efficient]
709 be/4 root 0.00 B 0.00 B 0.00 % 15.21 % [l2arc_feed]
444 be/4 root 0.00 B 0.00 B 0.00 % 12.27 % [kworker/u65:6-events_power_efficient]
7 be/4 root 0.00 B 0.00 B 0.00 % 9.21 % [kworker/u65:0-events_freezable_power_]
1154 be/0 root 0.00 B 0.00 B 0.00 % 8.38 % [z_null_iss]
536113 be/4 root 0.00 B 0.00 B 0.00 % 13.29 % [kworker/u65:1-events_freezable_power_]
1155 be/0 root 0.00 B 0.00 B 0.00 % 4.36 % [z_null_int]
428 be/4 root 0.00 B 0.00 B 0.00 % 3.47 % [kworker/u65:2-events_unbound]
1191 be/0 root 0.00 B 0.00 B 0.00 % 0.05 % [z_wr_int]
1195 be/0 root 0.00 B 0.00 B 0.00 % 0.04 % [z_wr_int]
1197 be/0 root 0.00 B 0.00 B 0.00 % 0.04 % [z_wr_int]
1194 be/0 root 0.00 B 0.00 B 0.00 % 0.04 % [z_wr_int]
1190 be/0 root 0.00 B 0.00 B 0.00 % 0.03 % [z_wr_int]
[...]
top will look like:
top - 10:58:02 up 11 min, 4 users, load average: 42.91, 24.19, 10.25
Tasks: 527 total, 2 running, 525 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 8.9 sy, 0.0 ni, 70.2 id, 19.8 wa, 0.4 hi, 0.5 si, 0.0 st
MiB Mem : 193407.7 total, 102309.5 free, 23659.5 used, 67438.7 buff/cache
MiB Swap: 92649.0 total, 92649.0 free, 0.0 used. 168320.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26982 root 20 0 252192 34596 7540 S 29.5 0.0 1:18.38 qemu-img
[...]
A load of 40 on a 32-core CPU with a single zvol doing I/O?
So all in all it seems to me that the ZFS scheduler leaves zvols starved of I/O time for minutes, leading to these situations.
How can I force ZFS to give at least a minimum of I/O time to everything?
does this happen with zvols only? how does it behave with an ordinary file, e.g. if you convert the vmdk to a zfs-backed file?
also have a look at this one: https://github.com/openzfs/zfs/issues/7631
and maybe also this one: https://bugzilla.proxmox.com/show_bug.cgi?id=1453
oh, and what about this one ?
This is:
HDD -> Zvol ( same process blocking multiple times )
[ 3194.406737] INFO: task qemu-img:27318 blocked for more than 245 seconds.
[ 3194.406741] Tainted: P OE 5.4.31-1-lts #1
[ 3194.406742] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3194.406744] qemu-img D 0 27318 6011 0x00004080
[ 3194.406750] Call Trace:
[ 3194.406765] ? __schedule+0x2f3/0x750
[ 3194.406770] ? bit_wait_timeout+0x90/0x90
[ 3194.406773] schedule+0x39/0xa0
[ 3194.406777] io_schedule+0x12/0x40
[ 3194.406780] bit_wait_io+0xd/0x50
[ 3194.406783] __wait_on_bit+0x2a/0x90
[ 3194.406787] out_of_line_wait_on_bit+0x92/0xb0
[ 3194.406793] ? var_wake_function+0x20/0x20
[ 3194.406799] __block_write_begin_int+0x258/0x660
[ 3194.406805] ? blkdev_direct_IO+0x4d0/0x4d0
[ 3194.406809] ? blkdev_direct_IO+0x4d0/0x4d0
[ 3194.406812] block_write_begin+0x48/0xf0
[ 3194.406818] generic_perform_write+0xd1/0x1d0
[ 3194.406823] __generic_file_write_iter+0xf9/0x1d0
[ 3194.406827] blkdev_write_iter+0xd9/0x160
[ 3194.406834] new_sync_write+0x16a/0x210
[ 3194.406840] vfs_write+0xb6/0x1a0
[ 3194.406844] ksys_pwrite64+0x65/0xa0
[ 3194.406851] do_syscall_64+0x4e/0x140
[ 3194.406857] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3194.406862] RIP: 0033:0x7fe7fea0a07f
[ 3194.406871] Code: Bad RIP value.
[ 3194.406873] RSP: 002b:00007fe7f2ffc100 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[ 3194.406876] RAX: ffffffffffffffda RBX: 00007fe7ef9ff000 RCX: 00007fe7fea0a07f
[ 3194.406878] RDX: 0000000000fffe00 RSI: 00007fe7ef9ff000 RDI: 000000000000000a
[ 3194.406880] RBP: 00007fe7fb0e66d0 R08: 0000000000000000 R09: 00000000ffffffff
[ 3194.406881] R10: 0000001dc5821000 R11: 0000000000000293 R12: 0000000000000000
[ 3194.406883] R13: 00007fe7fdc52078 R14: 00007fe7fdc8b8c0 R15: 00007fe7fdc52000
[ 3440.156743] INFO: task qemu-img:27318 blocked for more than 491 seconds.
[ 3440.156747] Tainted: P OE 5.4.31-1-lts #1
[ 3440.156748] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3440.156750] qemu-img D 0 27318 6011 0x00004080
[ 3440.156755] Call Trace:
[ 3440.156771] ? __schedule+0x2f3/0x750
[ 3440.156776] ? bit_wait_timeout+0x90/0x90
[ 3440.156779] schedule+0x39/0xa0
[ 3440.156783] io_schedule+0x12/0x40
[ 3440.156787] bit_wait_io+0xd/0x50
[ 3440.156790] __wait_on_bit+0x2a/0x90
[ 3440.156794] out_of_line_wait_on_bit+0x92/0xb0
[ 3440.156800] ? var_wake_function+0x20/0x20
[ 3440.156806] __block_write_begin_int+0x258/0x660
[ 3440.156811] ? blkdev_direct_IO+0x4d0/0x4d0
[ 3440.156816] ? blkdev_direct_IO+0x4d0/0x4d0
[ 3440.156819] block_write_begin+0x48/0xf0
[ 3440.156825] generic_perform_write+0xd1/0x1d0
[ 3440.156830] __generic_file_write_iter+0xf9/0x1d0
[ 3440.156834] blkdev_write_iter+0xd9/0x160
[ 3440.156840] new_sync_write+0x16a/0x210
[ 3440.156846] vfs_write+0xb6/0x1a0
[ 3440.156851] ksys_pwrite64+0x65/0xa0
[ 3440.156857] do_syscall_64+0x4e/0x140
[ 3440.156863] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3440.156868] RIP: 0033:0x7fe7fea0a07f
[ 3440.156877] Code: Bad RIP value.
[ 3440.156879] RSP: 002b:00007fe7f2ffc100 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[ 3440.156882] RAX: ffffffffffffffda RBX: 00007fe7ef9ff000 RCX: 00007fe7fea0a07f
[ 3440.156884] RDX: 0000000000fffe00 RSI: 00007fe7ef9ff000 RDI: 000000000000000a
[ 3440.156886] RBP: 00007fe7fb0e66d0 R08: 0000000000000000 R09: 00000000ffffffff
[ 3440.156887] R10: 0000001dc5821000 R11: 0000000000000293 R12: 0000000000000000
[ 3440.156889] R13: 00007fe7fdc52078 R14: 00007fe7fdc8b8c0 R15: 00007fe7fdc52000
Happened maybe 4-5 times.
With the same conversion going to the zpool mountpoint /kvm-storage instead, i was not able to reproduce the issue.
So maybe it's indeed something that affects zvols only.
@devZer0 THANK YOU very much for putting in so much effort checking other sources for information about this issue!
I read through all of them, and i will now create an /etc/modprobe.d/zfs.conf with:
zvol_threads=8 zvol_request_sync=1
And restart/continue my testing now.
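For reference, a minimal sketch of what that modprobe config would look like ( assuming the stock parameter names; these only take effect when the zfs module is (re)loaded, so a reboot or module reload is needed ):

# /etc/modprobe.d/zfs.conf
options zfs zvol_threads=8 zvol_request_sync=1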
The good news:
It feels like something changed ( i had to start 2x qemu-img convert + 1x file transfer to the mounted /kvm-storage zfs dataset ) -- before, 1-2x qemu-img convert were already enough. So maybe it got a bit more resilient...
The bad news:
815650 be/4 root 2.31 G 1968.23 M 0.00 % 25.42 % qemu-img convert -p -O raw test1.vmdk /dev/zvol/kvm-storage/test1
1340209 be/4 root 9.53 G 9.39 G 0.00 % 25.31 % qemu-img convert -p -O raw test2.vmdk /dev/zvol/kvm-storage/test2
After 2-3 minutes:
815650 be/4 root 2.31 G 1968.23 M 0.00 % 21.59 % qemu-img convert -p -O raw test1 /dev/zvol/kvm-storage/test1
1340209 be/4 root 9.53 G 9.39 G 0.00 % 19.65 % qemu-img convert -p -O raw test2/dev/zvol/kvm-storage/test2
... nothing moved.
And a VM looks like this:
BUT so far, no kernel dumps =)
So things are far from being OK.
ZFS simply cuts off I/O time to zvols for too long. I have to find a way to force ZFS to give at least a bit of I/O time to everyone, and not just let something sit at ZERO I/O time for multiple minutes.
mhh, weird.
i would try triggering that problem with a mostly incompressible and consistent "write only" load to see if it makes a difference. maybe qemu-img is doing something special...
maybe "shred /dev/zvol/kvm-storage/test1" or something like that.... (or disable compression and use dd if=/dev/zero of=/dev/zvol/kvm-storage/test1 bs=1024k... )
anyhow, it indeed looks like some scheduling/starvation issue....
Ok,
i am not able to kill a single zvol's I/O ( fs check of a qemu VM ) with 3x shred on zvols.
But with a single qemu-img it will report straight-up frozen I/O ... failed command: READ FPDMA QUEUED ... status: { DRDY } and all the rest. -- Interesting to mention that it seems
1x qemu-img is going to kill things, while
1x qemu-img AND 3x shred at the same time seems not to kill things even after minutes... maybe that qemu-img gets so little read I/O because of the shreds that it never gets the chance to block things.
So qemu-img does a pretty good job of triggering this ( whatever it actually does ).
I will start now trying my luck with
i can easily reproduce your issue on my recent proxmox system with latest zfs 0.8.3
while issuing the following command on the proxmox host (filling the zvol of an offline vm):
dd if=/dev/urandom of=/dev/zvol/hddpool/vms/vm-111-disk-0 bs=1024k
shortly after, inside a kvm vm, the following command shows no read progress anymore:
while true;do dd if=/dev/sda bs=1024k |pv >/dev/null;done
in the end, inside the vm i'm getting:
[sda] abort
i.e. i/o completely got stuck
what makes things even worse:
the vm is not using a zvol on hddpool but on ssdpool, so apparently it's about i/o starvation in the kernel/zfs layer and not in the disk layer....
i would call this a serious issue (which is not about qemu-img at all).
not sure if kvm or zfs is to blame here...
please, can you have a look @behlendorf ?
yes, it does not matter what OS or version.
i tested here fedora 31/30, centos 7 and archlinux with versions between 0.7.x and the latest 0.8.3 on different hardware setups ( raid controller, hba, native, jbod ).
So far it's more and more safe to say:
qemu-img onto datasets works without any issues. qemu-img onto zvols will most probably cause issues ( unless you only have tiny amounts of data, 0-10 GB images ).
I tried these settings in different combinations and with different values ( a generic sketch of how such values can be applied follows the list ):
zfs_vdev_sync_write_max_active=5
zfs_vdev_sync_write_min_active=2
zfs_vdev_sync_read_min_active=20
zfs_vdev_sync_read_max_active=30
zfs_vdev_async_write_max_active=5
zfs_vdev_async_write_min_active=2
zfs_vdev_async_read_min_active=20
zfs_vdev_async_read_max_active=30
zvol_request_sync=1
zfs_vdev_async_write_active_max_dirty_percent=5
zfs_vdev_async_write_active_min_dirty_percent=1
as well as:
vm.dirty_ratio=5
vm.dirty_background_ratio=1
vm.vfs_cache_pressure=500
vm.dirtytime_expire_seconds=20
vm.swappiness=0
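As referenced above, a generic sketch of how such values can be applied at runtime for testing ( the values shown are examples taken from the list, not recommendations; not every ZFS module parameter is writable on every version ):

# ZFS module parameters via sysfs; persist them with "options zfs ..." lines in /etc/modprobe.d/zfs.conf
echo 1 > /sys/module/zfs/parameters/zvol_request_sync
echo 5 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
# kernel VM sysctls; persist them via /etc/sysctl.d/
sysctl -w vm.dirty_ratio=5
sysctl -w vm.dirty_background_ratio=1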
Sometimes the error comes earlier, sometimes it ( seems to ) come later. There is no 100% fail-safe way to reproduce this: sometimes it works, most times not ( and when it fails, it fails sometimes earlier and sometimes later ).
So far the workaround:
Don't use qemu-img with zvols.
ionice won't help either, by the way.
i have found with further testing that this seems completely unrelated to kvm.
i also get completely stalled reads from a zvol on ssd when doing a large streaming write to a zvol on hdd, i.e.:
dd if=/dev/urandom bs=1024k |pv > /dev/zvol/hddpool/zvol1
makes
dd if=/dev/zvol/ssdpool/zvol1 bs=1024k |pv >/dev/null
hang completely
furthermore, after about 4-5gb of writes, performance drops significantly, by at least 3/4 (from >120mb/s to <30mb/s) - even without reading in parallel.
yes, that's indeed another issue.
But honestly i am already quite happy i found the cause of these segfaults and qemu-kvm crashes.
But you are right, zfs's zvol code, at least here on linux, seems to have quite some room for improvement.
With qemu-img something happens to the zvol that makes things go downhill, while with dd it does not lead to crashes ( though the performance might still feel like my 3.5" floppy ).
With qemu-img something happens to the zvol that makes things go downhill, while with dd it does not lead to crashes
i think this is just a matter of difference in write performance/volume.
try disabling compression on the zvol and try writing with a larger blocksize (dd if=/dev/zero bs=1024k....) to it, i bet you will see the same issue as with qemu-img
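( that test spelled out as a quick sketch, with the zvol name as a placeholder:

zfs set compression=off kvm-storage/test1
dd if=/dev/zero bs=1024k | pv > /dev/zvol/kvm-storage/test1

)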
btw, i can NOT reproduce the issue with blocksize=1k, as this won't fully saturate i/o - but with bs=8k i can easily trigger it.
btw, on my system ashift for hddpool is at 9 and for ssdpool is at 12.
yeah, well unfortunately you are right; if the I/O just runs long enough,
dd if=raw of=zvol bs=4k conv=fdatasync,sparse
things will go downhill with 500 GB of raw data.
During the dd, two random KVM guests died....
[177814.217766] CPU 1/KVM[3965349]: segfault at 2 ip 000055dfa44a7e50 sp 00007fe18f1e5dd0 error 4 in qemu-system-x86_64[55dfa4260000+524000]
[177814.217777] Code: 6b 00 5b c3 66 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 48 89 cd b9 a2 03 00 00 53 48 89 f3 48 8d 35 d3 a6 34 00 48 83 ec 78 <44> 0f b7 62 02 4c 8b 7a 08 4c 89 44 24 10 4c 8d 05 eb 75 34 00 45
[177854.346110] CPU 1/KVM[263273]: segfault at 2 ip 0000558c0b258e50 sp 00007fcc10d67dd0 error 4 in qemu-system-x86_64[558c0b011000+524000]
[177854.346125] Code: 6b 00 5b c3 66 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 48 89 cd b9 a2 03 00 00 53 48 89 f3 48 8d 35 d3 a6 34 00 48 83 ec 78 <44> 0f b7 62 02 4c 8b 7a 08 4c 89 44 24 10 4c 8d 05 eb 75 34 00 45
So we can consider zvols close to unusable for really reliable stuff. Back to LVM i guess... super sad.
here we go.
unfortunately, i don't see significant improvement.
compared to a file or lvm, writes to the zvol are slow, while reads starve severely.
Test System:
platform: kvm/proxmox virtual machine backed by 500gb vdisk/raw file on zfs with lz4 compression.
hardware: fujitsu rx300 s6 / 1x xeon L5630
guest os distro: CentOS Linux release 7.7.1908 (Core)
guest os kernel : 3.10.0-1062.18.1.el7.x86_64
guest os zpool: no raid or anything, just simple testpool on 500gb vdisk ( /dev/sdc )
test for reading:
# while true;do dd if=$path1 bs=1024k |pv -i60 >/dev/null ;done
test for writing:
# while true;do dd if=/dev/zero bs=1024k |pv -i60 >$path2 ;done
(start test in 2 terminal windows concurrently)
read      path1                      write    path2                      backend
287MB/s   /dev/vg/lv1                360MB/s  /dev/vg/lv2                lvm logical volume
950kB/s   /dev/zvol/testpool/zvol1   40MB/s   /dev/zvol/testpool/zvol2   zfs v0.8.3-1 zvol w/o compr
240MB/s   /testpool/file1            230MB/s  /testpool/file2            zfs v0.8.3-1 file w/o compr
1,2MB/s   /dev/zvol/testpool/zvol1   55MB/s   /dev/zvol/testpool/zvol2   zfs v0.8.0-728_ga7929f3 zvol w/o compr
230MB/s   /testpool/file1            230MB/s  /testpool/file2            zfs v0.8.0-728_ga7929f3 file w/o compr
Hi guys, I'm not sure if this is related but I believe I'm in the same situation. I have the latest versions:
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
zfs-0.8.4-pve1
zfs-kmod-0.8.4-pve1
I'm getting crashes only after high load. It usually hangs the VMs entirely, along with the zpool's traffic. I can replicate this by just installing Proxmox, making a ZFS mirror pool and one VM; after installing any operating system normally, copying a file works fine until you copy it a few times and the transfer freezes, dropping from high speeds to 0 bytes.
The host has 128GB RAM and a Threadripper, so I don't think there is any memory or CPU limitation involved.
I've found out this crash in syslog:
Jun 10 02:18:48 pve kernel: INFO: task txg_sync:61435 blocked for more than 120 seconds.
Jun 10 02:18:48 pve kernel: Tainted: P OE 5.4.41-1-pve #1
Jun 10 02:18:48 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 02:18:48 pve kernel: txg_sync D 0 61435 2 0x80004000
Jun 10 02:18:48 pve kernel: Call Trace:
Jun 10 02:18:48 pve kernel: schedule+0x2e6/0x6f0
Jun 10 02:18:48 pve kernel: schedule+0x33/0xa0
Jun 10 02:18:48 pve kernel: schedule_timeout+0x152/0x300
Jun 10 02:18:48 pve kernel: ? next_timer_interrupt+0xd0/0xd0
Jun 10 02:18:48 pve kernel: io_schedule_timeout+0x1e/0x50
Jun 10 02:18:48 pve kernel: cv_timedwait_common+0x12f/0x170 [spl]
Jun 10 02:18:48 pve kernel: ? wait_woken+0x80/0x80
Jun 10 02:18:48 pve kernel: cv_timedwait_io+0x19/0x20 [spl]
Jun 10 02:18:48 pve kernel: zio_wait+0x130/0x270 [zfs]
Jun 10 02:18:48 pve kernel: vdev_uberblock_sync_list+0x91/0x130 [zfs]
Jun 10 02:18:48 pve kernel: vdev_config_sync+0x157/0x240 [zfs]
Jun 10 02:18:48 pve kernel: spa_sync+0xbd8/0xfa0 [zfs]
Jun 10 02:18:48 pve kernel: txg_sync_thread+0x2d9/0x4c0 [zfs]
Jun 10 02:18:48 pve kernel: ? txg_thread_exit.isra.12+0x60/0x60 [zfs]
Jun 10 02:18:48 pve kernel: thread_generic_wrapper+0x74/0x90 [spl]
Jun 10 02:18:48 pve kernel: kthread+0x120/0x140
Jun 10 02:18:48 pve kernel: ? thread_exit+0x20/0x20 [spl]
Jun 10 02:18:48 pve kernel: ? kthread_park+0x90/0x90
Jun 10 02:18:48 pve kernel: ret_from_fork+0x22/0x40
Jun 10 02:18:48 pve kernel: INFO: task kvm:90079 blocked for more than 120 seconds.
Jun 10 02:18:48 pve kernel: Tainted: P OE 5.4.41-1-pve #1
Jun 10 02:18:48 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 02:18:48 pve kernel: kvm D 0 90079 1 0x00004000
Jun 10 02:18:48 pve kernel: Call Trace:
Jun 10 02:18:48 pve kernel: __schedule+0x2e6/0x6f0
Jun 10 02:18:48 pve kernel: schedule+0x33/0xa0
Jun 10 02:18:48 pve kernel: cv_wait_common+0x104/0x130 [spl]
Jun 10 02:18:48 pve kernel: ? wait_woken+0x80/0x80
Jun 10 02:18:48 pve kernel: cv_wait+0x15/0x20 [spl]
Jun 10 02:18:48 pve kernel: zil_commit_impl+0x241/0xd90 [zfs]
Jun 10 02:18:48 pve kernel: zil_commit+0x3d/0x60 [zfs]
Jun 10 02:18:48 pve kernel: zvol_request+0x233/0x3c0 [zfs]
Jun 10 02:18:48 pve kernel: generic_make_request+0xcf/0x310
Jun 10 02:18:48 pve kernel: submit_bio+0x46/0x1c0
Jun 10 02:18:48 pve kernel: ? radix_tree_lookup+0xd/0x10
Jun 10 02:18:48 pve kernel: ? blkg_lookup_slowpath+0x27/0x50
Jun 10 02:18:48 pve kernel: submit_bio_wait+0x59/0x90
Jun 10 02:18:48 pve kernel: blkdev_issue_flush+0x8e/0xc0
Jun 10 02:18:48 pve kernel: blkdev_fsync+0x35/0x50
Jun 10 02:18:48 pve kernel: vfs_fsync_range+0x48/0x80
Jun 10 02:18:48 pve kernel: ? fget_light+0x59/0x70
Jun 10 02:18:48 pve kernel: do_fsync+0x3d/0x70
Jun 10 02:18:48 pve kernel: x64_sys_fdatasync+0x17/0x20
Jun 10 02:18:48 pve kernel: do_syscall_64+0x57/0x190
Jun 10 02:18:48 pve kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 10 02:18:48 pve kernel: RIP: 0033:0x7ff1985e02e7
Jun 10 02:18:48 pve kernel: Code: Bad RIP value.
Jun 10 02:18:48 pve kernel: RSP: 002b:00007fef758324c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jun 10 02:18:48 pve kernel: RAX: ffffffffffffffda RBX: 0000000000000011 RCX: 00007ff1985e02e7
Jun 10 02:18:48 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000011
Jun 10 02:18:48 pve kernel: RBP: 00007ff18b66ca10 R08: 0000000000000000 R09: 00000000ffffffff
Jun 10 02:18:48 pve kernel: R10: 00007fef758324b0 R11: 0000000000000293 R12: 000055f07aa18e12
Jun 10 02:18:48 pve kernel: R13: 00007ff18b66ca78 R14: 00007ff18b0dd180 R15: 00007ff18b139010
Jun 10 02:18:48 pve kernel: INFO: task kvm:60453 blocked for more than 120 seconds.
Jun 10 02:18:48 pve kernel: Tainted: P OE 5.4.41-1-pve #1
Jun 10 02:18:48 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 02:18:48 pve kernel: kvm D 0 60453 1 0x00004000
Jun 10 02:18:48 pve kernel: Call Trace:
Jun 10 02:18:48 pve kernel: schedule+0x2e6/0x6f0
Jun 10 02:18:48 pve kernel: schedule+0x33/0xa0
Jun 10 02:18:48 pve kernel: cv_wait_common+0x104/0x130 [spl]
Jun 10 02:18:48 pve kernel: ? wait_woken+0x80/0x80
Jun 10 02:18:48 pve kernel: __cv_wait+0x15/0x20 [spl]
Jun 10 02:18:48 pve kernel: zil_commit_impl+0x241/0xd90 [zfs]
Jun 10 02:18:48 pve kernel: zil_commit+0x3d/0x60 [zfs]
Jun 10 02:18:48 pve kernel: zvol_request+0x233/0x3c0 [zfs]
Jun 10 02:18:48 pve kernel: generic_make_request+0xcf/0x310
Jun 10 02:18:48 pve kernel: submit_bio+0x46/0x1c0
Jun 10 02:18:48 pve kernel: ? radix_tree_lookup+0xd/0x10
Jun 10 02:18:48 pve kernel: ? blkg_lookup_slowpath+0x27/0x50
Jun 10 02:18:48 pve kernel: submit_bio_wait+0x59/0x90
Jun 10 02:18:48 pve kernel: blkdev_issue_flush+0x8e/0xc0
Jun 10 02:18:48 pve kernel: blkdev_fsync+0x35/0x50
Jun 10 02:18:48 pve kernel: vfs_fsync_range+0x48/0x80
Jun 10 02:18:48 pve kernel: ? fget_light+0x59/0x70
Jun 10 02:18:48 pve kernel: do_fsync+0x3d/0x70
Jun 10 02:18:48 pve kernel: __x64_sys_fdatasync+0x17/0x20
Jun 10 02:18:48 pve kernel: do_syscall_64+0x57/0x190
Jun 10 02:18:48 pve kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 10 02:18:48 pve kernel: RIP: 0033:0x7f035929a2e7
Jun 10 02:18:48 pve kernel: Code: Bad RIP value.
Jun 10 02:18:48 pve kernel: RSP: 002b:00007f034b5794c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jun 10 02:18:48 pve kernel: RAX: ffffffffffffffda RBX: 0000000000000011 RCX: 00007f035929a2e7
Jun 10 02:18:48 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000011
Jun 10 02:18:48 pve kernel: RBP: 00007f034c46ca10 R08: 0000000000000000 R09: 00000000ffffffff
Jun 10 02:18:48 pve kernel: R10: 00007f034b5794b0 R11: 0000000000000293 R12: 000055fd70597e12
Jun 10 02:18:48 pve kernel: R13: 00007f034c46ca78 R14: 00007f034bedce00 R15: 00007f034beac890
FWIW, just in case it inspires anyone's thinking on this:
On a Slackware box with XFS (not ZFS!), under the heaviest of I/O (e.g. multiple rsyncs running), I occasionally see the same KVM...segfault at 2...error 4 failure.
And when it happens, it eats the partition table for the VM's boot "disk" so regularly that I have a dd-based script standing by to replace the first "sector" of the raw disk image file.
I don't, however, see any corruption of the host XFS filesystem.
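For what it's worth, a minimal sketch of that kind of sector-0 backup/restore with dd ( hypothetical paths, not the actual script ):

# while the image is known-good, keep a copy of its first 512-byte sector
dd if=/path/to/vm-boot.raw of=/path/to/vm-boot.sector0 bs=512 count=1
# after the partition table gets eaten, put it back without truncating the image
dd if=/path/to/vm-boot.sector0 of=/path/to/vm-boot.raw bs=512 count=1 conv=notrunc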
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
@layer7gmbh , what type of kvm host is this? it seems it's NOT proxmox based, but most (if not all, besides this one) of the reports i see on lost partition tables are in the proxmox forum/community.
is vzdump involved in the process of losing the partition table?
anyhow, i think it's not a zfs issue but a kvm/qemu issue: https://bugzilla.proxmox.com/show_bug.cgi?id=2874
@layer7gmbh , can you provide some feedback so we can decide if this needs to be left open ?
regarding hang of virtual machine there is some interesting findings at https://bugzilla.proxmox.com/show_bug.cgi?id=1453 , but not sure if it's relevant here
@layer7gmbh , does this problem still exist ?
@layer7gmbh , you seem to use a sata disk in the virtual machine configuration.
can you confirm that losing the partition table only happened in virtual machines with a virtual sata disk?
apparently, it seems there is an issue with that in qemu: https://bugzilla.proxmox.com/show_bug.cgi?id=2874#c51
@devZer0
Probably this kind of issue was not ( only ) related to SATA. Also, the qemu post you linked talks about unclean shutdown situations where partition tables might get lost. That was not the case here.
Unfortunately i am not able to tell you much more. With our current setup:
Kernel 5.15.83-1-pve zfs-2.1.7-pve2 zfs-kmod-2.1.7-pve1
it seems we don't run into this kind of issue anymore.
the sector 0 corruption bug is found and being fixed (related to sata/ide & backup interruption), see https://bugzilla.proxmox.com/show_bug.cgi?id=2874#c51
did it happen again for you?
if not, i think we can close this issue !? would you be so kind @layer7gmbh ?
zfs-0.8.3-1
zfs-kmod-0.8.2-1
Kernel Module: 0.8.3-1
SPL: 0.8.3-1
Kernel: 5.3.11-300.fc31.x86_64
We sometimes run into this issue here:
We have seen it on other KVM host machines running on ZFS too.
It seems the more load there is on the ZFS side, the higher the chance of this happening.
Afterwards, the zvol has lost its partition table. It can be recreated easily using something like testdisk.
In addition you have to reinstall the OS bootloader to get the system booting again.
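( a rough sketch of that recovery path, with placeholder device names -- testdisk itself is interactive, and the bootloader commands depend on the guest distro:

# let testdisk scan the zvol and rewrite the recovered partition table
testdisk /dev/zvol/kvm-storage/test123
# then, from a rescue environment inside the guest, reinstall the bootloader, e.g. on a GRUB-based guest:
grub-install /dev/sda
update-grub

)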
This happens randomly across all host machines, across all kinds of virtual machines, HDDs and CPUs. It does not hit the same VMs every time. It's all random.
I observed earlier that, when we allowed aggressive swapping, the ZFS systems started to respond slowly ( all zfs commands took long enough to trigger the deadman timeout ) while the zvols still responded normally. At that time, this issue happened basically every 2-3 days.
After turning off swapping, the issue was gone. Until yesterday, when it just came back out of nowhere.
On the same systems, using LVM or qcow2 file storage, this issue does not happen.
Does anyone have an idea? Thank you!