openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zvols lose partition tables in high IO situations -- also crashes qemu kvm #10095

Open layer7gmbh opened 4 years ago

layer7gmbh commented 4 years ago

zfs-0.8.3-1
zfs-kmod-0.8.2-1
Kernel Module: 0.8.3-1
SPL: 0.8.3-1
Kernel: 5.3.11-300.fc31.x86_64

We sometimes run into this issue here:

[8473933.099535] CPU 1/KVM[4043477]: segfault at 2 ip 0000557976af6f37 sp 00007f28459d9eb0 error 4 in qemu-system-x86_64 (deleted)[5579768ca000+44e000]
[8473933.099549] Code: 00 00 00 41 57 41 56 41 55 41 54 49 89 cc b9 a0 03 00 00 55 48 89 f5 48 8d 35 dc 6a 28 00 53 48 89 fb 48 83 c7 38 48 83 ec 38 <44> 0f b7 52 02 4c 8b 7a 08 4c 89 44 24 08 4c 8d 05 f4 41 28 00 45
[8473935.550773] device k3884-yjLmI left promiscuous mode
[8474059.799909] CPU 0/KVM[1214025]: segfault at 2 ip 000056526ae1cf37 sp 00007f11ef4b4eb0 error 4 in qemu-system-x86_64 (deleted)[56526abf0000+44e000]
[8474059.799922] Code: 00 00 00 41 57 41 56 41 55 41 54 49 89 cc b9 a0 03 00 00 55 48 89 f5 48 8d 35 dc 6a 28 00 53 48 89 fb 48 83 c7 38 48 83 ec 38 <44> 0f b7 52 02 4c 8b 7a 08 4c 89 44 24 08 4c 8d 05 f4 41 28 00 45
[8474059.822908] CPU 1/KVM[766470]: segfault at 2 ip 0000559b20e7f757 sp 00007fda44bd3eb0 error 4 in qemu-system-x86_64[559b20c50000+45d000]
[8474059.822917] Code: 00 00 00 41 57 41 56 41 55 41 54 49 89 cc b9 a2 03 00 00 55 48 89 f5 48 8d 35 1c 35 29 00 53 48 89 fb 48 83 c7 38 48 83 ec 38 <44> 0f b7 52 02 4c 8b 7a 08 4c 89 44 24 08 4c 8d 05 b4 08 29 00 45
[8474060.998460] device k3877-2Whqj left promiscuous mode
[8474280.439972] device k3899-PVjeA left promiscuous mode

We have seen this on other KVM host machines running on ZFS as well.

It seems the more load we put on ZFS, the higher the chance of this happening.

Afterwards, the zvol has lost its partition table. It can be recreated easily using something like testdisk.

In addition, you have to reinstall the OS bootloader to get the system booting again.
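
For reference, a rough sketch of such a recovery (the pool/volume and device names are only the examples from this report, and the bootloader commands differ per guest distribution):

testdisk /dev/zvol/kvm-storage/test123    # interactive: Analyse -> Quick Search -> Write, then reboot the guest
# afterwards, from a rescue environment inside the guest, reinstall the bootloader, e.g. on a Debian-style guest:
grub-install /dev/sda
update-grub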

This happens randomly across all host machines and all kinds of virtual machines, HDDs and CPUs. It does not hit the same VMs; it is all random.

I observed earlier that, when we allowed aggressive swapping, the ZFS systems started to respond slowly (all zfs commands took long enough to trigger the deadman timeout) while the zvols responded normally. At that time, this issue happened basically every 2-3 days.

After turning off swapping, the issue was gone, until yesterday, when it came back out of nowhere.

On the same systems, this issue does not happen when using LVM or qcow2 file storage.

Does anyone have any idea? Thank you!

devZer0 commented 4 years ago

could you describe your environment a little bit?

KVM / VM settings/config, for example?

what KVM version/management solution?

no corruption/data loss besides the partition-table loss?

where are the zvols located?

what is meant by "when we allowed aggressive swapping"? Mind that ZFS needs some amount of memory and KVM is also very picky about available memory. Running without swap / additional virtual memory may result in OOM when starting virtual guests, so keep an eye on not overcommitting memory too much on KVM/ZFS hosts.

oh, and you don't run swap on a zvol, do you? ( https://github.com/openzfs/zfs/wiki/FAQ#using-a-zvol-for-a-swap-device )
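
(For anyone checking their own setup, a quick way to verify that could be:)

swapon --show        # lists active swap devices/files
zfs list -t volume   # lists zvols; none of them should appear in the swapon output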

layer7gmbh commented 4 years ago

Hi devZer0,

sorry, I somehow missed your message.

example libvirt definition:

<domain type='kvm'>
  <name>test123</name>
  <uuid>test123</uuid>
  <memory unit='KiB'>4192000</memory>
  <currentMemory unit='KiB'>4192000</currentMemory>
  <blkiotune>
    <device>
      <path>/dev/zvol/kvm-storage/test123</path>
      <read_iops_sec>500</read_iops_sec>
      <write_iops_sec>500</write_iops_sec>
      <read_bytes_sec>51200000</read_bytes_sec>
      <write_bytes_sec>51200000</write_bytes_sec>
    </device>
  </blkiotune>
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <period>100000</period>
    <quota>100000</quota>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-q35-4.2'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-model' check='partial'/>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/zvol/kvm-storage/test123'/>
      <target dev='sda' bus='sata'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci' ports='15'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x15'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x16'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x17'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
    </controller>
    <controller type='pci' index='9' model='pcie-to-pci-bridge'>
      <model name='pcie-pci-bridge'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <interface type='network'>
      <mac address='00:50:56:ad:32:79'/>
      <source network='Public Network'/>
      <target dev='test123'/>
      <model type='e1000'/>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x01' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' listen='0.0.0.0' keymap='en-us' passwd='test123'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </rng>
  </devices>
</domain>

virsh -V
Virsh command line tool of libvirt 5.10.0

virsh -V
Virsh command line tool of libvirt 6.1.0

It does not matter. The kernel does not matter either (we have had 4 LTS, 5 LTS, latest 5).
The distribution does not matter either (it happens on Arch, Fedora, Debian).

There is no corruption; just the partition table gets killed:

 parted /dev/zvol/kvm-storage/test123
GNU Parted 3.3
Using /dev/zd32
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print                                                            
Error: /dev/zd32: unrecognised disk label
Model: Unknown (unknown)                                                  
Disk /dev/zd32: 130GB
Sector size (logical/physical): 512B/8192B
Partition Table: unknown
Disk Flags: 
(parted) quit                                                     

The zpool has cache and log devices (but that does not matter either; it also happens on a single-disk-backed zpool with or without cache/log):

# zpool list -v
NAME                                       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
kvm-storage                               10.9T  1.25T  9.62T        -         -     0%    11%  1.00x    ONLINE  -
  raidz1                                  5.44T   783G  4.67T        -         -     0%  14.1%      -  ONLINE  
    pci-0000:82:00.0-scsi-0:1:23:0            -      -      -        -         -      -      -      -  ONLINE  
    pci-0000:82:00.0-scsi-0:1:24:0            -      -      -        -         -      -      -      -  ONLINE  
    pci-0000:82:00.0-scsi-0:1:26:0            -      -      -        -         -      -      -      -  ONLINE  
  raidz1                                  5.44T   500G  4.95T        -         -     0%  8.97%      -  ONLINE  
    pci-0000:82:00.0-scsi-0:1:27:0            -      -      -        -         -      -      -      -  ONLINE  
    pci-0000:82:00.0-scsi-0:1:28:0            -      -      -        -         -      -      -      -  ONLINE  
    pci-0000:82:00.0-scsi-0:1:29:0            -      -      -        -         -      -      -      -  ONLINE  
logs                                          -      -      -        -         -      -      -      -  -
  mirror                                  74.5G  1.63M  74.5G        -         -     0%  0.00%      -  ONLINE  
    pci-0000:82:00.0-scsi-0:1:30:0-part1      -      -      -        -         -      -      -      -  ONLINE  
    pci-0000:82:00.0-scsi-0:1:31:0-part1      -      -      -        -         -      -      -      -  ONLINE  
cache                                         -      -      -        -         -      -      -      -  -
  pci-0000:82:00.0-scsi-0:1:30:0-part2     372G   177G   196G        -         -     0%  47.5%      -  ONLINE  
  pci-0000:82:00.0-scsi-0:1:31:0-part2     372G   178G   195G        -         -     0%  47.7%      -  ONLINE  

And no, we don't run swap on ZFS volumes, it's not that bad.

qemu-img convert loves to be the No. 1 reason to run into trouble. Here is a crash from today:

[19906.081791] INFO: task qemu-img:2206615 blocked for more than 122 seconds.
[19906.081795]       Tainted: P           OE     5.4.31-1-lts #1
[19906.081796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19906.081799] qemu-img        D    0 2206615 3613123 0x00000080
[19906.081805] Call Trace:
[19906.081828]  ? __schedule+0x2f3/0x750
[19906.081834]  ? bit_wait_timeout+0x90/0x90
[19906.081838]  schedule+0x39/0xa0
[19906.081844]  io_schedule+0x12/0x40
[19906.081849]  bit_wait_io+0xd/0x50
[19906.081854]  __wait_on_bit+0x2a/0x90
[19906.081860]  out_of_line_wait_on_bit+0x92/0xb0
[19906.081868]  ? var_wake_function+0x20/0x20
[19906.081876]  __block_write_begin_int+0x258/0x660
[19906.081885]  ? blkdev_direct_IO+0x4d0/0x4d0
[19906.081890]  ? blkdev_direct_IO+0x4d0/0x4d0
[19906.081894]  block_write_begin+0x48/0xf0
[19906.081904]  generic_perform_write+0xd1/0x1d0
[19906.081911]  __generic_file_write_iter+0xf9/0x1d0
[19906.081917]  blkdev_write_iter+0xd9/0x160
[19906.081926]  new_sync_write+0x16a/0x210
[19906.081932]  vfs_write+0xb6/0x1a0
[19906.081937]  ksys_pwrite64+0x65/0xa0
[19906.081946]  do_syscall_64+0x4e/0x140
[19906.081953]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[19906.081958] RIP: 0033:0x7f7add9d807f
[19906.081970] Code: Bad RIP value.
[19906.081971] RSP: 002b:00007f7adbffc100 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[19906.081975] RAX: ffffffffffffffda RBX: 00007f7ad6ebe000 RCX: 00007f7add9d807f
[19906.081977] RDX: 0000000000200000 RSI: 00007f7ad6ebe000 RDI: 000000000000000a
[19906.081978] RBP: 00007f7ad91d7790 R08: 0000000000000000 R09: 00000000ffffffff
[19906.081980] R10: 00000004c181ee00 R11: 0000000000000293 R12: 0000000000000000
[19906.081982] R13: 00007f7adcc52078 R14: 00007f7adcc8baf0 R15: 00007f7adcc52000

followed by more fun:

[20397.591743] INFO: task systemd-udevd:548 blocked for more than 122 seconds.
[20397.591748]       Tainted: P           OE     5.4.31-1-lts #1
[20397.591749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20397.591752] systemd-udevd   D    0   548      1 0x00000180
[20397.591758] Call Trace:
[20397.591780]  ? __schedule+0x2f3/0x750
[20397.591785]  schedule+0x39/0xa0
[20397.591798]  schedule_preempt_disabled+0xa/0x10
[20397.591802]  __mutex_lock.isra.0+0x19b/0x500
[20397.591813]  __blkdev_get+0x7f/0x550
[20397.591817]  ? blkdev_get_by_dev+0x50/0x50
[20397.591824]  do_dentry_open+0x13d/0x3a0
[20397.591829]  path_openat+0x5d0/0x1650
[20397.591835]  do_filp_open+0xab/0x120
[20397.591841]  do_sys_open+0x19e/0x240
[20397.591851]  do_syscall_64+0x4e/0x140
[20397.591857]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20397.591863] RIP: 0033:0x7f1476f4222b
[20397.591875] Code: Bad RIP value.
[20397.591877] RSP: 002b:00007fffd0a37da0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[20397.591880] RAX: ffffffffffffffda RBX: 00007fffd0a37eb0 RCX: 00007f1476f4222b
[20397.591882] RDX: 00000000000a0800 RSI: 000055eeaf6bec30 RDI: 00000000ffffff9c
[20397.591884] RBP: 000055eeaf6bec30 R08: 00007f1476dc35b0 R09: 0000000000000000
[20397.591885] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000a0800
[20397.591887] R13: 000055eeaf39c000 R14: 000055eeaf79e740 R15: 00007fffd0a37e78
[20520.471723] INFO: task systemd-udevd:548 blocked for more than 245 seconds.
[20520.471728]       Tainted: P           OE     5.4.31-1-lts #1
[20520.471729] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20520.471732] systemd-udevd   D    0   548      1 0x00000184
[20520.471738] Call Trace:
[20520.471759]  ? __schedule+0x2f3/0x750
[20520.471764]  schedule+0x39/0xa0
[20520.471768]  schedule_preempt_disabled+0xa/0x10
[20520.471772]  __mutex_lock.isra.0+0x19b/0x500
[20520.471780]  __blkdev_get+0x7f/0x550
[20520.471784]  ? blkdev_get_by_dev+0x50/0x50
[20520.471790]  do_dentry_open+0x13d/0x3a0
[20520.471795]  path_openat+0x5d0/0x1650
[20520.471801]  do_filp_open+0xab/0x120
[20520.471807]  do_sys_open+0x19e/0x240
[20520.471815]  do_syscall_64+0x4e/0x140
[20520.471820]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20520.471825] RIP: 0033:0x7f1476f4222b
[20520.471837] Code: Bad RIP value.
[20520.471839] RSP: 002b:00007fffd0a37da0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[20520.471843] RAX: ffffffffffffffda RBX: 00007fffd0a37eb0 RCX: 00007f1476f4222b
[20520.471845] RDX: 00000000000a0800 RSI: 000055eeaf6bec30 RDI: 00000000ffffff9c
[20520.471847] RBP: 000055eeaf6bec30 R08: 00007f1476dc35b0 R09: 0000000000000000
[20520.471848] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000a0800
[20520.471850] R13: 000055eeaf39c000 R14: 000055eeaf79e740 R15: 00007fffd0a37e78

and another qemu-img process, with another job:

[20643.361798] INFO: task qemu-img:226645 blocked for more than 122 seconds.
[20643.361803]       Tainted: P           OE     5.4.31-1-lts #1
[20643.361804] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20643.361807] qemu-img        D    0 226645 3613123 0x00000080
[20643.361813] Call Trace:
[20643.361834]  ? __schedule+0x2f3/0x750
[20643.361839]  ? bit_wait_timeout+0x90/0x90
[20643.361842]  schedule+0x39/0xa0
[20643.361846]  io_schedule+0x12/0x40
[20643.361850]  bit_wait_io+0xd/0x50
[20643.361853]  __wait_on_bit+0x2a/0x90
[20643.361857]  out_of_line_wait_on_bit+0x92/0xb0
[20643.361865]  ? var_wake_function+0x20/0x20
[20643.361872]  __block_write_begin_int+0x258/0x660
[20643.361878]  ? blkdev_direct_IO+0x4d0/0x4d0
[20643.361882]  ? blkdev_direct_IO+0x4d0/0x4d0
[20643.361885]  block_write_begin+0x48/0xf0
[20643.361893]  generic_perform_write+0xd1/0x1d0
[20643.361898]  __generic_file_write_iter+0xf9/0x1d0
[20643.361902]  blkdev_write_iter+0xd9/0x160
[20643.361911]  new_sync_write+0x16a/0x210
[20643.361916]  vfs_write+0xb6/0x1a0
[20643.361922]  ksys_pwrite64+0x65/0xa0
[20643.361930]  do_syscall_64+0x4e/0x140
[20643.361936]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20643.361942] RIP: 0033:0x7f7add9d807f
[20643.361954] Code: Bad RIP value.
[20643.361956] RSP: 002b:00007f7ad5fbb100 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[20643.361959] RAX: ffffffffffffffda RBX: 00007f7ad6ebe000 RCX: 00007f7add9d807f
[20643.361961] RDX: 0000000000200000 RSI: 00007f7ad6ebe000 RDI: 000000000000000a
[20643.361963] RBP: 00007f7ad91d7790 R08: 0000000000000000 R09: 00000000ffffffff
[20643.361964] R10: 0000000bab81d200 R11: 0000000000000293 R12: 0000000000000000
[20643.361966] R13: 00007f7adcc52078 R14: 00007f7adcc8b7e0 R15: 00007f7adcc52000

And here even qemu exploded with segfaults:

[31253.939829] GPT:Primary header thinks Alt. header is not at the end of the disk.
[31253.939833] GPT:104857599 != 106954751
[31253.939835] GPT:Alternate GPT header not at the end of the disk.
[31253.939836] GPT:104857599 != 106954751
[31253.939838] GPT: Use GNU Parted to correct GPT errors.
[31253.939862]  zd160: p1 p2 p3 p4
[31552.474344] CPU 3/KVM[1202750]: segfault at 2 ip 0000559d22be5e50 sp 00007fd7b6875dd0 error 4 in qemu-system-x86_64[559d2299e000+524000]
[31552.474359] Code: 6b 00 5b c3 66 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 48 89 cd b9 a2 03 00 00 53 48 89 f3 48 8d 35 d3 a6 34 00 48 83 ec 78 <44> 0f b7 62 02 4c 8b 7a 08 4c 89 44 24 10 4c 8d 05 eb 75 34 00 45
[31552.474430] audit: type=1701 audit(1586633184.304:172): auid=4294967295 uid=65534 gid=992 ses=4294967295 pid=1202723 comm=43505520332F4B564D exe="/usr/bin/qemu-system-x86_64" sig=11 res=1
[31552.531843] audit: type=1130 audit(1586633184.364:173): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-3385535-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[31552.875387] audit: type=1131 audit(1586633184.704:174): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@1-3385535-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[31553.079592] device test123 left promiscuous mode
[31553.079637] audit: type=1700 audit(1586633184.904:175): dev=test123 prom=0 old_prom=256 auid=4294967295 uid=65534 gid=992 ses=4294967295
[31579.671578] INFO: task systemd-udevd:584669 blocked for more than 122 seconds.
[31579.671583]       Tainted: P           OE     5.4.31-1-lts #1
[31579.671584] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[31579.671587] systemd-udevd   D    0 584669      1 0x00000184
[31579.671593] Call Trace:
[31579.671614]  ? __schedule+0x2f3/0x750
[31579.671619]  schedule+0x39/0xa0
[31579.671623]  schedule_preempt_disabled+0xa/0x10
[31579.671628]  __mutex_lock.isra.0+0x19b/0x500
[31579.671638]  __blkdev_get+0x7f/0x550
[31579.671643]  ? blkdev_get_by_dev+0x50/0x50
[31579.671649]  do_dentry_open+0x13d/0x3a0
[31579.671654]  path_openat+0x5d0/0x1650
[31579.671666]  ? do_user_addr_fault+0x1e6/0x440
[31579.671669]  do_filp_open+0xab/0x120
[31579.671676]  do_sys_open+0x19e/0x240
[31579.671685]  do_syscall_64+0x4e/0x140
[31579.671692]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[31579.671697] RIP: 0033:0x7fa20862b22b
[31579.671715] Code: Bad RIP value.
[31579.671717] RSP: 002b:00007ffd1aa14f60 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[31579.671720] RAX: ffffffffffffffda RBX: 00007ffd1aa15070 RCX: 00007fa20862b22b
[31579.671722] RDX: 00000000000a0800 RSI: 000055b7cc8bd500 RDI: 00000000ffffff9c
[31579.671724] RBP: 000055b7cc8bd500 R08: 00007fa2084ac5b0 R09: 0000000000000000
[31579.671726] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000a0800
[31579.671727] R13: 000055b7cc5c6000 R14: 000055b7cc8d0360 R15: 00007ffd1aa15038

the qemu-img command is:

qemu-img convert -p -O raw $qcow2_or_$vmdk /dev/zvol/$zpool/$zfsvolume

The volumes are created via:

zfs create -V $sizeG $zpool/$zfsvolume

Compression is on.
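
(For completeness, the relevant zvol properties can be checked with something like the following; the pool/volume name is just the example from above:)

zfs get compression,compressratio,volsize,volblocksize kvm-storage/test123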

Inside the guests, the party looks like this:

[Screenshot 2020-04-11 at 21:36:58]

It would be super nice if someone could give a hint as to why this happens.

It is effectively reproducible. The probability is proportional to the IO happening on the server.

Running 2x qemu-img at once is basically a 100% guarantee that the zvols will explode and the KVM/qemu processes will die with a segfault or similar.

layer7gmbh commented 4 years ago

A simple:

zfs destroy kvm-storage/test3

will take

79 seconds

while running:

qemu-img convert -p -O raw test1-flat.vmdk /dev/zvol/kvm-storage/test1

while nothing else is running on this server.

Output of iotop:

Total DISK READ :     112.22 M/s | Total DISK WRITE :     151.38 M/s
Actual DISK READ:     112.22 M/s | Actual DISK WRITE:      67.04 M/s
    PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                             
   1466 be/4 root        602.00 K      0.00 B  0.00 % 81.16 % [txg_sync]
  26982 be/4 root         14.46 G     14.65 G  0.00 % 59.11 % qemu-img convert -p -O raw test1-flat.vmdk /dev/zvol/kvm-storage/test1
    453 be/4 root          0.00 B      0.00 B  0.00 % 21.14 % [kworker/u65:8-events_freezable_power_]
    442 be/4 root          0.00 B      0.00 B  0.00 % 18.18 % [kworker/u65:4-events_power_efficient]
    709 be/4 root          0.00 B      0.00 B  0.00 % 15.21 % [l2arc_feed]
    444 be/4 root          0.00 B      0.00 B  0.00 % 12.27 % [kworker/u65:6-events_power_efficient]
      7 be/4 root          0.00 B      0.00 B  0.00 %  9.21 % [kworker/u65:0-events_freezable_power_]
   1154 be/0 root          0.00 B      0.00 B  0.00 %  8.38 % [z_null_iss]
 536113 be/4 root          0.00 B      0.00 B  0.00 % 13.29 % [kworker/u65:1-events_freezable_power_]
   1155 be/0 root          0.00 B      0.00 B  0.00 %  4.36 % [z_null_int]
    428 be/4 root          0.00 B      0.00 B  0.00 %  3.47 % [kworker/u65:2-events_unbound]
   1191 be/0 root          0.00 B      0.00 B  0.00 %  0.05 % [z_wr_int]
   1195 be/0 root          0.00 B      0.00 B  0.00 %  0.04 % [z_wr_int]
   1197 be/0 root          0.00 B      0.00 B  0.00 %  0.04 % [z_wr_int]
   1194 be/0 root          0.00 B      0.00 B  0.00 %  0.04 % [z_wr_int]
   1190 be/0 root          0.00 B      0.00 B  0.00 %  0.03 % [z_wr_int]
[...]

top will look like:

top - 10:58:02 up 11 min,  4 users,  load average: 42.91, 24.19, 10.25
Tasks: 527 total,   2 running, 525 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  8.9 sy,  0.0 ni, 70.2 id, 19.8 wa,  0.4 hi,  0.5 si,  0.0 st
MiB Mem : 193407.7 total, 102309.5 free,  23659.5 used,  67438.7 buff/cache
MiB Swap:  92649.0 total,  92649.0 free,      0.0 used. 168320.7 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                      
  26982 root      20   0  252192  34596   7540 S  29.5   0.0   1:18.38 qemu-img                         
[...]

A load of 40 on a 32-core CPU with a single zvol doing IO?

layer7gmbh commented 4 years ago

So all in all it seems to me that the ZFS scheduler leaves zvols starving without any IO time for minutes, leading to these situations.

How can I force ZFS to give at least a minimum of IO time to everything?
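
(As a starting point, and purely as a sketch: the current zvol and vdev scheduler tunables can be inspected at runtime before experimenting with them, e.g.:)

grep . /sys/module/zfs/parameters/zvol_* /sys/module/zfs/parameters/zfs_vdev_*_active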

devZer0 commented 4 years ago

does this happen with zvols only? how does it behave with an ordinary file, e.g. if you convert the vmdk to a ZFS-backed file?

also have a look at this one: https://github.com/openzfs/zfs/issues/7631

and maybe also this one: https://bugzilla.proxmox.com/show_bug.cgi?id=1453

devZer0 commented 4 years ago

oh, and what about this one ?

https://bugzilla.proxmox.com/show_bug.cgi?id=2624

layer7gmbh commented 4 years ago

This is:

HDD -> Zvol ( same process blocking multiple times )

[ 3194.406737] INFO: task qemu-img:27318 blocked for more than 245 seconds.
[ 3194.406741]       Tainted: P           OE     5.4.31-1-lts #1
[ 3194.406742] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3194.406744] qemu-img        D    0 27318   6011 0x00004080
[ 3194.406750] Call Trace:
[ 3194.406765]  ? __schedule+0x2f3/0x750
[ 3194.406770]  ? bit_wait_timeout+0x90/0x90
[ 3194.406773]  schedule+0x39/0xa0
[ 3194.406777]  io_schedule+0x12/0x40
[ 3194.406780]  bit_wait_io+0xd/0x50
[ 3194.406783]  __wait_on_bit+0x2a/0x90
[ 3194.406787]  out_of_line_wait_on_bit+0x92/0xb0
[ 3194.406793]  ? var_wake_function+0x20/0x20
[ 3194.406799]  __block_write_begin_int+0x258/0x660
[ 3194.406805]  ? blkdev_direct_IO+0x4d0/0x4d0
[ 3194.406809]  ? blkdev_direct_IO+0x4d0/0x4d0
[ 3194.406812]  block_write_begin+0x48/0xf0
[ 3194.406818]  generic_perform_write+0xd1/0x1d0
[ 3194.406823]  __generic_file_write_iter+0xf9/0x1d0
[ 3194.406827]  blkdev_write_iter+0xd9/0x160
[ 3194.406834]  new_sync_write+0x16a/0x210
[ 3194.406840]  vfs_write+0xb6/0x1a0
[ 3194.406844]  ksys_pwrite64+0x65/0xa0
[ 3194.406851]  do_syscall_64+0x4e/0x140
[ 3194.406857]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3194.406862] RIP: 0033:0x7fe7fea0a07f
[ 3194.406871] Code: Bad RIP value.
[ 3194.406873] RSP: 002b:00007fe7f2ffc100 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[ 3194.406876] RAX: ffffffffffffffda RBX: 00007fe7ef9ff000 RCX: 00007fe7fea0a07f
[ 3194.406878] RDX: 0000000000fffe00 RSI: 00007fe7ef9ff000 RDI: 000000000000000a
[ 3194.406880] RBP: 00007fe7fb0e66d0 R08: 0000000000000000 R09: 00000000ffffffff
[ 3194.406881] R10: 0000001dc5821000 R11: 0000000000000293 R12: 0000000000000000
[ 3194.406883] R13: 00007fe7fdc52078 R14: 00007fe7fdc8b8c0 R15: 00007fe7fdc52000

[ 3440.156743] INFO: task qemu-img:27318 blocked for more than 491 seconds.
[ 3440.156747]       Tainted: P           OE     5.4.31-1-lts #1
[ 3440.156748] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3440.156750] qemu-img        D    0 27318   6011 0x00004080
[ 3440.156755] Call Trace:
[ 3440.156771]  ? __schedule+0x2f3/0x750
[ 3440.156776]  ? bit_wait_timeout+0x90/0x90
[ 3440.156779]  schedule+0x39/0xa0
[ 3440.156783]  io_schedule+0x12/0x40
[ 3440.156787]  bit_wait_io+0xd/0x50
[ 3440.156790]  __wait_on_bit+0x2a/0x90
[ 3440.156794]  out_of_line_wait_on_bit+0x92/0xb0
[ 3440.156800]  ? var_wake_function+0x20/0x20
[ 3440.156806]  __block_write_begin_int+0x258/0x660
[ 3440.156811]  ? blkdev_direct_IO+0x4d0/0x4d0
[ 3440.156816]  ? blkdev_direct_IO+0x4d0/0x4d0
[ 3440.156819]  block_write_begin+0x48/0xf0
[ 3440.156825]  generic_perform_write+0xd1/0x1d0
[ 3440.156830]  __generic_file_write_iter+0xf9/0x1d0
[ 3440.156834]  blkdev_write_iter+0xd9/0x160
[ 3440.156840]  new_sync_write+0x16a/0x210
[ 3440.156846]  vfs_write+0xb6/0x1a0
[ 3440.156851]  ksys_pwrite64+0x65/0xa0
[ 3440.156857]  do_syscall_64+0x4e/0x140
[ 3440.156863]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3440.156868] RIP: 0033:0x7fe7fea0a07f
[ 3440.156877] Code: Bad RIP value.
[ 3440.156879] RSP: 002b:00007fe7f2ffc100 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[ 3440.156882] RAX: ffffffffffffffda RBX: 00007fe7ef9ff000 RCX: 00007fe7fea0a07f
[ 3440.156884] RDX: 0000000000fffe00 RSI: 00007fe7ef9ff000 RDI: 000000000000000a
[ 3440.156886] RBP: 00007fe7fb0e66d0 R08: 0000000000000000 R09: 00000000ffffffff
[ 3440.156887] R10: 0000001dc5821000 R11: 0000000000000293 R12: 0000000000000000
[ 3440.156889] R13: 00007fe7fdc52078 R14: 00007fe7fdc8b8c0 R15: 00007fe7fdc52000

This happened maybe 4-5 times.

With the same conversions going to the zpool mountpoint /kvm-storage:

qemu-img convert -p -O raw test1-flat.vmdk /kvm-storage/test_1

qemu-img convert -p -O raw test2-flat.vmdk /kvm-storage/test_2

qemu-img convert -p -O raw test3-flat.vmdk /kvm-storage/test_3

I was not able to reproduce the issue.

So maybe it is indeed something that affects zvols only.


@devZer0 THANK YOU very much for putting so much effort into checking other sources for information about this issue!

I read through all of them, and I will now create an /etc/modprobe.d/zfs.conf with:

zvol_threads=8 zvol_request_sync=1
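
(Note: inside /etc/modprobe.d/zfs.conf the parameters need the "options zfs" prefix to be picked up at the next module load/boot; roughly:)

# /etc/modprobe.d/zfs.conf
options zfs zvol_threads=8 zvol_request_sync=1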

And I will restart/continue my testing now.

layer7gmbh commented 4 years ago

The good news:

It feels like something happened (I had to start 2x qemu-img convert + 1x file transfer to the mounted /kvm-storage ZFS datastore); before, 1-2x qemu-img convert were already enough. So maybe it got a bit more resilient...

The bad news:

 815650 be/4 root          2.31 G   1968.23 M  0.00 % 25.42 % qemu-img convert -p -O raw test1.vmdk /dev/zvol/kvm-storage/test1
1340209 be/4 root          9.53 G      9.39 G  0.00 % 25.31 % qemu-img convert -p -O raw test2.vmdk /dev/zvol/kvm-storage/test2

After 2-3 minutes:

 815650 be/4 root          2.31 G   1968.23 M  0.00 % 21.59 % qemu-img convert -p -O raw test1 /dev/zvol/kvm-storage/test1
1340209 be/4 root          9.53 G      9.39 G  0.00 % 19.65 % qemu-img convert -p -O raw test2 /dev/zvol/kvm-storage/test2

... nothing moved.

And a VM looks like this:

[Screenshot 2020-04-12 at 16:43:51]

BUT so far, no kernel dumps =)

So things are still far from being OK.

ZFS simply withholds IO time from zvols for too long. I have to find a way to force ZFS to give at least a bit of IO time to everyone, and not just let something get ZERO IO time for multiple minutes.

devZer0 commented 4 years ago

mhh, weird.

I would try triggering that problem with a mostly incompressible and consistent "write only" load to see if it makes a difference. Maybe qemu-img is doing something special...

maybe "shred /dev/zvol/kvm-storage/test1" or something like that.... (or disable compression and use dd if=/dev/zero of=/dev/zvol/kvm-storage/test1 bs=1024k... )

anyhow, it indeed looks like some scheduling/starvation issue...

layer7gmbh commented 4 years ago

Ok,

I am not able to kill a single zvol's IO (fs check of a qemu VM) with 3x shred on zvols.

But with a single qemu-img it will go straight to frozen ... failed command: READ FPDMA QUEUED ... status: { DRDY } and all the rest. Interesting to mention that it seems:

1x qemu-img is going to kill things, while

1x qemu-img AND 3x shred at the same time seems not to kill things even after minutes... maybe this qemu-img gets so little read IO because of the shreds that it does not actually get the chance to block things.

So this qemu-img does a pretty good job (whatever it actually does).

I will now start trying my luck with

https://github.com/openzfs/zfs/wiki/ZIO-Scheduler

devZer0 commented 4 years ago

I can easily reproduce your issue on my recent Proxmox system with the latest ZFS 0.8.3.

while issuing the following command on the proxmox host (filling the zvol of an offline vm):

dd if=/dev/urandom of=/dev/zvol/hddpool/vms/vm-111-disk-0 bs=1024k

shortly after, inside a KVM VM, the following command shows no read progress anymore:

while true;do dd if=/dev/sda bs=1024k |pv >/dev/null;done

in the end, inside the VM I'm getting:

[sda] abort

i.e. i/o completely got stuck

what makes things even worse:

the VM is not using a zvol on hddpool but on ssdpool, so apparently it's about I/O starvation in the kernel/ZFS layer and not in the disk layer...

I would call this a serious issue (which is not about qemu-img at all).

not sure if kvm or zfs is to blame here...

please, can you have a look, @behlendorf?

layer7gmbh commented 4 years ago

Yes, it does not matter which OS or version.

I tested Fedora 31/30, CentOS 7 and Arch Linux here, with versions between 0.7.x and the latest 0.8.3, on different hardware setups (RAID controller, HBA, native, JBOD).

So far it is more and more safe to say:

qemu-img on datasets works without any issues. qemu-img on a zvol will most probably cause issues (unless you just have tiny amounts of data, e.g. 0-10 GB images).

I tried these settings in different combinations and with different values:

zfs_vdev_sync_write_max_active=5
zfs_vdev_sync_write_min_active=2
zfs_vdev_sync_read_min_active=20
zfs_vdev_sync_read_max_active=30
zfs_vdev_async_write_max_active=5
zfs_vdev_async_write_min_active=2
zfs_vdev_async_read_min_active=20
zfs_vdev_async_read_max_active=30
zvol_request_sync=1
zfs_vdev_async_write_active_max_dirty_percent=5
zfs_vdev_async_write_active_min_dirty_percent=1

as well as:

vm.dirty_ratio = 5
vm.dirty_background_ratio = 1
vm.vfs_cache_pressure = 500
vm.dirtytime_expire_seconds = 20
vm.swappiness = 0

Sometimes the error comes earlier, sometimes it (seems to) come later, but there is no completely fail-safe way to reproduce this: sometimes it might work, most times not (and if not, it fails sometimes earlier and sometimes later).


So far the workaround:

Don't use qemu-img with zvols.

ionice won't help either, by the way.
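
(If qemu-img cannot be avoided, one untested idea is to bypass the host page cache on the destination so writeback to the zvol does not pile up; the file and volume names are just the earlier examples:)

qemu-img convert -p -t none -O raw test1-flat.vmdk /dev/zvol/kvm-storage/test1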

devZer0 commented 4 years ago

I have found with further testing that this seems completely unrelated to KVM.

I also get completely stalled reads from a zvol on SSD when doing a large streaming write to a zvol on HDD, i.e.:

dd if=/dev/urandom bs=1024k |pv > /dev/zvol/hddpool/zvol1

makes

dd if=/dev/zvol/ssdpool/zvol1 bs=1024k |pv >/dev/null

hang completely

furthermore, after about 4-5 GB of writes, performance drops significantly, by at least 3/4 (from >120 MB/s to <30 MB/s), even without reading in parallel.

layer7gmbh commented 4 years ago

Yes, that is indeed another issue.

But honestly I am already quite happy I found the cause of these segfaults and qemu-kvm crashes.

But you are right, ZFS's zvol code, at least here on Linux, seems to have quite some room for improvement.

With qemu-img, something happens to the zvol that makes things go downhill, while with dd it does not lead to crashes (although performance may still feel like my 3.5" floppy).

devZer0 commented 4 years ago

With qemu-img, something happens to the zvol that makes things go downhill, while with dd it does not lead to crashes

I think this is just a matter of the difference in write performance/volume.

try disabling compression on the zvol and try writing to it with a larger block size (dd if=/dev/zero bs=1024k...), I bet you will see the same issue as with qemu-img.

btw, I can NOT reproduce the issue with blocksize=1k, as this won't fully saturate I/O - but with bs=8k I can easily trigger it.

btw, on my system ashift is 9 for hddpool and 12 for ssdpool.
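
(Spelled out, the suggested write-only test could look like this, reusing the example names from above:)

zfs set compression=off kvm-storage/test1
dd if=/dev/zero of=/dev/zvol/kvm-storage/test1 bs=1024k status=progress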

layer7gmbh commented 4 years ago

Yeah, well, unfortunately you are right. If the IO just takes long enough, e.g. with:

dd if=raw of=zvol bs=4k conv=fdatasync,sparse

things will go downhill with 500 GB of raw data.

During the dd, two random KVMs died...

[177814.217766] CPU 1/KVM[3965349]: segfault at 2 ip 000055dfa44a7e50 sp 00007fe18f1e5dd0 error 4 in qemu-system-x86_64[55dfa4260000+524000]
[177814.217777] Code: 6b 00 5b c3 66 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 48 89 cd b9 a2 03 00 00 53 48 89 f3 48 8d 35 d3 a6 34 00 48 83 ec 78 <44> 0f b7 62 02 4c 8b 7a 08 4c 89 44 24 10 4c 8d 05 eb 75 34 00 45

[177854.346110] CPU 1/KVM[263273]: segfault at 2 ip 0000558c0b258e50 sp 00007fcc10d67dd0 error 4 in qemu-system-x86_64[558c0b011000+524000]
[177854.346125] Code: 6b 00 5b c3 66 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 48 89 cd b9 a2 03 00 00 53 48 89 f3 48 8d 35 d3 a6 34 00 48 83 ec 78 <44> 0f b7 62 02 4c 8b 7a 08 4c 89 44 24 10 4c 8d 05 eb 75 34 00 45

So we can consider zvols close to unusable for really reliable stuff. Back to LVM I guess... super sad.

scineram commented 4 years ago

Could you test with https://github.com/openzfs/zfs/commit/0929c4de398606f8305057ca540cf577e6771c30 and https://github.com/openzfs/zfs/commit/20f287855a61a64a90e01edfd551ff28f7abdb5c?

devZer0 commented 4 years ago

here we go.

unfortunately, I don't see a significant improvement.

compared to a file or LVM, writes to a zvol are slow, while reads starve severely.

Test System:
platform:  kvm/proxmox virtual machine backed by 500gb vdisk/raw file on zfs with lz4 compression. 
hardware:  fujitsu rx300 s6 / 1x xeon L5630

guest os distro: CentOS Linux release 7.7.1908 (Core)
guest os kernel : 3.10.0-1062.18.1.el7.x86_64
guest os zpool:  no raid or anything, just simple testpool on 500gb vdisk ( /dev/sdc )

test for reading:
# while true;do dd if=$path1 bs=1024k |pv -i60 >/dev/null ;done

test for writing:
# while true;do dd if=/dev/zero bs=1024k |pv -i60 >$path2 ;done

(start test in 2 terminal windows concurrently)

read    path1                    write   path2                     backend
287MB/s /dev/vg/lv1              360MB/s /dev/vg/lv2               lvm logical volume
950kB/s /dev/zvol/testpool/zvol1  40MB/s /dev/zvol/testpool/zvol2  zfs v0.8.3-1 zvol w/o compr 
240MB/s /testpool/file1          230MB/s /testpool/file2           zfs v0.8.3-1 file w/o compr 
1,2MB/s /dev/zvol/testpool/zvol1  55MB/s /dev/zvol/testpool/zvol2  zfs v0.8.0-728_ga7929f3 zvol w/o compr
230MB/s /testpool/file1          230MB/s /testpool/file2           zfs v0.8.0-728_ga7929f3 file w/o compr
devsnowy commented 4 years ago

Hi guys, I'm not sure if this is related but I believe I'm in the same situation. I have latest versions:

proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
zfs-0.8.4-pve1
zfs-kmod-0.8.4-pve1

I'm getting crashes only after high load. It usually hangs the VMs entirely, as well as the zpool's traffic. I can replicate this by just installing Proxmox, making a ZFS mirror pool and creating one VM; after installing any operating system normally, copying a file works fine until you copy it multiple times, at which point it freezes, dropping from high speeds to 0 bytes.

The host has 128 GB RAM and a Threadripper, so I don't think there is any memory or CPU limitation involved.

I've found this crash in syslog:

Jun 10 02:18:48 pve kernel: INFO: task txg_sync:61435 blocked for more than 120 seconds.
Jun 10 02:18:48 pve kernel: Tainted: P OE 5.4.41-1-pve #1
Jun 10 02:18:48 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 02:18:48 pve kernel: txg_sync D 0 61435 2 0x80004000
Jun 10 02:18:48 pve kernel: Call Trace:
Jun 10 02:18:48 pve kernel: __schedule+0x2e6/0x6f0
Jun 10 02:18:48 pve kernel: schedule+0x33/0xa0
Jun 10 02:18:48 pve kernel: schedule_timeout+0x152/0x300
Jun 10 02:18:48 pve kernel: ? next_timer_interrupt+0xd0/0xd0
Jun 10 02:18:48 pve kernel: io_schedule_timeout+0x1e/0x50
Jun 10 02:18:48 pve kernel: cv_timedwait_common+0x12f/0x170 [spl]
Jun 10 02:18:48 pve kernel: ? wait_woken+0x80/0x80
Jun 10 02:18:48 pve kernel: cv_timedwait_io+0x19/0x20 [spl]
Jun 10 02:18:48 pve kernel: zio_wait+0x130/0x270 [zfs]
Jun 10 02:18:48 pve kernel: vdev_uberblock_sync_list+0x91/0x130 [zfs]
Jun 10 02:18:48 pve kernel: vdev_config_sync+0x157/0x240 [zfs]
Jun 10 02:18:48 pve kernel: spa_sync+0xbd8/0xfa0 [zfs]
Jun 10 02:18:48 pve kernel: txg_sync_thread+0x2d9/0x4c0 [zfs]
Jun 10 02:18:48 pve kernel: ? txg_thread_exit.isra.12+0x60/0x60 [zfs]
Jun 10 02:18:48 pve kernel: thread_generic_wrapper+0x74/0x90 [spl]
Jun 10 02:18:48 pve kernel: kthread+0x120/0x140
Jun 10 02:18:48 pve kernel: ? thread_exit+0x20/0x20 [spl]
Jun 10 02:18:48 pve kernel: ? kthread_park+0x90/0x90
Jun 10 02:18:48 pve kernel: ret_from_fork+0x22/0x40
Jun 10 02:18:48 pve kernel: INFO: task kvm:90079 blocked for more than 120 seconds.
Jun 10 02:18:48 pve kernel: Tainted: P OE 5.4.41-1-pve #1
Jun 10 02:18:48 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 02:18:48 pve kernel: kvm D 0 90079 1 0x00004000
Jun 10 02:18:48 pve kernel: Call Trace:
Jun 10 02:18:48 pve kernel: __schedule+0x2e6/0x6f0
Jun 10 02:18:48 pve kernel: schedule+0x33/0xa0
Jun 10 02:18:48 pve kernel: cv_wait_common+0x104/0x130 [spl]
Jun 10 02:18:48 pve kernel: ? wait_woken+0x80/0x80
Jun 10 02:18:48 pve kernel: __cv_wait+0x15/0x20 [spl]
Jun 10 02:18:48 pve kernel: zil_commit_impl+0x241/0xd90 [zfs]
Jun 10 02:18:48 pve kernel: zil_commit+0x3d/0x60 [zfs]
Jun 10 02:18:48 pve kernel: zvol_request+0x233/0x3c0 [zfs]
Jun 10 02:18:48 pve kernel: generic_make_request+0xcf/0x310
Jun 10 02:18:48 pve kernel: submit_bio+0x46/0x1c0
Jun 10 02:18:48 pve kernel: ? radix_tree_lookup+0xd/0x10
Jun 10 02:18:48 pve kernel: ? blkg_lookup_slowpath+0x27/0x50
Jun 10 02:18:48 pve kernel: submit_bio_wait+0x59/0x90
Jun 10 02:18:48 pve kernel: blkdev_issue_flush+0x8e/0xc0
Jun 10 02:18:48 pve kernel: blkdev_fsync+0x35/0x50
Jun 10 02:18:48 pve kernel: vfs_fsync_range+0x48/0x80
Jun 10 02:18:48 pve kernel: ? fget_light+0x59/0x70
Jun 10 02:18:48 pve kernel: do_fsync+0x3d/0x70
Jun 10 02:18:48 pve kernel: __x64_sys_fdatasync+0x17/0x20
Jun 10 02:18:48 pve kernel: do_syscall_64+0x57/0x190
Jun 10 02:18:48 pve kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 10 02:18:48 pve kernel: RIP: 0033:0x7ff1985e02e7
Jun 10 02:18:48 pve kernel: Code: Bad RIP value.
Jun 10 02:18:48 pve kernel: RSP: 002b:00007fef758324c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jun 10 02:18:48 pve kernel: RAX: ffffffffffffffda RBX: 0000000000000011 RCX: 00007ff1985e02e7
Jun 10 02:18:48 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000011
Jun 10 02:18:48 pve kernel: RBP: 00007ff18b66ca10 R08: 0000000000000000 R09: 00000000ffffffff
Jun 10 02:18:48 pve kernel: R10: 00007fef758324b0 R11: 0000000000000293 R12: 000055f07aa18e12
Jun 10 02:18:48 pve kernel: R13: 00007ff18b66ca78 R14: 00007ff18b0dd180 R15: 00007ff18b139010
Jun 10 02:18:48 pve kernel: INFO: task kvm:60453 blocked for more than 120 seconds.
Jun 10 02:18:48 pve kernel: Tainted: P OE 5.4.41-1-pve #1
Jun 10 02:18:48 pve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 10 02:18:48 pve kernel: kvm D 0 60453 1 0x00004000
Jun 10 02:18:48 pve kernel: Call Trace:
Jun 10 02:18:48 pve kernel: __schedule+0x2e6/0x6f0
Jun 10 02:18:48 pve kernel: schedule+0x33/0xa0
Jun 10 02:18:48 pve kernel: cv_wait_common+0x104/0x130 [spl]
Jun 10 02:18:48 pve kernel: ? wait_woken+0x80/0x80
Jun 10 02:18:48 pve kernel: __cv_wait+0x15/0x20 [spl]
Jun 10 02:18:48 pve kernel: zil_commit_impl+0x241/0xd90 [zfs]
Jun 10 02:18:48 pve kernel: zil_commit+0x3d/0x60 [zfs]
Jun 10 02:18:48 pve kernel: zvol_request+0x233/0x3c0 [zfs]
Jun 10 02:18:48 pve kernel: generic_make_request+0xcf/0x310
Jun 10 02:18:48 pve kernel: submit_bio+0x46/0x1c0
Jun 10 02:18:48 pve kernel: ? radix_tree_lookup+0xd/0x10
Jun 10 02:18:48 pve kernel: ? blkg_lookup_slowpath+0x27/0x50
Jun 10 02:18:48 pve kernel: submit_bio_wait+0x59/0x90
Jun 10 02:18:48 pve kernel: blkdev_issue_flush+0x8e/0xc0
Jun 10 02:18:48 pve kernel: blkdev_fsync+0x35/0x50
Jun 10 02:18:48 pve kernel: vfs_fsync_range+0x48/0x80
Jun 10 02:18:48 pve kernel: ? fget_light+0x59/0x70
Jun 10 02:18:48 pve kernel: do_fsync+0x3d/0x70
Jun 10 02:18:48 pve kernel: __x64_sys_fdatasync+0x17/0x20
Jun 10 02:18:48 pve kernel: do_syscall_64+0x57/0x190
Jun 10 02:18:48 pve kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 10 02:18:48 pve kernel: RIP: 0033:0x7f035929a2e7
Jun 10 02:18:48 pve kernel: Code: Bad RIP value.
Jun 10 02:18:48 pve kernel: RSP: 002b:00007f034b5794c0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jun 10 02:18:48 pve kernel: RAX: ffffffffffffffda RBX: 0000000000000011 RCX: 00007f035929a2e7
Jun 10 02:18:48 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000011
Jun 10 02:18:48 pve kernel: RBP: 00007f034c46ca10 R08: 0000000000000000 R09: 00000000ffffffff
Jun 10 02:18:48 pve kernel: R10: 00007f034b5794b0 R11: 0000000000000293 R12: 000055fd70597e12
Jun 10 02:18:48 pve kernel: R13: 00007f034c46ca78 R14: 00007f034bedce00 R15: 00007f034beac890

bland328 commented 3 years ago

FWIW, just in case it inspires anyone's thinking on this:

On a Slackware box with XFS (not ZFS!), under the heaviest of I/O (e.g. multiple rsyncs running), I occasionally see the same KVM...segfault at 2...error 4 failure.

And when it happens, it eats the partition table for the VM's boot "disk" so regularly that I have a dd-based script standing by to replace the first "sector" of the raw disk image file.
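
(Such a one-sector restore could look roughly like this; the file names are hypothetical, and the backup copy of course has to be taken while the image is still intact:)

dd if=vm-boot.raw of=sector0.bak bs=512 count=1                 # save the known-good first sector
dd if=sector0.bak of=vm-boot.raw bs=512 count=1 conv=notrunc    # put it back after corruption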

I don't, however, see any corruption of the host XFS filesystem.

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

devZer0 commented 2 years ago

@layer7gmbh, what type of KVM host is this? It seems it's NOT Proxmox-based, but most (if not all, besides this one) of the reports I see on lost partition tables are in the Proxmox forum/community.

Is vzdump involved in the process of losing the partition table?

Anyhow, I think it's not a ZFS issue but a KVM/qemu issue: https://bugzilla.proxmox.com/show_bug.cgi?id=2874

devZer0 commented 1 year ago

@layer7gmbh, can you provide some feedback so we can decide whether this needs to be left open?

devZer0 commented 1 year ago

regarding the virtual machine hangs, there are some interesting findings at https://bugzilla.proxmox.com/show_bug.cgi?id=1453, but I'm not sure if they are relevant here.

devZer0 commented 1 year ago

@layer7gmbh, does this problem still exist?

devZer0 commented 1 year ago

@layer7gmbh, you seem to use a SATA disk in the virtual machine configuration.

Can you confirm that losing the partition table only happened in virtual machines with a virtual SATA disk?

Apparently there is an issue with that in qemu: https://bugzilla.proxmox.com/show_bug.cgi?id=2874#c51

layer7gmbh commented 1 year ago

@devZer0

Probably this kind of issue was not (only) related to SATA. Also, according to your qemu link they talk about unclean shutdown situations where partition tables might get lost; that was not the case here.

Unfortunately I am not able to tell you much more. With our current setup:

Kernel 5.15.83-1-pve
zfs-2.1.7-pve2
zfs-kmod-2.1.7-pve1

it seems we don't run into this kind of issue anymore.

devZer0 commented 10 months ago

The sector 0 corruption bug has been found and is being fixed (related to SATA/IDE and backup interruption), see https://bugzilla.proxmox.com/show_bug.cgi?id=2874#c51

Did it happen again for you?

If not, I think we can close this issue!? Would you be so kind, @layer7gmbh?