openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

txg_sync @ constant high I/O usage (drive operations pretty slow) #5488

Closed tsch157 closed 5 years ago

tsch157 commented 7 years ago

I'm having an issue with ZFS performance on the host. There is a high I/O load on the system, causing high system load and slow drive operations. This I/O load comes from txg_sync, which is constantly running with a large I/O demand.

Installing an OS or performing other drive operations in the VMs running on the system takes a very long time.

I'm on Debian 8 64-bit with Linux kernel 4.4.6-1-pve, ZFS 0.6.5.6-1 and SPL 0.6.5.6-349_ge317290.

Here is some information which may be helpful for tracking down the issue:

cat /proc/spl/kstat/zfs/arcstats 6 1 0x01 91 4368 5101900648 7044332794010130 name type data hits 4 1994624869 misses 4 186895093 demand_data_hits 4 1785114934 demand_data_misses 4 64731285 demand_metadata_hits 4 110871697 demand_metadata_misses 4 26483558 prefetch_data_hits 4 98601093 prefetch_data_misses 4 95642996 prefetch_metadata_hits 4 37145 prefetch_metadata_misses 4 37254 mru_hits 4 494863462 mru_ghost_hits 4 36625896 mfu_hits 4 1428369765 mfu_ghost_hits 4 11053931 deleted 4 196264504 mutex_miss 4 28692 evict_skip 4 134361520 evict_not_enough 4 2871842 evict_l2_cached 4 20446992813056 evict_l2_eligible 4 2545214473728 evict_l2_ineligible 4 6202718568448 evict_l2_skip 4 1402 hash_elements 4 966046 hash_elements_max 4 4319663 hash_collisions 4 53810532 hash_chains 4 13762 hash_chain_max 4 6 p 4 7824013824 c 4 8589934592 c_min 4 8589934592 c_max 4 34359738368 size 4 8589892872 hdr_size 4 80194992 data_size 4 8334737408 metadata_size 4 79804416 other_size 4 17998248 anon_size 4 15419392 anon_evictable_data 4 0 anon_evictable_metadata 4 0 mru_size 4 7554035712 mru_evictable_data 4 7492468736 mru_evictable_metadata 4 827392 mru_ghost_size 4 1022669312 mru_ghost_evictable_data 4 668729344 mru_ghost_evictable_metadata 4 353939968 mfu_size 4 845086720 mfu_evictable_data 4 830210048 mfu_evictable_metadata 4 65536 mfu_ghost_size 4 7537443328 mfu_ghost_evictable_data 4 6818496512 mfu_ghost_evictable_metadata 4 718946816 l2_hits 4 68646059 l2_misses 4 118248983 l2_feeds 4 5607476 l2_rw_clash 4 327 l2_read_bytes 4 3086656800768 l2_write_bytes 4 25263423164928 l2_writes_sent 4 5560770 l2_writes_done 4 5560770 l2_writes_error 4 0 l2_writes_lock_retry 4 934 l2_evict_lock_retry 4 155 l2_evict_reading 4 0 l2_evict_l1cached 4 37873288 l2_free_on_write 4 109847 l2_cdata_free_on_write 4 240697 l2_abort_lowmem 4 1433510 l2_cksum_bad 4 0 l2_io_error 4 0 l2_size 4 111879885824 l2_asize 4 65552230912 l2_hdr_size 4 77157808 l2_compress_successes 4 274676202 l2_compress_zeros 4 0 l2_compress_failures 4 112201835 memory_throttle_count 4 0 duplicate_buffers 4 0 duplicate_buffers_size 4 0 duplicate_reads 4 0 memory_direct_count 4 37883421381 memory_indirect_count 4 265110432 arc_no_grow 4 1 arc_tempreserve 4 0 arc_loaned_bytes 4 0 arc_prune 4 0 arc_meta_used 4 255155464 arc_meta_limit 4 34359738368 arc_meta_max 4 1155483648 arc_meta_min 4 16777216 arc_need_free 4 0 arc_sys_free 4 3169345536

slabtop -o

 Active / Total Objects (% used)    : 63148286 / 64802795 (97.4%)
 Active / Total Slabs (% used)      : 1010571 / 1010571 (100.0%)
 Active / Total Caches (% used)     : 94 / 133 (70.7%)
 Active / Total Size (% used)       : 4208000.30K / 4568901.91K (92.1%)
 Minimum / Average / Maximum Object : 0.01K / 0.07K / 16.62K

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME 55799360 55798441 23% 0.06K 871865 64 3487460K kmalloc-64 4783885 4754196 99% 0.05K 56281 85 225124K ftrace_event_field 1465856 924743 63% 0.03K 11452 128 45808K kmalloc-32 1002197 225898 22% 0.32K 20453 49 327248K taskstats 830037 742791 89% 0.10K 21283 39 85132K blkdev_ioc 233142 96969 41% 0.09K 5551 42 22204K kmalloc-96 62304 59284 95% 0.16K 1298 48 10384K kvm_mmu_page_header 59675 58552 98% 0.29K 1085 55 17360K dmu_buf_impl_t 52020 52020 100% 0.12K 1530 34 6120K kernfs_node_cache 47936 36899 76% 0.57K 1712 28 27392K radix_tree_node 44896 17113 38% 0.12K 1403 32 5612K kmalloc-128 38760 34382 88% 0.20K 969 40 7752K vm_area_struct 37587 34458 91% 0.08K 737 51 2948K anon_vma 34062 31370 92% 0.19K 811 42 6488K dentry 26520 26010 98% 0.04K 260 102 1040K ext4_extent_status 25404 25404 100% 0.05K 348 73 1392K Acpi-Parse 20520 8939 43% 4.00K 2565 8 82080K kmalloc-4096 19312 13027 67% 0.25K 614 32 4912K kmalloc-256 18944 18944 100% 0.01K 37 512 148K kmalloc-8 16170 11886 73% 0.19K 385 42 3080K kmalloc-192 13940 13709 98% 0.61K 997 52 31904K proc_inode_cache 13888 11745 84% 0.50K 434 32 6944K kmalloc-512 13048 12837 98% 0.55K 466 28 7456K inode_cache 11264 11024 97% 0.02K 44 256 176K kmalloc-16 9789 9505 97% 0.10K 251 39 1004K buffer_head 9376 8921 95% 1.00K 293 32 9376K kmalloc-1024 9240 9240 100% 0.07K 165 56 660K ext4_io_end 7582 7548 99% 0.12K 223 34 892K jbd2_journal_head 7209 6507 90% 0.62K 144 51 4608K sock_inode_cache 5198 4986 95% 16.00K 3165 2 101280K zio_buf_16384 5082 5006 98% 0.38K 121 42 1936K mnt_cache 4998 4998 100% 0.64K 102 49 3264K shmem_inode_cache 4730 4469 94% 0.36K 115 44 1840K blkdev_requests 4482 4229 94% 0.94K 153 34 4896K mm_struct 3776 3776 100% 0.06K 59 64 236K ext4_free_data 3637 2890 79% 1.01K 167 31 5344K ext4_inode_cache 3496 3496 100% 0.09K 76 46 304K trace_event_file 3200 3200 100% 0.03K 25 128 100K jbd2_revoke_record_s 3117 3117 100% 1.06K 112 30 3584K signal_cache 2597 2597 100% 0.32K 53 49 848K request_sock_TCP 2494 2349 94% 0.27K 86 29 688K tw_sock_TCP

cat /proc/meminfo

MemTotal:       198084236 kB
MemFree:          7122860 kB
MemAvailable:     7090324 kB
Buffers:              684 kB
Cached:            238816 kB
SwapCached:         67728 kB
Active:         164673728 kB
Inactive:         9674812 kB
Active(anon):   164670364 kB
Inactive(anon):   9666648 kB
Active(file):        3364 kB
Inactive(file):      8164 kB
Unevictable:        32552 kB
Mlocked:            32552 kB
SwapTotal:        2961476 kB
SwapFree:               0 kB
Dirty:                 20 kB
Writeback:              0 kB
AnonPages:      174077316 kB
Mapped:             58812 kB
Shmem:             221976 kB
Slab:             4585124 kB
SReclaimable:       64120 kB
SUnreclaim:       4521004 kB
KernelStack:        23312 kB
PageTables:        517800 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:     99042116 kB
Committed_AS:   423293920 kB
VmallocTotal:   34359738367 kB
VmallocUsed:            0 kB
VmallocChunk:           0 kB
HardwareCorrupted:      0 kB
AnonHugePages:          0 kB
CmaTotal:               0 kB
CmaFree:                0 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
DirectMap4k:       170108 kB
DirectMap2M:      4014080 kB
DirectMap1G:    198180864 kB

iotop -oP -d 5

Total DISK READ : 6.55 M/s | Total DISK WRITE : 508.57 K/s Actual DISK READ: 5.09 M/s | Actual DISK WRITE: 9.46 M/s PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 2437 be/4 root 0.00 B/s 0.00 B/s 0.00 % 54.41 % [txg_sync] 23416 be/4 root 0.00 B/s 0.00 B/s 0.00 % 6.89 % [kworker/u65:1] 14613 be/4 root 0.00 B/s 0.00 B/s 0.00 % 5.58 % [kworker/u65:3] 25144 be/4 root 0.00 B/s 122.68 K/s 0.00 % 3.34 % kvm -id xyz 977 be/4 root 0.00 B/s 0.00 B/s 0.00 % 2.20 % [l2arc_feed] 14098 be/4 root 0.00 B/s 4.66 K/s 0.00 % 0.66 % kvm -id xyz 436 ?dif root 0.00 B/s 3.11 K/s 0.00 % 0.29 % kvm -id xyz 26300 be/4 root 2.58 M/s 135.88 K/s 0.00 % 0.28 % kvm -id xyz 16659 be/4 root 0.00 B/s 13.20 K/s 0.00 % 0.25 % kvm -id xyz 21746 be/4 root 0.00 B/s 44.26 K/s 0.00 % 0.21 % kvm -id xyz 24411 be/4 root 107.93 K/s 0.00 B/s 0.00 % 0.19 % swapoff -a 6456 be/4 root 2.91 K/s 13.98 K/s 0.00 % 0.18 % kvm -id xyz 22646 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.13 % [kworker/u66:4] 2839 be/4 root 77.64 K/s 0.00 B/s 0.00 % 0.11 % systemd-logind 25610 be/4 root 0.00 B/s 795.08 B/s 0.00 % 0.06 % kvm -id xyz 30643 be/4 root 0.00 B/s 13.98 K/s 0.00 % 0.04 % kvm -id xyz 13458 be/4 root 12.42 K/s 6.99 K/s 0.00 % 0.02 % kvm -id xyz 9283 be/4 root 1391.39 B/s 3.88 K/s 0.00 % 0.02 % kvm -id xyz 21607 be/4 root 0.00 B/s 3.11 K/s 0.00 % 0.02 % kvm -id xyz 9421 be/4 root 795.08 B/s 0.00 B/s 0.00 % 0.01 % [kworker/u66:3] 2870 be/4 root 33.39 K/s 13.20 K/s 0.00 % 0.01 % rrdcached -l unix:/var/run/rrdcached.sock -j /var/lib/rrdcached~urnal/ -F -b /var/lib/rrdcached/db/ -B -p /var/run/rrdcached.pid 16837 be/4 root 37.27 K/s 11.65 K/s 0.00 % 0.01 % kvm -id xyz 431 be/3 root 0.00 B/s 7.76 K/s 0.00 % 0.00 % [jbd2/dm-0-8] 23976 be/4 root 993.85 B/s 12.42 K/s 0.00 % 0.00 % kvm -id xyz 31862 be/4 root 3.30 K/s 20.19 K/s 0.00 % 0.00 % kvm -id xyz 25803 be/4 root 3.30 K/s 5.44 K/s 0.00 % 0.00 % kvm -id xyz 1455 ?dif root 795.08 B/s 0.00 B/s 0.00 % 0.00 % dmeventd -f 2975 be/4 root 2.33 K/s 6.99 K/s 0.00 % 0.00 % pmxcfs 2048 be/0 root 24.85 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_7] 446 be/4 root 0.00 B/s 6.99 K/s 0.00 % 0.00 % kvm -id xyz 1995 be/0 root 70.66 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_3] 27324 be/4 root 0.00 B/s 42.70 K/s 0.00 % 0.00 % kvm -id xyz 2043 be/0 root 49.69 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_7] 2047 be/0 root 43.48 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_7] 2007 be/0 root 24.85 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_4] 2046 be/0 root 49.69 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_7] 2021 be/0 root 24.85 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_5] 14899 be/4 root 0.00 B/s 4.66 K/s 0.00 % 0.00 % kvm -id xyz 2041 be/0 root 24.85 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_7] 1958 be/0 root 24.85 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_0] 23996 be/4 root 0.00 B/s 795.08 B/s 0.00 % 0.00 % kvm -id xyz 1977 be/0 root 25.62 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_2] 2039 be/0 root 55.90 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_7] 26719 be/4 root 0.00 B/s 3.11 K/s 0.00 % 0.00 % kvm -id xyz 2033 be/0 root 54.35 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_6] 1985 be/0 root 24.85 K/s 0.00 B/s 0.00 % 0.00 % [z_rd_int_2]

iostat -mx
12/15/2016      _x86_64_        (24 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          36.64    0.00   12.60    3.40    0.00   47.35

Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.19    1.69   5.23   1.98   0.03   0.02    13.43     0.11  15.82    0.61   55.89   0.34   0.24
sdb        0.00    0.00   4.67  66.70   0.13   2.08    63.46     0.53   7.36   27.49    5.95   3.91  27.90
sdg        0.00    0.00   4.74  65.54   0.13   2.07    64.09     0.57   8.07   29.52    6.52   4.21  29.57
sdh        0.00    0.00   4.65  66.67   0.13   2.08    63.48     0.50   6.99   25.78    5.68   3.79  27.02
sdf        0.00    0.00   4.52  65.51   0.12   2.07    64.02     0.03   0.46   35.28    7.37   4.58  32.08
sdd        0.00    0.00   5.04  66.57   0.14   2.09    63.78     0.52   7.22   28.58    5.60   3.83  27.45
sde        0.00    0.00   4.26  66.49   0.11   2.09    63.50     0.35   4.93   50.61    2.00   6.18  43.74
sdc        0.00    0.00  11.77  36.49   0.43   3.42   163.24     0.04   0.90    0.65    0.98   0.83   4.01
dm-0       0.00    0.00   0.42   2.61   0.02   0.02    21.48     0.19  63.56    4.24   73.11   0.20   0.06
dm-1       0.00    0.00   0.70   0.68   0.00   0.00     8.00     0.54 391.79    1.57  798.62   0.52   0.07
dm-2       0.00    0.00   0.08   0.40   0.00   0.00     8.00     0.00   0.66    0.29    0.74   0.66   0.03

(The Linux swap activity caused the high wait on dm-1; that has already been fixed.)

zpool iostat -v 5 5

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
vz          1.80T   944G     15    246   700K  5.85M
  mirror     613G   315G      5     82   234K  1.95M
    sdd         -      -      2     60   140K  2.09M
    sde         -      -      2     60   102K  2.09M
  mirror     613G   315G      5     82   233K  1.95M
    sdb         -      -      2     60   121K  2.08M
    sdh         -      -      2     60   120K  2.08M
  mirror     613G   315G      5     81   233K  1.94M
    sdg         -      -      2     59   125K  2.07M
    sdf         -      -      2     59   115K  2.07M
cache           -      -      -      -      -      -
  sdc        59.6G  52.2G      9     36   428K  3.42M
----------  -----  -----  -----  -----  -----  -----

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
vz          1.80T   944G      0    182  5.10K  3.25M
  mirror     613G   315G      0     55  5.10K  1000K
    sdd         -      -      0     59  5.60K  1.22M
    sde         -      -      0     55      0  1.07M
  mirror     613G   315G      0     68      0  1.20M
    sdb         -      -      0     37      0  1.31M
    sdh         -      -      0     40      0  1.31M
  mirror     613G   315G      0     57      0  1.07M
    sdg         -      -      0     58      0  1.17M
    sdf         -      -      0     58      0  1.17M
cache           -      -      -      -      -      -
  sdc        59.6G  52.2G      1     25  33.8K  1.81M
----------  -----  -----  -----  -----  -----  -----

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
vz          1.80T   944G      0    245      0  2.19M
  mirror     613G   315G      0     83      0   806K
    sdd         -      -      0     81      0   783K
    sde         -      -      0     84      0   936K
  mirror     613G   315G      0     86      0   800K
    sdb         -      -      0     60      0   944K
    sdh         -      -      0     54      0   944K
  mirror     613G   315G      0     75      0   639K
    sdg         -      -      0     77      0   758K
    sdf         -      -      0     77      0   758K
cache           -      -      -      -      -      -
  sdc        59.6G  52.2G      0     27  12.5K  1.42M
----------  -----  -----  -----  -----  -----  -----

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
vz          1.80T   944G      0    238  5.10K  5.04M
  mirror     613G   315G      0     62  2.80K  1.61M
    sdd         -      -      0     66      0  1.72M
    sde         -      -      0     62  3.20K  1.70M
  mirror     613G   315G      0     91      0  2.00M
    sdb         -      -      0     50      0  2.14M
    sdh         -      -      0     48      0  2.14M
  mirror     613G   315G      0     84  2.30K  1.43M
    sdg         -      -      0     83  2.40K  1.57M
    sdf         -      -      0     84      0  1.57M
cache           -      -      -      -      -      -
  sdc        59.6G  52.2G      1     39  33.4K  2.24M
----------  -----  -----  -----  -----  -----  -----

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
vz          1.80T   944G      0    303    102  7.73M
  mirror     613G   315G      0     94      0  2.44M
    sdd         -      -      0     90      0  2.57M
    sde         -      -      0     94      0  2.59M
  mirror     613G   315G      0    106      0  2.56M
    sdb         -      -      0     53      0  2.73M
    sdh         -      -      0     56      0  2.73M
  mirror     613G   315G      0    102    102  2.73M
    sdg         -      -      0    102      0  2.89M
    sdf         -      -      0    101    819  2.89M
cache           -      -      -      -      -      -
  sdc        59.6G  52.2G      0     18  19.2K  1.39M
----------  -----  -----  -----  -----  -----  -----


I'd be really grateful for any help. Thanks a lot in advance!

kernelOfTruth commented 7 years ago

looks more like a mailing list or IRC question ...

anyway:

/sys/module/zfs/parameters/zfs_txg_timeout has a default of 5 (5 seconds),

try either lowering or raising that value (the higher it is, the more data you can potentially lose on a crash)
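
for reference, a rough sketch of inspecting and changing it at runtime (10 is only an example value, not a recommendation):

    # current txg sync interval in seconds (default is 5)
    cat /sys/module/zfs/parameters/zfs_txg_timeout
    # change it at runtime, e.g. to 10 seconds (example value)
    echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout
    # make the setting persistent across reboots (module options file; path may vary by distro)
    echo "options zfs zfs_txg_timeout=10" >> /etc/modprobe.d/zfs.conf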

make sure that queue/nr_requests is set to at least 2x the queue depth of your controller (some guides suggest 32768 for very busy servers)
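
as a sketch, for one of the pool disks (sdb is taken from your iostat output above; 4096 is only an example value, repeat per disk):

    # current block-layer queue size for sdb
    cat /sys/block/sdb/queue/nr_requests
    # raise it to at least 2x the controller queue depth (example value)
    echo 4096 > /sys/block/sdb/queue/nr_requests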

Try lowering ARC size to ~ 40-45% of RAM size,

also update the zfs and spl modules to the latest version.
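
for the ARC suggestion, a minimal sketch using the zfs_arc_max module parameter (34359738368 bytes = 32 GiB is purely an example figure, not the 40-45% target itself):

    # current ARC target and limits, in bytes
    grep -E "^(c|c_min|c_max|size) " /proc/spl/kstat/zfs/arcstats
    # cap the ARC at 32 GiB at runtime (example value)
    echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max
    # make it persistent across reboots (path may vary by distro)
    echo "options zfs zfs_arc_max=34359738368" >> /etc/modprobe.d/zfs.conf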

Do you observe only high I/O, or also high CPU load and latency?

Is your swap really full? You need a bigger one ;)
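
(your meminfo above does show SwapFree: 0 kB; a quick way to check is something like:)

    # show swap devices and how much of each is in use
    cat /proc/swaps
    free -m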

Raise /proc/sys/vm/page-cluster to 12 or so

also, if you're not using other filesystems much, set /proc/sys/vm/vfs_cache_pressure to 100000 (range 1000 to 100000 or more)

Take a look at /proc/sys/vm/min_free_kbytes and /proc/sys/vm/mmap_min_addr and at what they can achieve in terms of stability, performance and operation of the system
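
concretely, a rough sketch of the vm knobs above (the numbers are just the ones suggested in this thread, not tuned recommendations):

    # read/write swap in larger chunks (2^12 pages per swap I/O)
    echo 12 > /proc/sys/vm/page-cluster
    # make the kernel reclaim dentry/inode caches more aggressively
    echo 100000 > /proc/sys/vm/vfs_cache_pressure
    # inspect these before changing anything; raising min_free_kbytes keeps a larger free-memory reserve
    cat /proc/sys/vm/min_free_kbytes /proc/sys/vm/mmap_min_addr
    # the same values can also be set via sysctl and persisted in /etc/sysctl.conf
    sysctl -w vm.page-cluster=12 vm.vfs_cache_pressure=100000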

tsch157 commented 7 years ago

Try lowering ARC size to ~ 40-45% of RAM size

The system has 192GB of RAM, so the ARC is already much below the RAM size.

Do you observe only high I/O, or also high CPU load and latency?

The CPU load of the system is at about 70%; that load is caused by the VMs and their processes, and I see no issue there, it's legitimate. The load average can sometimes rise close to 50.00, although the CPU is not fully utilized and I cannot see any processes in htop causing extraordinarily high I/O rates. Only some VMs show higher I/O values once in a while.

top: %Cpu(s): 39.0 us, 30.8 sy, 0.0 ni, 28.8 id, 1.4 wa, 0.0 hi, 0.1 si, 0.0 s

Is your swap really full? You need a bigger one ;)

The swap for the ZFS system is simply an SSD; it's not full.

Raise /proc/sys/vm/page-cluster to 12 or so

also, if you're not using other filesystems much, set /proc/sys/vm/vfs_cache_pressure to 100000 (range 1000 to 100000 or more)

Take a look at /proc/sys/vm/min_free_kbytes and /proc/sys/vm/mmap_min_addr and at what they can achieve in terms of stability, performance and operation of the system

Could you explain to me what these commands do exactly? I can't say I'm deep into ZFS and its configuration. I just think the host system is quite likely not configured well enough for its use. It's a node for VMs running on a RAID 10 with 192GB of RAM, and it should use the SSD for ZFS caching. The node is already pretty full, but the I/O problems still seem massive relative to the number of VMs running on the machine. We have a node with a simple RAID 1 without ZFS and double the number of VMs (although those VMs are much smaller), and on that system there are absolutely no I/O problems.

Greetings!

kernelOfTruth commented 7 years ago

These are general tweaks to reduce memory pressure, prevent (or cause) OOM, allow faster write-out and reading back of swap data, etc.

that's what I meant by

looks more like a mailing list or IRC question ...

folks over there might be able to help you more specifically

shr3k commented 7 years ago

I have a similar problem: after 24 hours, txg_sync slowly starts to dominate iotop on an almost unloaded node (96 GB RAM, max ARC 16 GB). It is always on the 1st mirror of the RAID10. L2ARC/ZIL are on NVMe. Kernel 4.4.35-2-pve, ZFS 0.6.5.8.

zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
local-zfs  1.81T  94.3G  1.72T         -     2%     5%  1.00x  ONLINE  -
  mirror   928G  47.1G   881G         -     2%     5%
    wwn-0x50014ee65c62b227      -      -      -         -      -      -
    wwn-0x50014ee6070d613b      -      -      -         -      -      -
  mirror   928G  47.2G   881G         -     2%     5%
    wwn-0x50014ee65c62b89e      -      -      -         -      -      -
    wwn-0x50014ee6b1b7ffb0      -      -      -         -      -      -
log      -      -      -         -      -      -
  nvme0n1p1  7.94G  3.50M  7.93G         -    56%     0%
cache      -      -      -         -      -      -
  nvme0n1p2   256G  4.16G   252G         -     0%     1%
DDP74 commented 6 years ago

I have the same problem on pve 5.0-32 with zfs 0.6.5.11-pve18, with L2ARC/ZIL on NVMe. Max ARC 32 GB, L2ARC 120 GB, ZIL 20 GB. Any workaround?

shr3k commented 6 years ago

Some of the load could be from a running scrub, but occasionally I'm dealing with:
https://forum.proxmox.com/threads/frequent-cpu-stalls-in-kvm-guests-during-high-io-on-host.30702/
https://forum.proxmox.com/threads/3000-msec-ping-and-packet-drops-with-virtio-under-load.36687/
https://forum.proxmox.com/threads/kvm-guests-freeze-hung-tasks-during-backup-restore-migrate.34362/