> Enable dedup

That would be my guess as the cause of your problem. The moment the DDT doesn't fit into ARC (which it clearly doesn't in your case) performance will completely tank: the system will be busy reading DDT entries from disk (eating the pool's random IOPS, of which you have maybe 400~600 in total given the random-read characteristics of raidz vdevs) while at the same time evicting (dbuf_evict) other DDT entries (that likely get read back soon thereafter) to make room in ARC for the DDT reads. It can only go downhill from there.
Dedup is a nice idea and works reasonably well when you have more than enough memory to fully cache the DDT, on pure-SSD pools, or (hopefully, once metadata allocation classes become available) on pools that have dedicated metadata vdevs backed by fast devices - but on a pure spinning-rust pool (especially one made of raidz vdevs) it's the most effective recipe for bringing ZFS to a crawl.
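As an aside, you can check how large the DDT actually is with the built-in tools ("tank" below is a placeholder pool name):

# dedup table summary (entry counts and sizes) for the pool
zpool status -D tank
# more detailed DDT statistics, including a reference-count histogram
zdb -DD tank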
Doesn't explain though why zfs is that busy in arc_buf_destroy_impl.
Some questions:
# cat /etc/modprobe.d/zfs.conf
options zfs metaslab_debug_unload=1
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_arc_max=101239182336
options zfs zfs_dirty_data_max=13498557645
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
options zfs zfs_vdev_async_write_min_active=16
options zfs zfs_vdev_async_write_max_active=32
options zfs zfs_vdev_sync_write_min_active=32
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_max_active=2000
The machine has 128 GiB of memory. zfs_arc_max = 101,239,182,336 (about 3/4 of the system memory). zfs_arc_meta_limit_percent is 75, so the DDT can use 75% of the ARC memory, which is ~71 GiB. With 6.9 TB allocated, assuming an average block size of around 64 kB, that is 107,812,500 blocks, and with 320 bytes per block the DDT should have been around ~32 GiB, so I expected it would all be in memory (at least for now).
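As a back-of-the-envelope check of those numbers (plain shell arithmetic, decimal TB/kB as assumed above):

# 6.9 TB allocated at an assumed ~64 kB average block size:
echo $(( 6900000000000 / 64000 ))       # = 107812500 blocks
# at ~320 bytes of DDT entry per block:
echo $(( 107812500 * 320 / 1024**3 ))   # = 32 (GiB of DDT)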
All screenshots are from about the same time. Yes, rsync was also running in the background when the rollover was occurring. That would have been generating additional reads, as the system it was syncing had already been synced.
Sorry, I misparsed the 13182185+ in the top output as ~12 GB of memory for the machine (was counting the places but overlooked the '+') - you're totally right, with an order of magnitude more RAM the DDT issue should be less severe.
Nevertheless, it might be that the metadata reads for rsync/find/... displace DDT from ARC.
Could you please post the output of arcstat.py -f time,read,miss,miss%,dread,dmis,dm%,pread,pmis,pm%,mread,mmis,mm%,arcsz,c,mfu,mru,mfug,mrug,eskip,mtxmis 1 30
while the system is in such a state? Might be helpful...
Looks like having rsync running and writing new files is relevant to reproducing the problem. When only the cp is running, dbuf_evict isn't consuming a bunch of CPU.
# arcstat.py -f time,read,miss,miss%,dread,dmis,dm%,pread,pmis,pm%,mread,mmis,mm%,arcsz,c,mfu,mru,mfug,mrug,eskip,mtxmis 1 30
time read miss miss% dread dmis dm% pread pmis pm% mread mmis mm% arcsz c mfu mru mfug mrug eskip mtxmis
13:28:52 1.2K 181 14 1.2K 181 14 0 0 0 1.2K 181 15 54G 54G 1.0K 0 0 0 0 0
13:28:53 141K 17K 12 139K 17K 12 2.2K 388 17 138K 17K 12 54G 54G 123K 102 395 0 0 0
13:28:54 88K 11K 13 86K 11K 13 1.5K 236 16 86K 11K 13 54G 54G 75K 968 249 304 0 0
13:28:55 74K 10K 13 74K 9.9K 13 452 183 40 72K 10K 13 54G 54G 64K 319 72 121 0 0
13:28:56 143K 19K 13 141K 18K 13 2.2K 368 16 139K 19K 13 54G 54G 124K 339 376 107 0 0
13:28:57 19K 2.6K 13 19K 2.6K 13 0 0 0 18K 2.6K 13 54G 54G 15K 1.5K 2 198 0 0
13:28:58 131K 16K 12 129K 16K 12 2.2K 375 16 128K 16K 13 54G 54G 114K 141 384 4 0 0
13:28:59 86K 11K 13 85K 11K 13 974 229 23 84K 11K 14 54G 54G 71K 2.4K 136 415 0 0
13:29:00 90K 11K 12 89K 11K 12 799 138 17 88K 11K 12 54G 54G 79K 136 134 7 0 0
13:29:01 143K 18K 13 141K 18K 12 1.3K 364 27 139K 18K 13 54G 54G 124K 491 282 260 0 0
13:29:02 32K 4.3K 13 31K 4.2K 13 767 168 21 31K 4.3K 13 54G 54G 26K 1.6K 131 159 0 0
13:29:03 143K 18K 12 142K 17K 12 1.4K 254 18 140K 18K 12 53G 53G 125K 112 260 0 0 4
13:29:04 65K 8.6K 13 64K 8.4K 13 1.5K 228 15 63K 8.6K 13 53G 53G 55K 878 234 311 0 0
13:29:05 30K 3.7K 12 30K 3.7K 12 0 0 0 30K 3.7K 12 53G 53G 26K 103 0 0 0 0
13:29:06 138K 18K 13 136K 17K 13 2.2K 381 17 135K 18K 13 53G 53G 120K 422 389 158 0 0
13:29:07 113K 14K 12 112K 14K 12 1.2K 270 22 111K 14K 13 53G 53G 95K 3.7K 181 266 0 0
13:29:08 164K 20K 12 161K 20K 12 2.2K 370 16 160K 20K 12 53G 53G 143K 74 378 0 0 0
13:29:09 70K 9.8K 13 69K 9.7K 13 723 137 18 68K 9.8K 14 53G 54G 59K 1.6K 146 287 0 0
13:29:10 153K 19K 12 151K 19K 12 2.0K 361 17 149K 19K 12 53G 54G 134K 91 336 30 0 0
13:29:11 76K 10K 13 75K 10K 13 1.3K 287 21 74K 10K 13 54G 54G 65K 959 211 399 0 0
13:29:12 31K 4.4K 13 31K 4.4K 13 26 26 100 30K 4.4K 14 53G 54G 26K 150 28 0 0 0
13:29:13 156K 19K 12 154K 19K 12 1.9K 426 22 152K 19K 13 54G 54G 136K 79 432 0 0 0
13:29:14 39K 5.2K 13 38K 5.1K 13 690 6 0 38K 5.2K 13 54G 54G 33K 1.4K 14 309 0 0
13:29:15 122K 16K 13 120K 15K 13 1.7K 423 25 119K 16K 13 53G 54G 105K 141 431 0 0 0
13:29:16 94K 12K 13 92K 12K 13 1.8K 630 34 91K 12K 13 54G 54G 78K 2.4K 643 301 0 0
13:29:17 113K 17K 15 111K 17K 15 1.5K 138 9 109K 17K 15 53G 54G 95K 193 146 0 0 0
13:29:18 123K 20K 16 121K 19K 16 2.2K 610 28 119K 20K 16 54G 54G 102K 463 534 253 0 0
13:29:19 32K 3.9K 11 31K 3.8K 11 737 118 16 31K 3.9K 12 54G 54G 27K 1.3K 122 117 0 0
13:29:20 134K 17K 12 132K 16K 12 2.6K 588 22 131K 17K 12 54G 54G 117K 414 592 0 0 0
13:29:21 70K 11K 15 69K 11K 15 763 1 0 68K 11K 16 54G 54G 57K 2.0K 12 262 0 0
According to https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_arc_max the arcstat.py 'c' column should reflect zfs_arc_max - but why isn't it constant (showing both 54GB and 53GB over the run)?
So either the zfs_arc_max=101239182336 parameter wasn't set for some reason, it was changed (initially and twice mid-run), or something really strange is happening here... could you please check with cat /sys/module/zfs/parameters/zfs_arc_max what value is actually configured for that parameter?
# cat /sys/module/zfs/parameters/zfs_arc_max
101239182336
It was set by the module load parameters at boot. I certainly didn't change it while arcstat.py was running and I don't know of any other process that would have.
@cvoltz Wanted to make sure that the intended value is actually set (I have been bitten by an initramfs getting in the way of this a while ago). I didn't mean to imply that you changed zfs_arc_max, but was just trying to enumerate how arc_c could be lower than zfs_arc_max when massive amounts of free memory are available (I took it that something similar to the 70GB+ from the top screenshot was available while the arcstat.py output was taken, thus no memory pressure should exist that would keep arc_c from growing).
As far as I understand the code I would expect [dbuf_evict] to spin often under a metadata-heavy load like rsync - though I can't yet explain why it's consuming that much CPU.
Stumbled over another issue where dbuf_evict took 100% CPU and locked the machine: https://github.com/zfsonlinux/zfs/issues/7781. No idea why that was closed (unless enabling dedup these days is considered a configuration issue, which I think would be a strange stance).
Regarding arc_c being way below zfs_arc_max when free memory is available, I have a theory:
As far as I can see, arc_c can only increase while arc_no_grow is unset, else the check at https://github.com/zfsonlinux/zfs/blob/0405eeea6a8d142171ce352d3dac5f08cd6f70f0/module/zfs/arc.c#L5367-L5368 some lines above will prevent arc_adapt() from reaching the section that can do the increase. And arc_no_grow is kept set by arc_reclaim until it has completed an idle loop and arc_grow_retry has elapsed. So under high load (which keeps it busy, as it must remove entries from an ARC that currently can't grow) it could well be [arc_reclaim] that is keeping the ARC under imagined memory pressure for as long as the load lasts, even with plenty of free memory already being available again.
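In pseudo-C, the gating described above looks roughly like this (a condensed paraphrase of the 0.7-era arc.c, not the verbatim upstream code):

/* paraphrase: arc_adapt() bails out before its growth logic whenever
 * reclaim is needed or arc_no_grow is still set by arc_reclaim */
static void
arc_adapt(int bytes, arc_state_t *state)
{
        /* ... MRU/MFU target (arc_p) adjustments elided ... */

        if (arc_reclaim_needed())
                return;                 /* the check linked above */

        if (arc_no_grow)                /* only cleared by arc_reclaim */
                return;                 /* after an idle arc_grow_retry */

        /* only reached when neither gate fired: arc_c may grow here */
        if (arc_size >= arc_c && arc_c < arc_c_max)
                arc_c = MIN(arc_c + bytes, arc_c_max);
}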
This could be a reason why a busy system cuts its ARC needlessly short; in your case it is likely made worse by the DDT getting displaced from a shrunk ARC by the metadata load created by rsync/find.
In case you're as curious as me as to whether that's a correct assumption: I have added grow, free, need columns to arcstat.py. It could be interesting to start that (just add ,grow,free,need to the -f parameter list; see the full invocation below) after a reboot (before a pool is initially imported) and then (in another shell) import the pool, followed by the stress test you did before.
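With those columns added, the full invocation from above becomes:

arcstat.py -f time,read,miss,miss%,dread,dmis,dm%,pread,pmis,pm%,mread,mmis,mm%,arcsz,c,mfu,mru,mfug,mrug,eskip,mtxmis,grow,free,need 1 30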
There is also something in arc_shrink that looks strange to me; it was introduced 5 May 2017 by https://github.com/zfsonlinux/zfs/commit/37fb3e431845b934df9771d7bcca5fbef79f4c1e
https://github.com/zfsonlinux/zfs/blob/0405eeea6a8d142171ce352d3dac5f08cd6f70f0/module/zfs/arc.c#L4759-L4766
It not only shrinks arc_c by to_free but also cuts it down to the current arc_size (no idea if arc_shrink() could be called outside real external memory pressure though). I would just expect the clamp to be prior to the arc_c = c - to_free; two lines up.
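Paraphrased (not the verbatim code), the linked fragment does:

/* paraphrase of the linked arc_shrink() lines */
if (c > to_free && c - to_free > arc_c_min) {
        arc_c = c - to_free;            /* shrink the target by to_free */
        atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
        /* ...and additionally clamp it down to the current arc_size: */
        if (arc_size < arc_c)
                arc_c = MAX(arc_size, arc_c_min);
}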
@ahrens or @behlendorf: could you please take a look if that clamping in arc_shrink (last two lines of the quote above) makes sense in that location?
Well, things seem to have gone from bad to worse here. I renamed the deduped dataset to backup/dedup-snapshots and created another dataset (without dedup) called backup/snapshots, then ran rsync to copy the files from backup/dedup-snapshots to backup/snapshots, with the intention of eventually removing backup/dedup-snapshots and using the system without dedup so I wouldn't have to worry about this problem. After the rsync had copied over a few TB worth of data, since it was taking so long, I decided I could just sync a subset instead, so I stopped it and started rsync again specifying only the subset. While that was running, I ran

ls -d /backup/snapshots/hourly.* | xargs --max-procs=0 --max-args=1 --replace=@ find @ -delete

to remove the files that had already been synced which I decided I didn't need. After a while, systemd-logger was eating a ton of CPU and the system eventually became completely unresponsive. After the system was rebooted, I was able to find this repeated many times in the system log:
Sep 19 15:30:43 buri kernel: WARNING: CPU: 20 PID: 2092 at lib/list_debug.c:62 __list_del_entry+0x82/0xd0
Sep 19 15:30:43 buri kernel: list_del corruption. next->prev should be ffffa07f685637f0, but was dead000000000200
Sep 19 15:30:43 buri kernel: Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass zfs(POE) crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel lrw gf128mul glue_helper ablk_helper cryptd zunicode(POE) zavl(POE) icp(POE) pcspkr zcommon(POE) znvpair(POE) spl(OE) lpc_ich hpilo hpwdt joydev ipmi_ssif ipmi_si sg ipmi_devintf
Sep 19 15:30:43 buri kernel: ipmi_msghandler acpi_power_meter wmi ioatdma shpchp pcc_cpufreq ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif crct10dif_generic uas usb_storage mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe crct10dif_pclmul crct10dif_common tg3 crc32c_intel serio_raw hpsa mdio i2c_core dca ptp scsi_transport_sas pps_core dm_mirror dm_region_hash dm_log dm_mod
Sep 19 15:30:43 buri kernel: CPU: 20 PID: 2092 Comm: dp_sync_taskq Kdump: loaded Tainted: P OE ------------ 3.10.0-862.11.6.el7.x86_64 #1
Sep 19 15:30:43 buri kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 01/22/2018
Sep 19 15:30:43 buri kernel: Call Trace:
Sep 19 15:30:43 buri kernel: [<ffffffffaa5135d4>] dump_stack+0x19/0x1b
Sep 19 15:30:43 buri kernel: [<ffffffffa9e945d8>] __warn+0xd8/0x100
Sep 19 15:30:43 buri kernel: [<ffffffffa9e9465f>] warn_slowpath_fmt+0x5f/0x80
Sep 19 15:30:43 buri kernel: [<ffffffffaa16cf22>] __list_del_entry+0x82/0xd0
Sep 19 15:30:43 buri kernel: [<ffffffffaa16cf7d>] list_del+0xd/0x30
Sep 19 15:30:43 buri kernel: [<ffffffffc09ca3d5>] multilist_sublist_remove+0x15/0x20 [zfs]
Sep 19 15:30:43 buri kernel: [<ffffffffc0993094>] userquota_updates_task+0x1d4/0x580 [zfs]
Sep 19 15:30:43 buri kernel: [<ffffffffc09902f0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
Sep 19 15:30:43 buri kernel: [<ffffffffc09902f0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
Sep 19 15:30:43 buri kernel: [<ffffffffaa16cf7d>] ? list_del+0xd/0x30
Sep 19 15:30:43 buri kernel: [<ffffffffc04ecd2c>] taskq_thread+0x2ac/0x4f0 [spl]
Sep 19 15:30:43 buri kernel: [<ffffffffa9ed1fe0>] ? wake_up_state+0x20/0x20
Sep 19 15:30:43 buri kernel: [<ffffffffc04eca80>] ? taskq_thread_spawn+0x60/0x60 [spl]
Sep 19 15:30:43 buri kernel: [<ffffffffa9ebdf21>] kthread+0xd1/0xe0
Sep 19 15:30:43 buri kernel: [<ffffffffa9ebde50>] ? insert_kthread_work+0x40/0x40
Sep 19 15:30:43 buri kernel: [<ffffffffaa5255f7>] ret_from_fork_nospec_begin+0x21/0x21
Sep 19 15:30:43 buri kernel: [<ffffffffa9ebde50>] ? insert_kthread_work+0x40/0x40
Sep 19 15:30:43 buri kernel: ---[ end trace fb2fe7f782509b51 ]---
Apparently, this got generated so many times that systemd-logger started missing kernel messages and, presumably, that was why it was taking so much CPU.
The failure happened several times so I started looking into reproducing it, but I can't anymore since the files were finally deleted. I did keep the deduped dataset (for now) so I could probably reproduce the problem if needed, but the goal was to set up the system as a backup server, not to do testing.
I think the takeaway here is that there is still some work to do with regard to the handling of small files, with and without dedup, under heavy load (many relatively small files being created, deleted, or having their attributes read).
@cvoltz Can you file a separate issue for the list_del corruption?
@ahrens https://github.com/zfsonlinux/zfs/issues/7933 seems to be the same list_del corruption bug.
@ahrens Yes, I can, but since @loli10K found a similar existing issue (granted, without much information), do you want me to create a separate issue or append the information above to #7933?
Let's append to the existing bug, thanks for locating that @loli10K
@GregorKopka I definitely see the ARC target size changing on the system when running just rsync on the dataset without deduplication, despite memory being available:
# free -g
total used free shared buff/cache available
Mem: 125 67 24 0 33 24
Swap: 7 0 7
# arcstat.py 1
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
11:30:51 35K 3.5K 9 3.3K 9 249 32 3.0K 8 1.1G 94G
...
13:28:09 1.2K 18 1 18 1 0 0 18 1 93G 93G
...
13:28:15 1.3K 54 4 54 4 0 0 54 4 90G 90G
...
13:28:29 1.8K 97 5 97 5 0 0 97 5 91G 91G
...
13:28:40 1.4K 87 6 87 6 0 0 87 6 92G 92G
...
13:29:00 2.0K 66 3 66 3 0 0 66 3 90G 90G
13:29:01 973 48 4 48 4 0 0 48 4 90G 90G
13:29:02 2.0K 51 2 51 2 0 0 51 2 89G 89G
...
13:29:06 2.0K 90 4 90 4 0 0 90 4 89G 90G
...
13:29:18 1.9K 0 0 0 0 0 0 0 0 91G 91G
...
13:29:38 2.0K 25 1 25 1 0 0 25 1 88G 88G
...
13:51:59 39K 6.6K 16 6.6K 16 0 0 6.6K 16 84G 84G
I tried running your modified arcstat.py as you requested, but using just the load from rsync. You can see the ARC target size changing, but the new columns don't change (and it looks like you need to fix a bug in your alignment):
# ./arcstat.py -f time,read,miss,miss%,dread,dmis,dm%,pread,pmis,pm%,mread,mmis,mm%,arcsz,c,mfu,mru,mfug,mrug,eskip,mtxmis,grow,free,need 1 30
time read miss miss% dread dmis dm% pread pmis pm% mread mmis mm% arcsz c mfu mru mfug mrug eskip mtxmis grow free need
14:04:41 24 0 0 24 0 0 0 0 0 24 0 0 81G 82G 24 0 0 0 0 0 0 2.0G 0
14:04:42 1.7K 46 2 1.7K 46 2 0 0 0 1.7K 46 2 82G 82G 1.2K 454 0 0 0 0 0 2.0G 0
14:04:43 1.9K 22 1 1.9K 22 1 0 0 0 1.9K 22 1 82G 82G 1.7K 201 0 0 0 0 0 2.0G 0
14:04:44 1.3K 14 1 1.3K 14 1 0 0 0 1.3K 14 1 82G 82G 916 406 0 0 0 0 0 2.0G 0
14:04:45 2.0K 26 1 2.0K 26 1 0 0 0 2.0K 26 1 82G 82G 1.2K 763 0 0 0 0 0 2.0G 0
14:04:46 1.7K 0 0 1.7K 0 0 0 0 0 1.7K 0 0 82G 82G 1.1K 606 0 0 0 0 0 2.0G 0
14:04:47 1.7K 22 1 1.7K 22 1 0 0 0 1.7K 22 1 82G 82G 1.4K 292 0 0 0 0 0 2.0G 0
14:04:48 1.8K 0 0 1.8K 0 0 0 0 0 1.8K 0 0 82G 82G 1.1K 664 0 0 0 0 0 2.0G 0
14:04:49 2.0K 22 1 2.0K 22 1 0 0 0 2.0K 22 1 82G 82G 1.4K 608 0 0 0 0 0 2.0G 0
14:04:50 1.6K 0 0 1.6K 0 0 0 0 0 1.6K 0 0 82G 83G 1.2K 392 0 0 0 0 0 2.0G 0
14:04:51 1.9K 22 1 1.9K 22 1 0 0 0 1.9K 22 1 83G 83G 1.1K 736 0 0 0 0 0 2.0G 0
14:04:52 1.5K 0 0 1.5K 0 0 0 0 0 1.5K 0 0 83G 83G 1.2K 298 0 0 0 0 0 2.0G 0
14:04:53 1.7K 0 0 1.7K 0 0 0 0 0 1.7K 0 0 83G 83G 1.1K 612 0 0 0 0 0 2.0G 0
14:04:54 1.9K 26 1 1.9K 26 1 0 0 0 1.9K 26 1 83G 83G 1.4K 400 0 0 0 0 0 2.0G 0
14:04:55 1.5K 1 0 1.5K 1 0 0 0 0 1.5K 1 0 83G 83G 1.2K 289 0 0 0 0 0 2.0G 0
14:04:56 2.0K 22 1 2.0K 22 1 0 0 0 2.0K 22 1 83G 83G 1.5K 486 0 0 0 0 0 2.0G 0
14:04:57 1.4K 0 0 1.4K 0 0 0 0 0 1.4K 0 0 83G 83G 831 618 0 0 0 0 0 2.0G 0
14:04:58 1.8K 22 1 1.8K 22 1 0 0 0 1.8K 22 1 83G 83G 1.5K 324 0 0 0 0 0 2.0G 0
14:04:59 1.8K 0 0 1.8K 0 0 0 0 0 1.8K 0 0 83G 83G 1.4K 484 0 0 0 0 0 2.0G 0
14:05:00 2.3K 22 0 2.3K 22 0 0 0 0 2.3K 22 0 84G 84G 1.9K 400 0 0 0 0 0 2.0G 0
14:05:01 1.8K 15 0 1.8K 15 0 0 0 0 1.8K 15 0 84G 84G 1.6K 242 0 0 0 0 0 2.0G 0
14:05:02 1.7K 22 1 1.7K 22 1 0 0 0 1.7K 22 1 84G 84G 1.3K 447 0 0 0 0 0 2.0G 0
14:05:03 2.0K 41 2 2.0K 41 2 0 0 0 2.0K 41 2 84G 84G 1.6K 348 0 0 0 0 0 2.0G 0
14:05:04 2.1K 28 1 2.1K 28 1 0 0 0 2.1K 28 1 84G 84G 1.7K 406 0 0 0 0 0 2.0G 0
14:05:05 1.9K 1 0 1.9K 1 0 0 0 0 1.9K 1 0 84G 84G 1.5K 400 1 0 0 0 0 2.0G 0
14:05:06 1.8K 23 1 1.8K 23 1 0 0 0 1.8K 23 1 84G 84G 1.3K 462 1 0 0 0 0 2.0G 0
14:05:07 1.8K 0 0 1.8K 0 0 0 0 0 1.8K 0 0 84G 84G 1.6K 201 0 0 0 1 0 2.0G 0
14:05:08 2.1K 22 1 2.1K 22 1 0 0 0 2.1K 22 1 84G 84G 1.7K 403 0 0 0 0 0 2.0G 0
14:05:09 2.1K 0 0 2.1K 0 0 0 0 0 2.1K 0 0 84G 84G 1.3K 835 0 0 0 0 0 2.0G 0
14:05:10 1.9K 22 1 1.9K 22 1 0 0 0 1.9K 22 1 84G 84G 1.4K 484 0 0 0 0 0 2.0G 0
@cvoltz Fixed the alignment issues (both the new one on grow and the existing one on mread); here is the updated version. Also inverted the meaning of the grow column (it's reporting !arc_no_grow now, to make the column name make sense).
In the second listing the ARC is actually growing, so grow=0 (which will report as 1 in the new version, while arc growth isn't disabled), free=2G (could be clamped to this) and need=0 (no memory pressure) seem to make sense. It would have been interesting to see the new columns for the first run, as the ARC was shrinking there. Should you do another test in the original configuration (with the deduped dataset), I would be interested in looking at the output over a longer time (and, if possible, a list of the timestamps when you started/stopped the test, when you noticed the system going sluggish, when [dbuf_free] started to spin with high CPU, ...).
Will try to dig deeper into this over the weekend.
@cvoltz the strange ARC behavior may be fixed by 5e8ff25644dfed600b4bdfb201ea6db003b4ad2c and the list corruption is addressed by #8005. Both of these will be in 0.7.12, and are easy to cherry-pick if you need them now.
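For example, on a checkout of the zfs source tree (branch/tag illustrative):

git checkout zfs-0.7.11                                   # your current base
git cherry-pick 5e8ff25644dfed600b4bdfb201ea6db003b4ad2c  # ARC behavior fix
# plus the commit(s) from #8005 for the list corruption, then rebuild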
My machine was updated to 0.8.2-1. I went ahead and tested with the old dedup dataset and captured the updated arcstat.py output. I have attached the arcstat.log. Here are the timestamps of relevant events:
08:25:30 started cp -al /backup/dedup-snapshots/.sync /backup/dedup-snapshots/hourly.tmp
08:26:00 started rsync -avP /backup/snapshots/.sync/ /backup/dedup-snapshots/.sync/
08:28:06 noticed high CPU for arc_prune occasionally
08:29:27 dbuf_evict starts showing up as a noticeable amount of CPU (6-11%)
08:31:19 dbuf_evict goes to 99% CPU for several seconds
08:33:40 dbuf_evict goes to 24% CPU for several seconds
08:35:30 rsync appears to hang with process in D state
08:37:40 killed rsync
08:38:00 restarted rsync
08:38:50 dbuf_evict high CPU for a couple of seconds
08:40:45 started du -shc /backup/dedup-snapshots/hourly.tmp to see how much data had been copied
08:47:30 killed cp
08:49:00 started ls -d /backup/dedup-snapshots/hourly.tmp | xargs --max-procs=0 --max-args=1 --replace=@ find @ -delete
08:50:05 du finished (102 GB reported)
08:54:00 killed rsync
08:54:30 stopped collection
08:54:52 restarted collection since txg_sync was top process (only 1.3-1.7%) but system load kept increasing
08:57:00 system load at 68.84
08:57:25 system load at 76.84
08:58:10 system load > 80
08:59:15 system load down to 28.62
09:01:30 stopped collection
As best as I can tell, both problems have been resolved. I'm not sure what caused the very high system load. I did see a number of kernel hung task messages in /var/log/messages at 08:53:56 so I have attached a snippet from that as well (messages.log).
@cvoltz thanks for following up on this. Let's give it a bit longer on 0.8.2 to make sure everything is really alright, assuming it is I'd suggest we close out this issue.
Curious what hardware was at play here and what these values were based on?

options zfs zfs_dirty_data_max=13498557645
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
HP DL380p Gen8 with 128 GB of memory. The settings are based on the recommended settings for our Lustre machines (to get the best performance), where zfs_dirty_data_max is set to 10% of the available memory using this calculation:
free -b | awk '/Mem/ { printf "%.0f", $2 * 0.10 }'
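If it helps anyone reproduce that tuning, the computed value can be written straight into the module config shown earlier (illustrative one-liner):

# set zfs_dirty_data_max to 10% of physical memory at next module load
printf 'options zfs zfs_dirty_data_max=%s\n' \
    "$(free -b | awk '/Mem/ { printf "%.0f", $2 * 0.10 }')" \
    >> /etc/modprobe.d/zfs.conf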
Sorry, I meant more specifically the I/O and CPU use you're seeing. Are those all flash, or?
It has dual Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz (10 cores/20 threads each), each with 64 GB of memory. It has one pool with four RAIDZ2 VDEVs. The first two VDEVs each have five MB4000FCWDK drives in an internal enclosure connected to an embedded P420i storage controller. The controller itself is in RAID mode but these drives are each configured as a single-drive RAID0 volume, so they act the same as if the controller were in HBA mode. MB4000FCWDK drives are 4 TB, dual-port, 3.5 inch, 6G, 7200 RPM, midline SAS HDDs. The remaining two VDEVs each have twelve MM2000GEFRA drives in an external D3700 enclosure connected to an embedded P431 storage controller (in HBA mode). MM2000GEFRA drives are 2 TB, single-port, 2.5 inch, 6G, 7200 RPM, midline SATA HDDs.
For completeness, the DL380p has 12 bays in the front. The first two drives are 2 TB drives in a hardware RAID1 array and the resulting volume is partitioned for the boot and OS filesystems. The D3700 has 25 bays and the last bay holds a spare drive attached to the pool as a hot-spare.
The machine was cobbled together over time from available parts, so it isn't an ideal configuration.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
Poor performance with low I/O utilization and high CPU usage. This may be similar to #6952 except the perf results are different and I'm using the latest release, which suggests a new bottleneck.
Describe how to reproduce the problem
1. Create a pool named backup with multiple RAIDZ2 VDEVs (a sketch of these steps as commands follows after this list). In my case, I have 4 VDEVs: 2 have 12 disks (each 2 TB) and the other 2 have 5 disks (each 4 TB), so the VDEVs are roughly equal in size.
2. Create a dataset named snapshots in the backup pool.
3. Create a .sync directory within the snapshots dataset.
4. Populate the .sync directory with a bunch of small files in nested directories. My dataset has only 6.49 TB of data from the /etc, /home, /opt, /usr/local, /var/lib, /var/log, /var/named, and /var/www directories of several machines, in 7,284,710 files, in directories as deep as 35 levels.
5. Run cp -al /backup/snapshots/.sync /backup/snapshots/hourly.0. This is what rsnapshot does when rolling over an interval (an hour in this case).
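A minimal sketch of steps 1-3 as commands (device names are placeholders, and the dedup property placement is an assumption based on the "Enable dedup" step quoted at the top of the thread, not the exact layout above):

# placeholder disks; the real pool used 2x12-disk and 2x5-disk RAIDZ2 vdevs
zpool create backup \
    raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl \
    raidz2 sdm sdn sdo sdp sdq
zfs create backup/snapshots
zfs set dedup=on backup/snapshots
mkdir /backup/snapshots/.sync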
Similar poor performance can be seen by running:

find /backup/snapshots/.sync -type f -print | wc -l

or

find /backup/snapshots -type d | sed 's|[^/]||g' | sort | tail -n1

or

du -shc /backup/snapshots

All take an excessively long time (more than an hour) to run.
Include any warning/errors/backtraces from the system logs
[screenshot: top output]
[screenshot: perf top -p 1498 output]