openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

High CPU usage of dbuf_evict for small random I/O #7922

Closed cvoltz closed 3 years ago

cvoltz commented 6 years ago

System information

Type                  Version/Name
Distribution Name     CentOS
Distribution Version  7.5
Linux Kernel          3.10.0-862.11.6.el7.x86_64
Architecture          x86_64
ZFS Version           0.7.11-1
SPL Version           0.7.11-1

Describe the problem you're observing

Poor performance with low I/O utilization and high CPU usage. This may be similar to #6952 except the perf results are different and I'm using the latest release, which suggests a new bottleneck.

Describe how to reproduce the problem

  1. Create a pool named backup with multiple RAIDZ2 VDEVs. In my case, I have 4 VDEVs: 2 have 12 disks (each 2 TB) and the other 2 have 5 disks (each 4 TB), so the VDEVs are roughly equal in size.
  2. Create a dataset named snapshots in the backup pool
  3. Create a .sync directory within the dataset
  4. Enable dedup and compression on the snapshots dataset
  5. Fill the .sync directory with a bunch of small files in nested directories. My dataset has only 6.49 TB of data, taken from the /etc, /home, /opt, /usr/local, /var/lib, /var/log, /var/named, and /var/www directories of several machines: 7,284,710 files in directories as deep as 35 levels.
  6. Run cp -al /backup/snapshots/.sync /backup/snapshots/hourly.0, which is what rsnapshot does when rolling over an interval (an hour in this case).

Similar poor performance can be seen by running find /backup/snapshots/.sync -type f -print | wc -l, find /backup/snapshots -type d | sed 's|[^/]||g' | sort | tail -n1, or du -shc /backup/snapshots. All take an excessively long time (more than an hour) to run.
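For reference, a minimal sketch of the setup in steps 1-6 (device names and vdev widths below are placeholders, not my exact layout):

# zpool create backup raidz2 sda sdb sdc sdd sde sdf raidz2 sdg sdh sdi sdj sdk sdl
# zfs create -o dedup=on -o compression=on backup/snapshots
# mkdir /backup/snapshots/.sync

After filling .sync with a few million small files in deep directory trees, the rollover that triggers the problem is just:

# cp -al /backup/snapshots/.sync /backup/snapshots/hourly.0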

Include any warning/errors/backtraces from the system logs

top screenshot from 2018-09-17 14-13-12

perf top -p 1498: screenshot from 2018-09-17 14-17-22

# zpool iostat 1 30
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
backup      6.49T  73.3T    237    708  1.36M  4.35M
backup      6.49T  73.3T    420      0  1.64M      0
backup      6.49T  73.3T    268      0  1.05M      0
backup      6.49T  73.3T    419      0  1.65M      0
backup      6.49T  73.3T    584      0  2.28M      0
backup      6.49T  73.3T    513  3.04K  2.01M  18.4M
backup      6.49T  73.3T    228      0   915K      0
backup      6.49T  73.3T    102      0   410K      0
backup      6.49T  73.3T    265      0  1.04M      0
backup      6.49T  73.3T    229      0   919K      0
backup      6.49T  73.3T    261  2.72K  1.02M  16.9M
backup      6.49T  73.3T    219      0   879K      0
backup      6.49T  73.3T    284      0  1.11M      0
backup      6.49T  73.3T    192      0   772K      0
backup      6.49T  73.3T    263  1.01K  1.04M  4.82M
backup      6.49T  73.3T     85  2.45K   342K  14.7M
backup      6.49T  73.3T    132      0   529K      0
backup      6.49T  73.3T    157      0   631K      0
backup      6.49T  73.3T    126      0   504K      0
backup      6.49T  73.3T     47    275   191K  1.08M
backup      6.49T  73.3T     59  1.95K   254K  11.3M
backup      6.49T  73.3T     52      0   211K      0
backup      6.49T  73.3T     40      0   163K      0
backup      6.49T  73.3T     45      0   183K      0
backup      6.49T  73.3T     38    561   155K  2.25M
backup      6.49T  73.3T     23    966  95.3K  3.96M
backup      6.49T  73.3T     40      0   163K      0
backup      6.49T  73.3T     53      0   215K      0
backup      6.49T  73.3T     37      0   151K      0
backup      6.49T  73.3T     31    136   127K   547K
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
backup            6.54T  48.3T   230K  /backup
backup/log        63.4M  48.3T  63.4M  /backup/log
backup/snapshots  6.49T  48.3T  6.49T  /backup/snapshots
# zpool get all backup
NAME    PROPERTY                       VALUE                          SOURCE
backup  size                           79.8T                          -
backup  capacity                       8%                             -
backup  altroot                        -                              default
backup  health                         ONLINE                         -
backup  guid                           16460844651660370475           -
backup  version                        -                              default
backup  bootfs                         -                              default
backup  delegation                     on                             default
backup  autoreplace                    off                            default
backup  cachefile                      none                           local
backup  failmode                       wait                           default
backup  listsnapshots                  off                            default
backup  autoexpand                     off                            default
backup  dedupditto                     0                              default
backup  dedupratio                     1.56x                          -
backup  free                           73.3T                          -
backup  allocated                      6.49T                          -
backup  readonly                       off                            -
backup  ashift                         12                             local
backup  comment                        -                              default
backup  expandsize                     -                              -
backup  freeing                        0                              -
backup  fragmentation                  4%                             -
backup  leaked                         0                              -
backup  multihost                      off                            default
backup  feature@async_destroy          enabled                        local
backup  feature@empty_bpobj            active                         local
backup  feature@lz4_compress           active                         local
backup  feature@multi_vdev_crash_dump  enabled                        local
backup  feature@spacemap_histogram     active                         local
backup  feature@enabled_txg            active                         local
backup  feature@hole_birth             active                         local
backup  feature@extensible_dataset     active                         local
backup  feature@embedded_data          active                         local
backup  feature@bookmarks              enabled                        local
backup  feature@filesystem_limits      enabled                        local
backup  feature@large_blocks           enabled                        local
backup  feature@large_dnode            active                         local
backup  feature@sha512                 enabled                        local
backup  feature@skein                  enabled                        local
backup  feature@edonr                  enabled                        local
backup  feature@userobj_accounting     active                         local
# zfs get all backup/snapshots
NAME              PROPERTY              VALUE                  SOURCE
backup/snapshots  type                  filesystem             -
backup/snapshots  creation              Sat Sep 15 15:35 2018  -
backup/snapshots  used                  6.49T                  -
backup/snapshots  available             48.3T                  -
backup/snapshots  referenced            6.49T                  -
backup/snapshots  compressratio         1.06x                  -
backup/snapshots  mounted               yes                    -
backup/snapshots  quota                 none                   default
backup/snapshots  reservation           none                   default
backup/snapshots  recordsize            128K                   default
backup/snapshots  mountpoint            /backup/snapshots      default
backup/snapshots  sharenfs              off                    default
backup/snapshots  checksum              on                     default
backup/snapshots  compression           on                     local
backup/snapshots  atime                 on                     default
backup/snapshots  devices               on                     default
backup/snapshots  exec                  on                     default
backup/snapshots  setuid                on                     default
backup/snapshots  readonly              off                    default
backup/snapshots  zoned                 off                    default
backup/snapshots  snapdir               hidden                 default
backup/snapshots  aclinherit            restricted             default
backup/snapshots  createtxg             24                     -
backup/snapshots  canmount              on                     default
backup/snapshots  xattr                 on                     default
backup/snapshots  copies                1                      default
backup/snapshots  version               5                      -
backup/snapshots  utf8only              off                    -
backup/snapshots  normalization         none                   -
backup/snapshots  casesensitivity       sensitive              -
backup/snapshots  vscan                 off                    default
backup/snapshots  nbmand                off                    default
backup/snapshots  sharesmb              off                    default
backup/snapshots  refquota              none                   default
backup/snapshots  refreservation        none                   default
backup/snapshots  guid                  7864697502022885929    -
backup/snapshots  primarycache          all                    default
backup/snapshots  secondarycache        all                    default
backup/snapshots  usedbysnapshots       0B                     -
backup/snapshots  usedbydataset         6.49T                  -
backup/snapshots  usedbychildren        0B                     -
backup/snapshots  usedbyrefreservation  0B                     -
backup/snapshots  logbias               latency                default
backup/snapshots  dedup                 on                     local
backup/snapshots  mlslabel              none                   default
backup/snapshots  sync                  standard               default
backup/snapshots  dnodesize             legacy                 default
backup/snapshots  refcompressratio      1.06x                  -
backup/snapshots  written               6.49T                  -
backup/snapshots  logicalused           6.73T                  -
backup/snapshots  logicalreferenced     6.73T                  -
backup/snapshots  volmode               default                default
backup/snapshots  filesystem_limit      none                   default
backup/snapshots  snapshot_limit        none                   default
backup/snapshots  filesystem_count      none                   default
backup/snapshots  snapshot_count        none                   default
backup/snapshots  snapdev               hidden                 default
backup/snapshots  acltype               off                    default
backup/snapshots  context               none                   default
backup/snapshots  fscontext             none                   default
backup/snapshots  defcontext            none                   default
backup/snapshots  rootcontext           none                   default
backup/snapshots  relatime              off                    default
backup/snapshots  redundant_metadata    all                    default
backup/snapshots  overlay               off                    default
GregorKopka commented 6 years ago

Enable dedup

Would be my guess as to the cause of your problem. The moment the DDT doesn't fit into ARC (which it clearly doesn't in your case), performance will completely tank, as the system will be busy reading DDT entries from disk (eating their random IOPS, of which you have maybe 400-600 in total in the pool, given the raidz vdev random-read characteristics) while at the same time evicting (dbuf_evict) other DDT entries (that likely get read back soon thereafter) to make room in ARC for the DDT reads. It can only go downhill from there.

Dedup is a nice idea and works reasonably well when you have more than enough memory to fully cache the DDT, on pure SSD pools, or (hopefully, once metadata allocation classes become available) on pools that have some dedicated metadata vdevs backed by fast devices - but on a pure spinning-rust pool (especially one made out of raidz vdevs) it's the most effective recipe for bringing ZFS to a crawl.
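If you want to see how big the DDT actually is rather than estimating it, the pool can report it directly; something like the following (the "dedup:" summary line shows the number of entries and the on-disk/in-core sizes, zdb -DD prints the full histogram):

# zpool status -D backup
# zdb -DD backup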

Doesn't explain, though, why ZFS is that busy in arc_buf_destroy_impl.

Some questions:

cvoltz commented 6 years ago
# cat /etc/modprobe.d/zfs.conf
options zfs metaslab_debug_unload=1
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_arc_max=101239182336
options zfs zfs_dirty_data_max=13498557645
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
options zfs zfs_vdev_async_write_min_active=16
options zfs zfs_vdev_async_write_max_active=32
options zfs zfs_vdev_sync_write_min_active=32
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_max_active=2000

The machine has 128 GiB of memory and zfs_arc_max = 101,239,182,336 (about 3/4 of the system memory). zfs_arc_meta_limit_percent is 75, so the DDT can use 75% of the ARC memory, which is ~71 GiB. With 6.9 TB allocated and assuming an average block size of around 64 kB, that is 107,812,500 blocks; at 320 bytes per block, the DDT should be around ~32 GiB, so I expected it would all be in memory (at least for now).
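Spelling that estimate out as a quick sanity check (same numbers as above, not measured values):

# awk 'BEGIN { blocks = 6.9e12 / 64e3; printf "%d blocks -> ~%.1f GiB of DDT\n", blocks, blocks * 320 / 2^30 }'
107812500 blocks -> ~32.1 GiB of DDT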

All screenshots are from about the same time. Yes, rsync was also running in the background when the rollover was occurring. That would have been generating additional reads as the system it was syncing had already been synced.

GregorKopka commented 6 years ago

Sorry, I misparsed the 13182185+ in the top output as ~12 GB of memory for the machine (was counting the places but overlooked the '+') - you're totally right, with an order of magnitude more RAM the DDT issue should be less severe.

Nevertheless, it might be that the metadata reads for rsync/find/... displace DDT from ARC.

Could you please post the output of arcstat.py -f time,read,miss,miss%,dread,dmis,dm%,pread,pmis,pm%,mread,mmis,mm%,arcsz,c,mfu,mru,mfug,mrug,eskip,mtxmis 1 30 while the system is in such a state? Might be helpful...

cvoltz commented 6 years ago

Looks like having rsync running and writing new files is relevant to reproducing the problem. When only the cp is running, dbuf_evict isn't consuming a bunch of CPU.

# arcstat.py -f time,read,miss,miss%,dread,dmis,dm%,pread,pmis,pm%,mread,mmis,mm%,arcsz,c,mfu,mru,mfug,mrug,eskip,mtxmis 1 30
    time  read  miss  miss%  dread  dmis  dm%  pread  pmis  pm%  mread  mmis  mm%  arcsz     c   mfu   mru  mfug  mrug  eskip  mtxmis
13:28:52  1.2K   181     14   1.2K   181   14      0     0    0  1.2K   181   15    54G   54G  1.0K     0     0     0      0       0
13:28:53  141K   17K     12   139K   17K   12   2.2K   388   17  138K   17K   12    54G   54G  123K   102   395     0      0       0
13:28:54   88K   11K     13    86K   11K   13   1.5K   236   16   86K   11K   13    54G   54G   75K   968   249   304      0       0
13:28:55   74K   10K     13    74K  9.9K   13    452   183   40   72K   10K   13    54G   54G   64K   319    72   121      0       0
13:28:56  143K   19K     13   141K   18K   13   2.2K   368   16  139K   19K   13    54G   54G  124K   339   376   107      0       0
13:28:57   19K  2.6K     13    19K  2.6K   13      0     0    0   18K  2.6K   13    54G   54G   15K  1.5K     2   198      0       0
13:28:58  131K   16K     12   129K   16K   12   2.2K   375   16  128K   16K   13    54G   54G  114K   141   384     4      0       0
13:28:59   86K   11K     13    85K   11K   13    974   229   23   84K   11K   14    54G   54G   71K  2.4K   136   415      0       0
13:29:00   90K   11K     12    89K   11K   12    799   138   17   88K   11K   12    54G   54G   79K   136   134     7      0       0 
13:29:01  143K   18K     13   141K   18K   12   1.3K   364   27  139K   18K   13    54G   54G  124K   491   282   260      0       0
13:29:02   32K  4.3K     13    31K  4.2K   13    767   168   21   31K  4.3K   13    54G   54G   26K  1.6K   131   159      0       0
13:29:03  143K   18K     12   142K   17K   12   1.4K   254   18  140K   18K   12    53G   53G  125K   112   260     0      0       4
13:29:04   65K  8.6K     13    64K  8.4K   13   1.5K   228   15   63K  8.6K   13    53G   53G   55K   878   234   311      0       0
13:29:05   30K  3.7K     12    30K  3.7K   12      0     0    0   30K  3.7K   12    53G   53G   26K   103     0     0      0       0
13:29:06  138K   18K     13   136K   17K   13   2.2K   381   17  135K   18K   13    53G   53G  120K   422   389   158      0       0
13:29:07  113K   14K     12   112K   14K   12   1.2K   270   22  111K   14K   13    53G   53G   95K  3.7K   181   266      0       0
13:29:08  164K   20K     12   161K   20K   12   2.2K   370   16  160K   20K   12    53G   53G  143K    74   378     0      0       0
13:29:09   70K  9.8K     13    69K  9.7K   13    723   137   18   68K  9.8K   14    53G   54G   59K  1.6K   146   287      0       0
13:29:10  153K   19K     12   151K   19K   12   2.0K   361   17  149K   19K   12    53G   54G  134K    91   336    30      0       0
13:29:11   76K   10K     13    75K   10K   13   1.3K   287   21   74K   10K   13    54G   54G   65K   959   211   399      0       0
13:29:12   31K  4.4K     13    31K  4.4K   13     26    26  100   30K  4.4K   14    53G   54G   26K   150    28     0      0       0
13:29:13  156K   19K     12   154K   19K   12   1.9K   426   22  152K   19K   13    54G   54G  136K    79   432     0      0       0
13:29:14   39K  5.2K     13    38K  5.1K   13    690     6    0   38K  5.2K   13    54G   54G   33K  1.4K    14   309      0       0
13:29:15  122K   16K     13   120K   15K   13   1.7K   423   25  119K   16K   13    53G   54G  105K   141   431     0      0       0
13:29:16   94K   12K     13    92K   12K   13   1.8K   630   34   91K   12K   13    54G   54G   78K  2.4K   643   301      0       0
13:29:17  113K   17K     15   111K   17K   15   1.5K   138    9  109K   17K   15    53G   54G   95K   193   146     0      0       0
13:29:18  123K   20K     16   121K   19K   16   2.2K   610   28  119K   20K   16    54G   54G  102K   463   534   253      0       0
13:29:19   32K  3.9K     11    31K  3.8K   11    737   118   16   31K  3.9K   12    54G   54G   27K  1.3K   122   117      0       0
13:29:20  134K   17K     12   132K   16K   12   2.6K   588   22  131K   17K   12    54G   54G  117K   414   592     0      0       0
13:29:21   70K   11K     15    69K   11K   15    763     1    0   68K   11K   16    54G   54G   57K  2.0K    12   262      0       0
GregorKopka commented 6 years ago

According to https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_arc_max the arcstat.py 'c' column should reflect zfs_arc_max - but why isn't it constant (showing both 54GB and 53GB over the run)?

So either the zfs_arc_max=101239182336 parameter wasn't set for some reason, it was changed (initially and then twice mid-run), or something really strange is happening here... could you please check with cat /sys/module/zfs/parameters/zfs_arc_max what value is actually configured for that parameter?
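While you're at it, comparing the live ARC target/cap against the module parameter straight from the kstats might help (a quick sketch, field names as on 0.7.x):

# awk '$1 == "c" || $1 == "c_max" || $1 == "size" { printf "%-6s %s\n", $1, $3 }' /proc/spl/kstat/zfs/arcstats
# cat /sys/module/zfs/parameters/zfs_arc_max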

cvoltz commented 6 years ago
# cat /sys/module/zfs/parameters/zfs_arc_max
101239182336

It was set by the module load parameters at boot. I certainly didn't change it while the arcstat.py was running and I don't know of any other process that would have.

GregorKopka commented 6 years ago

@cvoltz Wanted to make sure that the intended value is actually set (I have been bitten by an initramfs getting in the way of this a while ago); I didn't mean to imply that you changed zfs_arc_max, but was just trying to enumerate how arc_c could be lower than zfs_arc_max when massive amounts of free memory are available (I took it that something like the 70 GB+ from the top screenshot was available while the arcstat.py output was taken, thus no memory pressure should exist that would keep arc_c from growing).

As far as I understand the code, I would expect [dbuf_evict] to spin often under a metadata-heavy load like rsync - though I can't yet explain why it's consuming that much CPU.

Stumbled over another issue where dbuf_evict took 100% CPU and locked the machine, https://github.com/zfsonlinux/zfs/issues/7781; no idea why this was closed (unless enabling dedup these days is a configuration issue, which I think would be a strange stance).

Regarding arc_c being way below zfs_arc_max when free memory is available, I have a theory:

As far as I can see, arc_c can only grow while arc_no_grow is unset.

And arc_no_grow is kept set by arc_reclaim until it has completed an idle loop and arc_grow_retry has elapsed. So under high load (which keeps it busy, as it must keep removing entries from an ARC that currently can't grow), it could well be [arc_reclaim] that keeps the ARC under imagined memory pressure while the load lasts, even with plenty of free memory already available again.

This could be a reason why a busy system cuts its ARC needlessly short, in your case likely made worse by DDT getting displaced from the shrunken ARC by the metadata load created by rsync/find.

In case you're as curious as I am whether that's a correct assumption: I have added grow, free, and need columns to arcstat.py; it could be interesting to start it (just add ,grow,free,need to the -f parameter list) after a reboot (before a pool is initially imported) and then, in another shell, import the pool and run the stress test you did before.
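If rebuilding arcstat.py is a hassle, the same counters can also be watched straight from the kstats (a rough sketch; field names as they appear in 0.7.x, adjust if your version differs):

# while sleep 1; do awk '$1 ~ /^(c|arc_no_grow|arc_need_free|arc_sys_free)$/ { printf "%s=%s  ", $1, $3 } END { print "" }' /proc/spl/kstat/zfs/arcstats; done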

There is also something in arc_shrink that looks strange to me. It was introduced 5 May 2017 by https://github.com/zfsonlinux/zfs/commit/37fb3e431845b934df9771d7bcca5fbef79f4c1e https://github.com/zfsonlinux/zfs/blob/0405eeea6a8d142171ce352d3dac5f08cd6f70f0/module/zfs/arc.c#L4759-L4766 and not only shrinks arc_c by to_free but also cuts it down to the current arc_size (no idea if arc_shrink() could be called outside real external memory pressure, though). I would just expect the clamp to be prior to the arc_c = c - to_free; two lines up. @ahrens or @behlendorf: could you please take a look at whether that clamping in arc_shrink (last two lines of the quote above) makes sense in that location?

cvoltz commented 6 years ago

Well, things seem to have gone from bad to worse here. I renamed the deduped dataset to backup/dedup-snapshots, created another dataset (without dedup) called backup/snapshots, and ran rsync to copy the files from backup/dedup-snapshots to backup/snapshots, with the intention of eventually removing backup/dedup-snapshots and using the system without dedup so I wouldn't have to worry about this problem. After the rsync had copied over a few TB worth of data, since it was taking so long, I decided I could just sync a subset instead, so I stopped it and started rsync again specifying only the subset. While that was running, I ran ls -d /backup/snapshots/hourly.* | xargs --max-procs=0 --max-args=1 --replace=@ find @ -delete to remove the files that had already been synced and which I decided I didn't need. After a while, systemd-logger was eating a ton of CPU and the system eventually became completely unresponsive. After the system was rebooted, I was able to find this repeated many times in the system log:

Sep 19 15:30:43 buri kernel: WARNING: CPU: 20 PID: 2092 at lib/list_debug.c:62 __list_del_entry+0x82/0xd0
Sep 19 15:30:43 buri kernel: list_del corruption. next->prev should be ffffa07f685637f0, but was dead000000000200
Sep 19 15:30:43 buri kernel: Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass zfs(POE) crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel lrw gf128mul glue_helper ablk_helper cryptd zunicode(POE) zavl(POE) icp(POE) pcspkr zcommon(POE) znvpair(POE) spl(OE) lpc_ich hpilo hpwdt joydev ipmi_ssif ipmi_si sg ipmi_devintf
Sep 19 15:30:43 buri kernel: ipmi_msghandler acpi_power_meter wmi ioatdma shpchp pcc_cpufreq ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif crct10dif_generic uas usb_storage mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe crct10dif_pclmul crct10dif_common tg3 crc32c_intel serio_raw hpsa mdio i2c_core dca ptp scsi_transport_sas pps_core dm_mirror dm_region_hash dm_log dm_mod
Sep 19 15:30:43 buri kernel: CPU: 20 PID: 2092 Comm: dp_sync_taskq Kdump: loaded Tainted: P           OE  ------------   3.10.0-862.11.6.el7.x86_64 #1
Sep 19 15:30:43 buri kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 01/22/2018
Sep 19 15:30:43 buri kernel: Call Trace:
Sep 19 15:30:43 buri kernel: [<ffffffffaa5135d4>] dump_stack+0x19/0x1b
Sep 19 15:30:43 buri kernel: [<ffffffffa9e945d8>] __warn+0xd8/0x100
Sep 19 15:30:43 buri kernel: [<ffffffffa9e9465f>] warn_slowpath_fmt+0x5f/0x80
Sep 19 15:30:43 buri kernel: [<ffffffffaa16cf22>] __list_del_entry+0x82/0xd0
Sep 19 15:30:43 buri kernel: [<ffffffffaa16cf7d>] list_del+0xd/0x30
Sep 19 15:30:43 buri kernel: [<ffffffffc09ca3d5>] multilist_sublist_remove+0x15/0x20 [zfs]
Sep 19 15:30:43 buri kernel: [<ffffffffc0993094>] userquota_updates_task+0x1d4/0x580 [zfs]
Sep 19 15:30:43 buri kernel: [<ffffffffc09902f0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
Sep 19 15:30:43 buri kernel: [<ffffffffc09902f0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
Sep 19 15:30:43 buri kernel: [<ffffffffaa16cf7d>] ? list_del+0xd/0x30
Sep 19 15:30:43 buri kernel: [<ffffffffc04ecd2c>] taskq_thread+0x2ac/0x4f0 [spl]
Sep 19 15:30:43 buri kernel: [<ffffffffa9ed1fe0>] ? wake_up_state+0x20/0x20
Sep 19 15:30:43 buri kernel: [<ffffffffc04eca80>] ? taskq_thread_spawn+0x60/0x60 [spl]
Sep 19 15:30:43 buri kernel: [<ffffffffa9ebdf21>] kthread+0xd1/0xe0
Sep 19 15:30:43 buri kernel: [<ffffffffa9ebde50>] ? insert_kthread_work+0x40/0x40
Sep 19 15:30:43 buri kernel: [<ffffffffaa5255f7>] ret_from_fork_nospec_begin+0x21/0x21
Sep 19 15:30:43 buri kernel: [<ffffffffa9ebde50>] ? insert_kthread_work+0x40/0x40
Sep 19 15:30:43 buri kernel: ---[ end trace fb2fe7f782509b51 ]---

Apparently, this got generated so many times that systemd-logger started missing kernel messages and, presumably, that was why it was taking so much CPU.

The failure happened several times so I started looking into reproducing it, but I can't anymore since the files were finally deleted. I did keep the deduped dataset (for now) so I could probably reproduce the problem if needed, but the goal was to set up the system as a backup server, not to do testing.

I think the takeaway here is that there is still some work to do with regard to the handling of small files, with and without dedup, under heavy load (many relatively small files being created, deleted, or having their attributes read).

ahrens commented 6 years ago

@cvoltz Can you file a separate issue for the list_del corruption?

loli10K commented 6 years ago

@ahrens https://github.com/zfsonlinux/zfs/issues/7933 seems to be the same list_del corruption bug.

cvoltz commented 6 years ago

@ahrens Yes, I can but since @loli10K found a similar existing issue (granted, without much information), do you want me to create a separate issue or append the information above into #7933?

ahrens commented 6 years ago

Let's append to the existing bug, thanks for locating that @loli10K

cvoltz commented 6 years ago

@GregorKopka I definitely see the ARC target size changing on the system when running just rsync on the dataset without deduplication, despite memory being available:

# free -g
              total        used        free      shared  buff/cache   available
Mem:            125          67          24           0          33          24
Swap:             7           0           7
# arcstat.py  1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
11:30:51   35K  3.5K      9  3.3K    9   249   32  3.0K    8   1.1G   94G
...
13:28:09  1.2K    18      1    18    1     0    0    18    1    93G   93G
...
13:28:15  1.3K    54      4    54    4     0    0    54    4    90G   90G
...
13:28:29  1.8K    97      5    97    5     0    0    97    5    91G   91G
...
13:28:40  1.4K    87      6    87    6     0    0    87    6    92G   92G
...
13:29:00  2.0K    66      3    66    3     0    0    66    3    90G   90G
13:29:01   973    48      4    48    4     0    0    48    4    90G   90G
13:29:02  2.0K    51      2    51    2     0    0    51    2    89G   89G
...
13:29:06  2.0K    90      4    90    4     0    0    90    4    89G   90G
...
13:29:18  1.9K     0      0     0    0     0    0     0    0    91G   91G
...
13:29:38  2.0K    25      1    25    1     0    0    25    1    88G   88G
...
13:51:59   39K  6.6K     16  6.6K   16     0    0  6.6K   16    84G   84G

I tried running your modified arcstat.py as you requested, but using just the load from rsync. You can see the ARC target size changing while the new columns stay flat (and it looks like you need to fix a bug in your alignment):

# ./arcstat.py -f time,read,miss,miss%,dread,dmis,dm%,pread,pmis,pm%,mread,mmis,mm%,arcsz,c,mfu,mru,mfug,mrug,eskip,mtxmis,grow,free,need 1 30
    time  read  miss  miss%  dread  dmis  dm%  pread  pmis  pm%  mread  mmis  mm%  arcsz     c   mfu   mru  mfug  mrug  eskip  mtxmis  grow  free  need
14:04:41    24     0      0     24     0    0      0     0    0    24     0    0    81G   82G    24     0     0     0      0       0  0  2.0G     0
14:04:42  1.7K    46      2   1.7K    46    2      0     0    0  1.7K    46    2    82G   82G  1.2K   454     0     0      0       0  0  2.0G     0
14:04:43  1.9K    22      1   1.9K    22    1      0     0    0  1.9K    22    1    82G   82G  1.7K   201     0     0      0       0  0  2.0G     0
14:04:44  1.3K    14      1   1.3K    14    1      0     0    0  1.3K    14    1    82G   82G   916   406     0     0      0       0  0  2.0G     0
14:04:45  2.0K    26      1   2.0K    26    1      0     0    0  2.0K    26    1    82G   82G  1.2K   763     0     0      0       0  0  2.0G     0
14:04:46  1.7K     0      0   1.7K     0    0      0     0    0  1.7K     0    0    82G   82G  1.1K   606     0     0      0       0  0  2.0G     0
14:04:47  1.7K    22      1   1.7K    22    1      0     0    0  1.7K    22    1    82G   82G  1.4K   292     0     0      0       0  0  2.0G     0
14:04:48  1.8K     0      0   1.8K     0    0      0     0    0  1.8K     0    0    82G   82G  1.1K   664     0     0      0       0  0  2.0G     0
14:04:49  2.0K    22      1   2.0K    22    1      0     0    0  2.0K    22    1    82G   82G  1.4K   608     0     0      0       0  0  2.0G     0
14:04:50  1.6K     0      0   1.6K     0    0      0     0    0  1.6K     0    0    82G   83G  1.2K   392     0     0      0       0  0  2.0G     0
14:04:51  1.9K    22      1   1.9K    22    1      0     0    0  1.9K    22    1    83G   83G  1.1K   736     0     0      0       0  0  2.0G     0
14:04:52  1.5K     0      0   1.5K     0    0      0     0    0  1.5K     0    0    83G   83G  1.2K   298     0     0      0       0  0  2.0G     0
14:04:53  1.7K     0      0   1.7K     0    0      0     0    0  1.7K     0    0    83G   83G  1.1K   612     0     0      0       0  0  2.0G     0
14:04:54  1.9K    26      1   1.9K    26    1      0     0    0  1.9K    26    1    83G   83G  1.4K   400     0     0      0       0  0  2.0G     0
14:04:55  1.5K     1      0   1.5K     1    0      0     0    0  1.5K     1    0    83G   83G  1.2K   289     0     0      0       0  0  2.0G     0
14:04:56  2.0K    22      1   2.0K    22    1      0     0    0  2.0K    22    1    83G   83G  1.5K   486     0     0      0       0  0  2.0G     0
14:04:57  1.4K     0      0   1.4K     0    0      0     0    0  1.4K     0    0    83G   83G   831   618     0     0      0       0  0  2.0G     0
14:04:58  1.8K    22      1   1.8K    22    1      0     0    0  1.8K    22    1    83G   83G  1.5K   324     0     0      0       0  0  2.0G     0
14:04:59  1.8K     0      0   1.8K     0    0      0     0    0  1.8K     0    0    83G   83G  1.4K   484     0     0      0       0  0  2.0G     0
14:05:00  2.3K    22      0   2.3K    22    0      0     0    0  2.3K    22    0    84G   84G  1.9K   400     0     0      0       0  0  2.0G     0
14:05:01  1.8K    15      0   1.8K    15    0      0     0    0  1.8K    15    0    84G   84G  1.6K   242     0     0      0       0  0  2.0G     0
14:05:02  1.7K    22      1   1.7K    22    1      0     0    0  1.7K    22    1    84G   84G  1.3K   447     0     0      0       0  0  2.0G     0
14:05:03  2.0K    41      2   2.0K    41    2      0     0    0  2.0K    41    2    84G   84G  1.6K   348     0     0      0       0  0  2.0G     0
14:05:04  2.1K    28      1   2.1K    28    1      0     0    0  2.1K    28    1    84G   84G  1.7K   406     0     0      0       0  0  2.0G     0
14:05:05  1.9K     1      0   1.9K     1    0      0     0    0  1.9K     1    0    84G   84G  1.5K   400     1     0      0       0  0  2.0G     0
14:05:06  1.8K    23      1   1.8K    23    1      0     0    0  1.8K    23    1    84G   84G  1.3K   462     1     0      0       0  0  2.0G     0
14:05:07  1.8K     0      0   1.8K     0    0      0     0    0  1.8K     0    0    84G   84G  1.6K   201     0     0      0       1  0  2.0G     0
14:05:08  2.1K    22      1   2.1K    22    1      0     0    0  2.1K    22    1    84G   84G  1.7K   403     0     0      0       0  0  2.0G     0
14:05:09  2.1K     0      0   2.1K     0    0      0     0    0  2.1K     0    0    84G   84G  1.3K   835     0     0      0       0  0  2.0G     0
14:05:10  1.9K    22      1   1.9K    22    1      0     0    0  1.9K    22    1    84G   84G  1.4K   484     0     0      0       0  0  2.0G     0
GregorKopka commented 6 years ago

@cvoltz Fixed the alignment issues (both the new one on grow and the existing one on mread); here is the updated version. Also inverted the meaning of the grow column (it now reports !arc_no_grow so the column name makes sense).

In the second listing the ARC is actually growing, so grow=0 (which will report as 1 in the new version while ARC growth isn't disabled), free=2G (could be clamped to this), and need=0 (no memory pressure) seem to make sense. It would have been interesting to see the new columns for the first run, as the ARC was shrinking there. Should you do another test in the original configuration (with the deduped dataset), I would be interested in looking at the output over a longer time (and, if possible, a list of the timestamps when you started/stopped the test, when you noticed the system going sluggish, when [dbuf_free] started to spin with high CPU, ...).

Will try to dig deeper into this over the weekend.

behlendorf commented 6 years ago

@cvoltz the strange ARC behavior may be fixed by 5e8ff25644dfed600b4bdfb201ea6db003b4ad2c and the list corruption is addressed by #8005. Both of these will be in 0.7.12, and are easy to cherry-pick if you need them now.
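Something along these lines (a sketch, not exact instructions: the release branch name and build steps may differ for your packaging, and the commit hash for the #8005 fix has to be taken from that PR):

# git clone https://github.com/zfsonlinux/zfs.git && cd zfs
# git checkout zfs-0.7-release
# git cherry-pick 5e8ff25644dfed600b4bdfb201ea6db003b4ad2c
# sh autogen.sh && ./configure && make -j$(nproc) && make install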

cvoltz commented 5 years ago

My machine was updated to 0.8.2-1. I went ahead and tested with the old dedup dataset and captured the updated arcstat.py output. I have attached the arcstat.log. Here are the timestamps of relevant events:

    08:25:30 started cp -al /backup/dedup-snapshots/.sync /backup/dedup-snapshots/hourly.tmp
    08:26:00 started rsync -avP /backup/snapshots/.sync/ /backup/dedup-snapshots/.sync/
    08:28:06 noticed high CPU for arc_prune occasionally
    08:29:27 dbuf_evict starts showing up as a noticeable amount of CPU (6-11%)
    08:31:19 dbuf_evict goes to 99% CPU for several seconds
    08:33:40 dbuf_evict goes to 24% CPU for several seconds
    08:35:30 rsync appears to hang with process in D state
    08:37:40 killed rsync
    08:38:00 restarted rsync
    08:38:50 dbuf_evict high CPU for a couple of seconds
    08:40:45 started du -shc /backup/dedup-snapshots/hourly.tmp to see how much data had been copied
    08:47:30 killed cp
    08:49:00 started ls -d /backup/dedup-snapshots/hourly.tmp | xargs --max-procs=0 --max-args=1 --replace=@ find @ -delete
    08:50:05 du finished (102 GB reported)
    08:54:00 killed rsync
    08:54:30 stopped collection
    08:54:52 restarted collection since txg_sync was top process (only 1.3-1.7%) but system load kept increasing
    08:57:00 system load at 68.84
    08:57:25 system load at 76.84
    08:58:10 system load > 80
    08:59:15 system load down to 28.62
    09:01:30 stopped collection

As best as I can tell, both problems have been resolved. I'm not sure what caused the very high system load. I did see a number of kernel hung task messages in /var/log/messages at 08:53:56 so I have attached a snippet from that as well (messages.log).

behlendorf commented 5 years ago

@cvoltz thanks for following up on this. Let's give it a bit longer on 0.8.2 to make sure everything is really all right; assuming it is, I'd suggest we close out this issue.

h1z1 commented 4 years ago

Curious what hardware was at play here and what these values were based on?

options zfs zfs_dirty_data_max=13498557645
options zfs zfs_vdev_async_write_active_min_dirty_percent=20

cvoltz commented 4 years ago

HP DL380p Gen 8 with 128 GB of memory. The settings are based on the recommended settings for our Lustre machines (to get the best performance), where zfs_dirty_data_max is set to 10% of the available memory using this calculation:

free -b | awk '/Mem/ { printf "%.0f", $2 * 0.10 }'
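In case it helps, a sketch of how that value gets applied (the module option takes effect on the next load; the /sys write changes it at runtime without a reboot):

# dirty_max=$(free -b | awk '/Mem/ { printf "%.0f", $2 * 0.10 }')
# echo "options zfs zfs_dirty_data_max=${dirty_max}" >> /etc/modprobe.d/zfs.conf
# echo "${dirty_max}" > /sys/module/zfs/parameters/zfs_dirty_data_max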
h1z1 commented 4 years ago

Sorry, I meant more specifically the I/O and the CPU use you're seeing. Are those drives all flash, or?

cvoltz commented 4 years ago

It has dual Intel(R) Xeon(R) E5-2660 v2 CPUs @ 2.20 GHz (10 cores/20 threads each), each with 64 GB of memory. It has one pool with four RAIDZ2 VDEVs. The first two VDEVs each have five MB4000FCWDK drives in an internal enclosure connected to an embedded P420i storage controller. The controller itself is in RAID mode, but these drives are each configured as a single-drive RAID0 volume, so they act the same as if the controller were in HBA mode. MB4000FCWDK drives are 4 TB, dual-port, 3.5 inch, 6G, 7200 RPM, midline SAS HDDs. The remaining two VDEVs each have twelve MM2000GEFRA drives in an external D3700 enclosure connected to a P431 storage controller (in HBA mode). MM2000GEFRA drives are 2 TB, single-port, 2.5 inch, 6G, 7200 RPM, midline SATA HDDs. For completeness, the DL380p has 12 bays in the front. The first two drives are 2 TB drives in a hardware RAID1 array and the resulting volume is partitioned for the boot and OS filesystems. The D3700 has 25 bays and the last bay holds a spare drive attached to the pool as a hot spare.

The machine was cobbled together over time from available parts, so it isn't an ideal configuration.

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.