openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.55k stars 1.74k forks source link

0.7.0-rc2 txg_sync spins 100% cpu in function ddt_get_dedup_histogram when low memory but i don't have dedup #5400

Closed mailinglists35 closed 3 years ago

mailinglists35 commented 7 years ago

100% cpu wouldn't bother me but when it occurs I cannot read/write data on the pool except zpool/zfs get commands perf top -p $pidof txg_sync shows this

Samples: 78K of event 'cycles:pp', Event count (approx.): 14287937665
Overhead  Shared O  Symbol
  52.58%  [kernel]  [k] ddt_get_dedup_histogram
   3.09%  [kernel]  [k] mutex_lock
   2.76%  [kernel]  [k] osq_lock
   2.69%  [kernel]  [k] mutex_spin_on_owner.isra.3
   2.60%  [kernel]  [k] _raw_spin_lock
   2.29%  [kernel]  [k] mutex_unlock
   2.07%  [kernel]  [k] vdev_get_stats_ex_impl
   1.51%  [kernel]  [k] ddt_histogram_stat
   1.08%  [kernel]  [k] txg_sync_thread
   1.01%  [kernel]  [k] zio_create
   0.90%  [kernel]  [k] __kmalloc_node
   0.77%  [kernel]  [k] spa_sync
   0.76%  [kernel]  [k] __cv_broadcast
   0.71%  [kernel]  [k] update_blocked_averages
   0.68%  [kernel]  [k] kfree
   0.62%  [kernel]  [k] zap_hash
   0.62%  [kernel]  [k] ddt_sync
   0.59%  [kernel]  [k] spa_config_enter
   0.45%  [kernel]  [k] spa_config_exit
   0.44%  [kernel]  [k] zio_done
   0.43%  [kernel]  [k] __memset
   0.43%  [kernel]  [k] dbuf_hash
   0.40%  [kernel]  [k] __bitmap_weight
   0.39%  [kernel]  [k] dbuf_find
   0.38%  [kernel]  [k] _raw_spin_lock_irqsave
   0.35%  [kernel]  [k] perf_pmu_sched_task
   0.33%  [kernel]  [k] zio_wait
   0.32%  [kernel]  [k] kmem_cache_alloc
   0.31%  [kernel]  [k] down_read
   0.30%  [kernel]  [k] zap_leaf_lookup
   0.29%  [kernel]  [k] _cond_resched
   0.28%  [kernel]  [k] dsl_scan_active
   0.27%  [kernel]  [k] zio_wait_for_children
   0.27%  [kernel]  [k] zrl_add
   0.26%  [kernel]  [k] spl_kmem_cache_alloc
   0.26%  [kernel]  [k] __dbuf_hold_impl

--------------------- cache -------------------------------------------------------  ----- slab ------  ---- object -----  --- emergency ---
name                                    flags      size     alloc slabsize  objsize  total alloc   max  total alloc   max  dlock alloc   max
spl_vn_cache                          0x00020      8192       104     8192      104      1     1     7     42     1   256      0     0     0
spl_vn_file_cache                     0x00020      8192       112     8192      112      1     1     7     42     1   256      0     0     0
spl_zlib_workspace_cache              0x00240         0         0  2145216   268104      0     0     0      0     0     0      0     0     0
ddt_cache                             0x00040    398464    347984   199232    24856      2     2     6     16    14    48      0     0     0
zio_buf_20480                         0x00042         0         0   200704    20480      0     0    18      0     0   144      0     0     0
zio_data_buf_20480                    0x00042   1806336   1474560   200704    20480      9     9    51     72    72   408      0     0     0
zio_buf_24576                         0x00042         0         0   233472    24576      0     0    19      0     0   152      0     0     0
zio_data_buf_24576                    0x00042    466944     98304   233472    24576      2     2    59     16     4   472      0     0     0
zio_buf_28672                         0x00042         0         0   266240    28672      0     0    26      0     0   208      0     0     0
zio_data_buf_28672                    0x00042   1863680    716800   266240    28672      7     7    58     56    25   464      0     0     0
zio_buf_32768                         0x00042   1196032   1048576   299008    32768      4     4    49     32    32   392      0     0     0
zio_data_buf_32768                    0x00042   2093056   1605632   299008    32768      7     7    38     56    49   304      0     0     0
zio_buf_40960                         0x00042         0         0   364544    40960      0     0    19      0     0   152      0     0     0
zio_data_buf_40960                    0x00042   3645440   2662400   364544    40960     10    10    87     80    65   696      0     0     0
zio_buf_49152                         0x00042         0         0   430080    49152      0     0    12      0     0    96      0     0     0
zio_data_buf_49152                    0x00042   3440640   2850816   430080    49152      8     8   122     64    58   976      0     0     0
zio_buf_57344                         0x00042         0         0   495616    57344      0     0     9      0     0    72      0     0     0
zio_data_buf_57344                    0x00042   3469312   1949696   495616    57344      7     7   124     56    34   992      0     0     0
zio_buf_65536                         0x00042         0         0   561152    65536      0     0     7      0     0    56      0     0     0
zio_data_buf_65536                    0x00042   2805760   1507328   561152    65536      5     5   154     40    23  1232      0     0     0
zio_buf_81920                         0x00042         0         0   692224    81920      0     0    14      0     0   112      0     0     0
zio_data_buf_81920                    0x00042  10383360   7372800   692224    81920     15    15   133    120    90  1064      0     0     0
zio_buf_98304                         0x00042         0         0   823296    98304      0     0    11      0     0    88      0     0     0
zio_data_buf_98304                    0x00042  13172736   9928704   823296    98304     16    16   142    128   101  1136      0     0     0
zio_buf_114688                        0x00042         0         0   954368   114688      0     0     9      0     0    72      0     0     0
zio_data_buf_114688                   0x00042   6680576   5619712   954368   114688      7     7    34     56    49   272      0     0     0
zio_buf_131072                        0x00042 142192640 133562368  1085440   131072    131   131   182   1048  1019  1456      0     0     0
zio_data_buf_131072                   0x00042  97689600  85721088  1085440   131072     90    90   241    720   654  1928      0     0     0
zio_buf_163840                        0x00042         0         0  1347584   163840      0     0     0      0     0     0      0     0     0
zio_data_buf_163840                   0x00042         0         0  1347584   163840      0     0     0      0     0     0      0     0     0
zio_buf_196608                        0x00042         0         0  1609728   196608      0     0     0      0     0     0      0     0     0
zio_data_buf_196608                   0x00042         0         0  1609728   196608      0     0     0      0     0     0      0     0     0
zio_buf_229376                        0x00042         0         0  1871872   229376      0     0     0      0     0     0      0     0     0
zio_data_buf_229376                   0x00042         0         0  1871872   229376      0     0     0      0     0     0      0     0     0
zio_buf_262144                        0x00042         0         0  2134016   262144      0     0     0      0     0     0      0     0     0
zio_data_buf_262144                   0x00042         0         0  2134016   262144      0     0     0      0     0     0      0     0     0
zio_buf_327680                        0x00042         0         0  2658304   327680      0     0     0      0     0     0      0     0     0
zio_data_buf_327680                   0x00042         0         0  2658304   327680      0     0     0      0     0     0      0     0     0
zio_buf_393216                        0x00042         0         0  3182592   393216      0     0     0      0     0     0      0     0     0
zio_data_buf_393216                   0x00042         0         0  3182592   393216      0     0     0      0     0     0      0     0     0
zio_buf_458752                        0x00042         0         0  3706880   458752      0     0     0      0     0     0      0     0     0
zio_data_buf_458752                   0x00042         0         0  3706880   458752      0     0     0      0     0     0      0     0     0
zio_buf_524288                        0x00042         0         0  4231168   524288      0     0     0      0     0     0      0     0     0
zio_data_buf_524288                   0x00042         0         0  4231168   524288      0     0     0      0     0     0      0     0     0
zio_buf_655360                        0x00042         0         0  5279744   655360      0     0     0      0     0     0      0     0     0
zio_data_buf_655360                   0x00042         0         0  5279744   655360      0     0     0      0     0     0      0     0     0
zio_buf_786432                        0x00042         0         0  6328320   786432      0     0     0      0     0     0      0     0     0
zio_data_buf_786432                   0x00042         0         0  6328320   786432      0     0     0      0     0     0      0     0     0
zio_buf_917504                        0x00042         0         0  7376896   917504      0     0     0      0     0     0      0     0     0
zio_data_buf_917504                   0x00042         0         0  7376896   917504      0     0     0      0     0     0      0     0     0
zio_buf_1048576                       0x00042         0         0  8425472  1048576      0     0     0      0     0     0      0     0     0
zio_data_buf_1048576                  0x00042         0         0  8425472  1048576      0     0     0      0     0     0      0     0     0
zio_buf_1310720                       0x00042         0         0 10522624  1310720      0     0     0      0     0     0      0     0     0
zio_data_buf_1310720                  0x00042         0         0 10522624  1310720      0     0     0      0     0     0      0     0     0
zio_buf_1572864                       0x00042         0         0 12619776  1572864      0     0     0      0     0     0      0     0     0
zio_data_buf_1572864                  0x00042         0         0 12619776  1572864      0     0     0      0     0     0      0     0     0
zio_buf_1835008                       0x00042         0         0 14716928  1835008      0     0     0      0     0     0      0     0     0
zio_data_buf_1835008                  0x00042         0         0 14716928  1835008      0     0     0      0     0     0      0     0     0
zio_buf_2097152                       0x00042         0         0 16814080  2097152      0     0     0      0     0     0      0     0     0
zio_data_buf_2097152                  0x00042         0         0 16814080  2097152      0     0     0      0     0     0      0     0     0
zio_buf_2621440                       0x00042         0         0 21008384  2621440      0     0     0      0     0     0      0     0     0
zio_data_buf_2621440                  0x00042         0         0 21008384  2621440      0     0     0      0     0     0      0     0     0
zio_buf_3145728                       0x00042         0         0 25202688  3145728      0     0     0      0     0     0      0     0     0
zio_data_buf_3145728                  0x00042         0         0 25202688  3145728      0     0     0      0     0     0      0     0     0
zio_buf_3670016                       0x00042         0         0 29396992  3670016      0     0     0      0     0     0      0     0     0
zio_data_buf_3670016                  0x00042         0         0 29396992  3670016      0     0     0      0     0     0      0     0     0
zio_buf_4194304                       0x00042         0         0 29392896  4194304      0     0     0      0     0     0      0     0     0
zio_data_buf_4194304                  0x00042         0         0 29392896  4194304      0     0     0      0     0     0      0     0     0
zio_buf_5242880                       0x00042         0         0 31485952  5242880      0     0     0      0     0     0      0     0     0
zio_data_buf_5242880                  0x00042         0         0 31485952  5242880      0     0     0      0     0     0      0     0     0
zio_buf_6291456                       0x00042         0         0 31481856  6291456      0     0     0      0     0     0      0     0     0
zio_data_buf_6291456                  0x00042         0         0 31481856  6291456      0     0     0      0     0     0      0     0     0
zio_buf_7340032                       0x00042         0         0 29380608  7340032      0     0     0      0     0     0      0     0     0
zio_data_buf_7340032                  0x00042         0         0 29380608  7340032      0     0     0      0     0     0      0     0     0
zio_buf_8388608                       0x00042         0         0 25182208  8388608      0     0     0      0     0     0      0     0     0
zio_data_buf_8388608                  0x00042         0         0 25182208  8388608      0     0     0      0     0     0      0     0     0
zio_buf_10485760                      0x00042         0         0 31473664 10485760      0     0     0      0     0     0      0     0     0
zio_data_buf_10485760                 0x00042         0         0 31473664 10485760      0     0     0      0     0     0      0     0     0
zio_buf_12582912                      0x00042         0         0 25178112 12582912      0     0     0      0     0     0      0     0     0
zio_data_buf_12582912                 0x00042         0         0 25178112 12582912      0     0     0      0     0     0      0     0     0
zio_buf_14680064                      0x00042         0         0 29372416 14680064      0     0     0      0     0     0      0     0     0
zio_data_buf_14680064                 0x00042         0         0 29372416 14680064      0     0     0      0     0     0      0     0     0
zio_buf_16777216                      0x00042         0         0 16785408 16777216      0     0     0      0     0     0      0     0     0
zio_data_buf_16777216                 0x00042         0         0 16785408 16777216      0     0     0      0     0     0      0     0     0

------------------------------------------------------------------------
ZFS Subsystem Report                                Tue Nov 15 11:31:23 2016
ARC Summary: (HEALTHY)
    Memory Throttle Count:                  0

ARC Misc:
    Deleted:                                271.60k
    Mutex Misses:                           61.75k
    Evict Skips:                            61.75k

ARC Size:                           27.50%  1.05    GiB
    Target Size: (Adaptive)         0.82%   32.00   MiB
    Min Size (Hard Limit):          0.82%   32.00   MiB
    Max Size (High Water):          122:1   3.83    GiB

ARC Size Breakdown:
    Recently Used Cache Size:       0.07%   760.33  KiB
    Frequently Used Cache Size:     99.93%  1.05    GiB

ARC Hash Breakdown:
    Elements Max:                           108.59k
    Elements Current:               33.89%  36.80k
    Collisions:                             101.37k
    Chain Max:                              4
    Chains:                                 597

ARC Total accesses:                                 3.61m
    Cache Hit Ratio:                38.37%  1.39m
    Cache Miss Ratio:               61.63%  2.23m
    Actual Hit Ratio:               37.28%  1.35m

    Data Demand Efficiency:         57.94%  1.73m
    Data Prefetch Efficiency:       85.80%  65.33k

    CACHE HITS BY CACHE LIST:
      Most Recently Used:           4.17%   57.79k
      Most Frequently Used:         92.97%  1.29m
      Most Recently Used Ghost:     4.10%   56.89k
      Most Frequently Used Ghost:   37.25%  516.42k

    CACHE HITS BY DATA TYPE:
      Demand Data:                  72.32%  1.00m
      Prefetch Data:                4.04%   56.06k
      Demand Metadata:              21.98%  304.73k
      Prefetch Metadata:            1.65%   22.93k

    CACHE MISSES BY DATA TYPE:
      Demand Data:                  32.69%  727.80k
      Prefetch Data:                0.42%   9.28k
      Demand Metadata:              65.67%  1.46m
      Prefetch Metadata:            1.22%   27.25k

MemTotal:        8030788 kB
MemFree:          160124 kB
MemAvailable:    1007232 kB
Buffers:            8128 kB
Cached:           809028 kB
SwapCached:       106308 kB
Active:           934544 kB
Inactive:         968884 kB
Active(anon):     486436 kB
Inactive(anon):   615180 kB
Active(file):     448108 kB
Inactive(file):   353704 kB
Unevictable:       18076 kB
Mlocked:           18076 kB
SwapTotal:       8388604 kB
SwapFree:        5709976 kB
Dirty:                24 kB
Writeback:             0 kB
AnonPages:       1089000 kB
Mapped:           730832 kB
Shmem:             11324 kB
Slab:            3311688 kB
SReclaimable:     351260 kB
SUnreclaim:      2960428 kB
KernelStack:      247392 kB
PageTables:       406888 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    12403996 kB
Committed_AS:   107995212 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:      6144 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     5554788 kB
DirectMap2M:     2824192 kB

11 1 0x01 92 4416 15200916354 328514879975550
name                            type data
hits                            4    1386443
misses                          4    2226447
demand_data_hits                4    1002735
demand_data_misses              4    727798
demand_metadata_hits            4    304727
demand_metadata_misses          4    1462123
prefetch_data_hits              4    56056
prefetch_data_misses            4    9279
prefetch_metadata_hits          4    22925
prefetch_metadata_misses        4    27247
mru_hits                        4    57790
mru_ghost_hits                  4    56888
mfu_hits                        4    1288948
mfu_ghost_hits                  4    516425
deleted                         4    271603
mutex_miss                      4    61753
evict_skip                      4    20788228
evict_not_enough                4    2098855
evict_l2_cached                 4    0
evict_l2_eligible               4    7344444928
evict_l2_ineligible             4    877400064
evict_l2_skip                   4    0
hash_elements                   4    36798
hash_elements_max               4    108593
hash_collisions                 4    101371
hash_chains                     4    597
hash_chain_max                  4    4
p                               4    778576
c                               4    33554432
c_min                           4    33554432
c_max                           4    4111763456
size                            4    1130913888
compressed_size                 4    116959232
uncompressed_size               4    693078016
overhead_size                   4    751551488
hdr_size                        4    15365744
data_size                       4    100401152
metadata_size                   4    768109568
dbuf_size                       4    57442624
dnode_size                      4    140587440
bonus_size                      4    49007360
anon_size                       4    134647296
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    624236032
mru_evictable_data              4    0
mru_evictable_metadata          4    0
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    109627392
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_size                  4    33501184
mfu_ghost_evictable_data        4    131072
mfu_ghost_evictable_metadata    4    33370112
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    6234
memory_indirect_count           4    21782
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    171950
arc_meta_used                   4    1030512736
arc_meta_limit                  4    3083822592
arc_dnode_limit                 4    308382259
arc_meta_max                    4    1756537704
arc_meta_min                    4    16777216
sync_wait_for_async             4    299
demand_hit_predictive_prefetch  4    3832
arc_need_free                   4    0
arc_sys_free                    4    128491520

sometimes it goes away after 3 > drop_caches (or maybe just a coincidence and gone away by itself), most other times i must reboot.

mailinglists35 commented 7 years ago

zpool get all | grep -v default

NAME     PROPERTY                    VALUE                       SOURCE
pool  size                        1008G                       -
pool  capacity                    9%                          -
pool  health                      ONLINE                      -
pool  guid                        7367461532570993000         -
pool  failmode                    continue                    local
pool  dedupratio                  1.00x                       -
pool  free                        911G                        -
pool  allocated                   96.7G                       -
pool  readonly                    off                         -
pool  comment                                                 local
pool  expandsize                  -                           -
pool  freeing                     0                           -
pool  fragmentation               5%                          -
pool  leaked                      0                           -
pool  feature@async_destroy       enabled                     local
pool  feature@empty_bpobj         active                      local
pool  feature@lz4_compress        active                      local
pool  feature@spacemap_histogram  active                      local
pool  feature@enabled_txg         active                      local
pool  feature@hole_birth          active                      local
pool  feature@extensible_dataset  enabled                     local
pool  feature@embedded_data       active                      local
pool  feature@bookmarks           enabled                     local
pool  feature@filesystem_limits   enabled                     local
pool  feature@large_blocks        enabled                     local
pool  feature@large_dnode         disabled                    local
pool  feature@sha512              disabled                    local
pool  feature@skein               disabled                    local
pool  feature@edonr               disabled                    local
pool  feature@userobj_accounting  disabled                    local

  pool: pool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 0h6m with 0 errors on Mon Nov  7 16:24:43 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    pool                                         ONLINE       0     0     0
      scsi-3600508e000000000aa5b6d24a4b4ad04-part8  ONLINE       0     0     0

errors: No known data errors
mailinglists35 commented 7 years ago

on the pool i have 600 small containers running https://github.com/lxc/lxd master (each container is a zfs clone of a dataset, looks like this:

zfs get origin 
[...]
pool/lxd/containers/terminal99      origin  pool/lxd/images/d0e4b74306ae904d45efc81c972deee4b862c5af6d753ed99cfb98fee95a5fba@readonly       -
[...600 identical from 1 to 600...]

additionally on the system there is zfs-auto-snapshot script active

mailinglists35 commented 7 years ago

please let me know if you need any other additional info from the system

behlendorf commented 7 years ago

@mailinglists35 this could be explained by a process, or many processes, repeatedly fetching the pool statistics. i.e. zpool status 0.01 or some such. The high cpu usage of vdev_get_stats_ex_impl is what makes me suspect something like this.

mailinglists35 commented 7 years ago

@behlendorf zpool status 0.001 does allow me to use the pool and I don't see txg_sync spinning 100% (actually i don't see it along the list of top cpu consumers), while in the above case i can't use the pool (file i/o hangs) and txg_sync stays at 100%

here is zpool 0.001 'perf top -p $pidof txg_sync

Overhead  Shared O  Symbol
  65.01%  [kernel]  [k] ddt_get_dedup_histogram
  26.45%  [kernel]  [k] zio_done
   7.60%  [kernel]  [k] mutex_lock
   0.81%  [kernel]  [k] ioremap_page_range
   0.12%  [kernel]  [k] nmi_cpu_backtrace
   0.01%  [kernel]  [k] native_write_msr_safe

instead zpool is here using the most cpu

PID  SYSCPU  USRCPU  VGROW  RGROW  RDDSK   WRDSK ST EXC S  CPU  CMD        1/1
 28770   0.18s   0.04s     0K     0K     0K      0K --   - S  23%  zpool
mailinglists35 commented 7 years ago

@behlendorf to rule out the case of someone repeatedly fetching pool statistics, is it simple to add rate limiting to fetching pool statistics? something like 1 per second? in that case, should the pool continue to be usable even if someone attempts repeatedly to fetch statistics?

mailinglists35 commented 7 years ago

update: noticed this happens if i launch many (~200) lxd containers at once. there is enough free and cached memory. the containers are running debian trusty with no daemons (just init and dhclient)

ironMann commented 7 years ago

I have seen this before in my profilings but I was not paying too much attention. @mailinglists35 Can you collect call trace for ddt_get_dedup_histogram() (perf top -ag)?

This is one possibility:

txg_sync()
  spa_sync()
    spa_update_dspace()
      ddt_get_dedup_dspace()
        ddt_get_dedup_stats()
          ddt_get_dedup_histogram()
behlendorf commented 7 years ago

A call trace would be very helpful. Another possible option which makes sense would be the following. I think it may be more likely because I see vdev_get_stats_ex_impl consuming a lot of time in your perf results. Stalling the txg sync thread would also lead to the hangs your describing.

txg_sync_thread()
  vdev_get_stats()
    vdev_get_stats_ex()
      vdev_get_stats_ex_impl()

Edit: Could you apply the #5412 PR and see if it resolves the issue.

ironMann commented 7 years ago

Here is a small patch for ddt_get_dedup_dspace(). The method goes through 168KiB of ddt histograms on each invocation. The return value of ddt_get_dedup_dspace() can be memoized if there were no changes in ddt_histogram_cache.

diff --git a/include/sys/spa_impl.h b/include/sys/spa_impl.h
index 88bde98..89ca524 100644
--- a/include/sys/spa_impl.h
+++ b/include/sys/spa_impl.h
@@ -239,6 +239,7 @@ struct spa {
    uint64_t    spa_autoexpand;     /* lun expansion on/off */
    ddt_t       *spa_ddt[ZIO_CHECKSUM_FUNCTIONS]; /* in-core DDTs */
    uint64_t    spa_ddt_stat_object;    /* DDT statistics */
+   uint64_t    spa_dedup_dspace;   /* Memoize get_dedup_dspace() */
    uint64_t    spa_dedup_ditto;    /* dedup ditto threshold */
    uint64_t    spa_dedup_checksum; /* default dedup checksum */
    uint64_t    spa_dspace;     /* dspace in normal class */
diff --git a/module/zfs/ddt.c b/module/zfs/ddt.c
index 09a3536..6335e00 100644
--- a/module/zfs/ddt.c
+++ b/module/zfs/ddt.c
@@ -528,10 +528,17 @@ ddt_get_dedup_stats(spa_t *spa, ddt_stat_t *dds_total)
 uint64_t
 ddt_get_dedup_dspace(spa_t *spa)
 {
-   ddt_stat_t dds_total = { 0 };
+   ddt_stat_t dds_total;
+
+   if (spa->spa_dedup_dspace != ~0ULL)
+       return spa->spa_dedup_dspace;
+
+   /* Calculate and memoize the stats */
+   bzero(&dds_total, sizeof (ddt_stat_t));

    ddt_get_dedup_stats(spa, &dds_total);
-   return (dds_total.dds_ref_dsize - dds_total.dds_dsize);
+   spa->spa_dedup_dspace = dds_total.dds_ref_dsize - dds_total.dds_dsize;
+   return (spa->spa_dedup_dspace);
 }

 uint64_t
@@ -915,6 +922,7 @@ ddt_load(spa_t *spa)
         */
        bcopy(ddt->ddt_histogram, &ddt->ddt_histogram_cache,
            sizeof (ddt->ddt_histogram));
+       spa->spa_dedup_dspace = ~0ULL;
    }

    return (0);
@@ -1182,6 +1190,7 @@ ddt_sync_table(ddt_t *ddt, dmu_tx_t *tx, uint64_t txg)

    bcopy(ddt->ddt_histogram, &ddt->ddt_histogram_cache,
        sizeof (ddt->ddt_histogram));
+   spa->spa_dedup_dspace = ~0ULL;
 }

 void
diff --git a/module/zfs/spa_misc.c b/module/zfs/spa_misc.c
index 6ec0521..1857b9a 100644
--- a/module/zfs/spa_misc.c
+++ b/module/zfs/spa_misc.c
@@ -635,6 +635,9 @@ spa_add(const char *name, nvlist_t *config, const char *altroot)
    spa->spa_min_ashift = INT_MAX;
    spa->spa_max_ashift = 0;

+   /* Reset memoized value */
+   spa->spa_dedup_dspace = ~0ULL;
+
    /*
     * As a pool is being created, treat all features as disabled by
     * setting SPA_FEATURE_DISABLED for all entries in the feature
mailinglists35 commented 7 years ago

thank you for the quick replies. I am away from that system over the weeked and I will test the patches on monday

behlendorf commented 7 years ago

I've opened PR #5412 with a cleaner version of the same change. Please use that when testing. You'll also want to make sure that zfs_txg_history=0, this is the default value.

mailinglists35 commented 7 years ago

@behlendorf I have applied PR #5412 but there is no change. The perf -ag shows the call path suggested by @ironMann @ironMann, I will try your patch now (output below is just with PR 5412 applied)

Samples: 115K of event 'cycles:pp', Event count (approx.): 85749218893
  Children      Self  Command        Shared Object        Symbol                                                                                            ◆
+   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] thread_generic_wrapper                                                                        ▒
+   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] kthread                                                                                       ▒
+   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] ret_from_fork                                                                                 ▒
+   36.07%     0.41%  txg_sync       [kernel.kallsyms]    [k] txg_sync_thread                                                                               ▒
+   33.93%     0.29%  txg_sync       [kernel.kallsyms]    [k] spa_sync                                                                                      ▒
+   24.88%     0.00%  lxd            [kernel.kallsyms]    [k] entry_SYSCALL_64_fastpath                                                                     ▒
+   24.31%     0.01%  txg_sync       [kernel.kallsyms]    [k] spa_update_dspace                                                                             ▒
+   24.28%     0.02%  txg_sync       [kernel.kallsyms]    [k] ddt_get_dedup_dspace                                                                          ▒
+   24.24%     0.02%  txg_sync       [kernel.kallsyms]    [k] ddt_get_dedup_stats                                                                           ▒
-   23.15%    23.15%  txg_sync       [kernel.kallsyms]    [k] ddt_get_dedup_histogram                                                                       ▒
     ret_from_fork                                                                                                                                          ▒
     kthread                                                                                                                                                ▒
     thread_generic_wrapper                                                                                                                                 ▒
     txg_sync_thread                                                                                                                                        ▒
     spa_sync                                                                                                                                               ▒
     spa_update_dspace                                                                                                                                      ▒
   - ddt_get_dedup_dspace                                                                                                                                   ▒
      - 23.15% ddt_get_dedup_stats                                                                                                                          ▒
           ddt_get_dedup_histogram                                                                                                                          ▒
        0.00% ddt_get_dedup_histogram                                                                                                                       ▒

Samples: 115K of event 'cycles:pp', Event count (approx.): 85749218893
  Children      Self  Command        Shared Object        Symbol                                                                                            ◆
-   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] thread_generic_wrapper                                                                        ▒
   - thread_generic_wrapper                                                                                                                                 ▒
      - 36.07% txg_sync_thread                                                                                                                              ▒
         - 33.92% spa_sync                                                                                                                                  ▒
            - 24.31% spa_update_dspace                                                                                                                      ▒
               - 24.28% ddt_get_dedup_dspace                                                                                                                ▒
                  - 24.24% ddt_get_dedup_stats                                                                                                              ▒
                       23.15% ddt_get_dedup_histogram                                                                                                       ▒
                       0.63% ddt_histogram_stat

--

Samples: 115K of event 'cycles:pp', Event count (approx.): 85749218893
  Children      Self  Command        Shared Object        Symbol                                                                                            ◆
+   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] thread_generic_wrapper                                                                        ▒
-   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] kthread                                                                                       ▒
     kthread                                                                                                                                                ▒
   - thread_generic_wrapper                                                                                                                                 ▒
      - 36.07% txg_sync_thread                                                                                                                              ▒
         - 33.92% spa_sync                                                                                                                                  ▒
            - 24.31% spa_update_dspace                                                                                                                      ▒
               - 24.28% ddt_get_dedup_dspace                                                                                                                ▒
                  - 24.24% ddt_get_dedup_stats                                                                                                              ▒
                       23.15% ddt_get_dedup_histogram                                                                                                       ▒

Samples: 115K of event 'cycles:pp', Event count (approx.): 85749218893
  Children      Self  Command        Shared Object        Symbol                                                                                            ◆
+   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] thread_generic_wrapper                                                                        ▒
+   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] kthread                                                                                       ▒
-   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] ret_from_fork                                                                                 ▒
     ret_from_fork                                                                                                                                          ▒
     kthread                                                                                                                                                ▒
   - thread_generic_wrapper                                                                                                                                 ▒
      - 36.07% txg_sync_thread                                                                                                                              ▒
         - 33.92% spa_sync                                                                                                                                  ▒
            - 24.31% spa_update_dspace                                                                                                                      ▒
               - 24.28% ddt_get_dedup_dspace                                                                                                                ▒
                  - 24.24% ddt_get_dedup_stats                                                                                                              ▒
                       23.15% ddt_get_dedup_histogram                                                                                                       ▒

Samples: 115K of event 'cycles:pp', Event count (approx.): 85749218893
  Children      Self  Command        Shared Object        Symbol                                                                                            ◆
+   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] thread_generic_wrapper                                                                        ▒
+   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] kthread                                                                                       ▒
+   36.18%     0.00%  txg_sync       [kernel.kallsyms]    [k] ret_from_fork                                                                                 ▒
+   36.07%     0.41%  txg_sync       [kernel.kallsyms]    [k] txg_sync_thread                                                                               ▒
+   33.93%     0.29%  txg_sync       [kernel.kallsyms]    [k] spa_sync                                                                                      ▒
+   24.88%     0.00%  lxd            [kernel.kallsyms]    [k] entry_SYSCALL_64_fastpath                                                                     ▒
+   24.31%     0.01%  txg_sync       [kernel.kallsyms]    [k] spa_update_dspace                                                                             ▒
+   24.28%     0.02%  txg_sync       [kernel.kallsyms]    [k] ddt_get_dedup_dspace                                                                          ▒
+   24.24%     0.02%  txg_sync       [kernel.kallsyms]    [k] ddt_get_dedup_stats                                                                           ▒
+   23.15%    23.15%  txg_sync       [kernel.kallsyms]    [k] ddt_get_dedup_histogram                                                                       ▒
+   20.93%     0.00%  lxd            libc-2.23.so         [.] fchownat                                                                                      ▒
+   20.26%     0.00%  lxd            [kernel.kallsyms]    [k] zfs_setattr                                                                                   ▒
+   20.26%     0.00%  lxd            [kernel.kallsyms]    [k] zpl_setattr                                                                                   ▒
+   20.26%     0.00%  lxd            [kernel.kallsyms]    [k] notify_change                                                                                 ▒
+   20.26%     0.00%  lxd            [kernel.kallsyms]    [k] chown_common                                                                                  ▒
+   20.26%     0.00%  lxd            [kernel.kallsyms]    [k] sys_fchownat                                                                                  ▒
+   20.24%     0.40%  lxd            [kernel.kallsyms]    [k] dmu_tx_assign                                                                                 ▒
+   19.23%     0.09%  lxd            [kernel.kallsyms]    [k] dmu_tx_wait                                                                                   ▒
+   18.57%     0.11%  lxd            [kernel.kallsyms]    [k] txg_wait_open                                                                                 ▒
+   17.27%     0.00%  init           [kernel.kallsyms]    [k] zfs_write                                                                                     ▒
+   17.27%     0.00%  init           [kernel.kallsyms]    [k] zpl_write_common_iovec                                                                        ▒
+   17.27%     0.00%  init           [kernel.kallsyms]    [k] zpl_write                                                                                     ▒
+   17.27%     0.00%  init           [kernel.kallsyms]    [k] __vfs_write                                                                                   ▒
+   17.27%     0.00%  init           [kernel.kallsyms]    [k] vfs_write                                                                                     ▒
+   17.27%     0.00%  init           [kernel.kallsyms]    [k] sys_write                                                                                     ▒
+   17.27%     0.00%  init           [kernel.kallsyms]    [k] entry_SYSCALL_64_fastpath                                                                     ▒
+   17.08%     0.39%  init           [kernel.kallsyms]    [k] dmu_tx_assign                                                                                 ▒
+   14.70%     0.08%  init           [kernel.kallsyms]    [k] dmu_tx_wait                                                                                   ▒
+   14.22%     0.07%  init           [kernel.kallsyms]    [k] txg_wait_open                                                                                 ▒
+   10.87%     0.01%  lxd            [kernel.kallsyms]    [k] __cv_wait                                                                                     ▒
+   10.76%     0.88%  lxd            [kernel.kallsyms]    [k] mutex_lock                                                                                    ▒
+   10.74%     0.47%  lxd            [kernel.kallsyms]    [k] cv_wait_common                                                                                ▒
+    9.89%     0.04%  lxd            [kernel.kallsyms]    [k] __mutex_lock_slowpath                                                                         ▒
+    9.80%     0.35%  lxd            [kernel.kallsyms]    [k] mutex_optimistic_spin                                                                         ▒
+    9.77%     0.01%  init           [kernel.kallsyms]    [k] __cv_wait                                                                                     ▒
+    9.70%     0.28%  init           [kernel.kallsyms]    [k] cv_wait_common                                                                                ▒
+    8.13%     0.57%  init           [kernel.kallsyms]    [k] mutex_lock                                                                                    ▒
+    7.57%     0.03%  init           [kernel.kallsyms]    [k] __mutex_lock_slowpath                                                                         ▒
+    7.46%     0.28%  init           [kernel.kallsyms]    [k] mutex_optimistic_spin                                                                         ▒
+    7.41%     0.00%  txg_quiesce    [kernel.kallsyms]    [k] thread_generic_wrapper                                                                        ▒
+    7.41%     0.00%  txg_quiesce    [kernel.kallsyms]    [k] kthread                                                                                       ▒
+    7.41%     0.00%  txg_quiesce    [kernel.kallsyms]    [k] ret_from_fork                                                                                 ▒
+    7.14%     0.28%  txg_quiesce    [kernel.kallsyms]    [k] txg_quiesce_thread                                                                            ▒
+    5.58%     5.58%  lxd            [kernel.kallsyms]    [k] mutex_spin_on_owner.isra.3                                                                    ▒
+    5.09%     0.70%  lxd            [kernel.kallsyms]    [k] __schedule                                                                                    ▒
+    4.80%     0.07%  lxd            [kernel.kallsyms]    [k] schedule                                                                                      ▒
+    4.68%     0.02%  txg_quiesce    [kernel.kallsyms]    [k] __wake_up                                                                                     ▒
+    4.61%     0.02%  txg_quiesce    [kernel.kallsyms]    [k] __cv_broadcast                                                                                ▒
+    4.58%     0.79%  init           [kernel.kallsyms]    [k] __schedule                                                                                    ▒
+    4.53%     0.14%  txg_quiesce    [kernel.kallsyms]    [k] __wake_up_common                                                                              ▒
+    4.42%     0.08%  init           [kernel.kallsyms]    [k] schedule                                                                                      ▒
+    4.39%     0.08%  txg_quiesce    [kernel.kallsyms]    [k] autoremove_wake_function                                                                      ▒
+    4.39%     0.00%  swapper        [kernel.kallsyms]    [k] rest_init                                                                                     ▒
+    4.39%     0.00%  swapper        [kernel.kallsyms]    [k] start_kernel                                                                                  ▒
+    4.39%     0.00%  swapper        [kernel.kallsyms]    [k] x86_64_start_reservations                                                                     ▒
+    4.39%     0.00%  swapper        [kernel.kallsyms]    [k] x86_64_start_kernel                                                                           ▒
+    4.35%     0.11%  swapper        [kernel.kallsyms]    [k] cpu_startup_entry                                                                             ▒
+    4.32%     0.00%  lxd            libpthread-2.23.so   [.] __GI___libc_open                                                                              ▒
+    4.30%     0.01%  txg_quiesce    [kernel.kallsyms]    [k] default_wake_function                                                                         ▒
mailinglists35 commented 7 years ago

@ironMann it will take me another week until I will manage to learn how to apply your patch. I finally managed to learn how to apply PR 5412 with git apply --check / git apply "downloaded file <URL of behlendorf zfs fork with issue 5400>.patch" but on yours I tried copy/pasting the code as file.patch and ran patch -p1 < /path/to/file.patch inside zfs-dkms src dir, but it fails me with below output even after reverting pr 5412 (also git apply fails)

# patch --dry-run -p1 < /usr/local/src/zfs_ddt.patch
checking file include/sys/spa_impl.h
Hunk #1 FAILED at 239.
1 out of 1 hunk FAILED
checking file module/zfs/ddt.c
Hunk #1 FAILED at 528.
Hunk #2 FAILED at 915.
Hunk #3 FAILED at 1182.
3 out of 3 hunks FAILED
checking file module/zfs/spa_misc.c
Hunk #1 FAILED at 635.
1 out of 1 hunk FAILED
ironMann commented 7 years ago

@mailinglists35 checkout a clean copy of rc2 and git apply should finish without errors

On Nov 25, 2016 09:38, "mailinglists35" notifications@github.com wrote:

@ironMann https://github.com/ironMann it will take me another week until I will manage to learn how to apply your patch. I managed to apply PR 5412 with git apply --check / git apply "<URL of behlendorf zfs fork with issue 5400>.patch" but on yours I tried copy/pasting the code as file.patch and ran patch -p1 < /path/to/file.patch inside zfs-dkms src dir, but it fails me with below output even after reverting pr 5412

patch --dry-run -p1 < /usr/local/src/zfs_ddt.patch

checking file include/sys/spa_impl.h Hunk #1 FAILED at 239. 1 out of 1 hunk FAILED checking file module/zfs/ddt.c Hunk #1 FAILED at 528. Hunk #2 FAILED at 915. Hunk #3 FAILED at 1182. 3 out of 3 hunks FAILED checking file module/zfs/spa_misc.c Hunk #1 FAILED at 635. 1 out of 1 hunk FAILED

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/zfsonlinux/zfs/issues/5400#issuecomment-262906005, or mute the thread https://github.com/notifications/unsubscribe-auth/AAQ4zLD7hiDKOjOvtBJxleknTrI_jkvkks5rBp6NgaJpZM4KyTre .

mailinglists35 commented 7 years ago

@ironMann these are the steps i'm doing:

select the text from your comment using google chrome on windows, and paste on pastebinit as http://pastebin.com/raw/W7znZ5X7

open putty, connect to server vi patch.txt, paste selected text wget --content-disposition http://pastebin.com/raw/W7znZ5X7 dos2unix W7znZ5X7 wget --content-disposition https://github.com/zfsonlinux/zfs/releases/download/zfs-0.7.0-rc2/zfs-0.7.0-rc2.tar.gz tar -xzvf zfs-0.7.0-rc2.tar.gz cd zfs-0.7.0

#git apply --check ../patch.txt 
error: patch failed: include/sys/spa_impl.h:239
error: include/sys/spa_impl.h: patch does not apply
error: patch failed: module/zfs/ddt.c:528
error: module/zfs/ddt.c: patch does not apply
error: patch failed: module/zfs/spa_misc.c:635
error: module/zfs/spa_misc.c: patch does not apply

# git apply --check ../W7znZ5X7
fatal: corrupt patch at line 66
mailinglists35 commented 7 years ago

oddly the two files are reported by diff to be nearly identical, with the edited one having a newline at end of line after I pasted

# diff patch.txt W7znZ5X7
66,67c66
<      * setting SPA_FEATURE_DISABLED for all entries in the feature
<
---
>      * setting SPA_FEATURE_DISABLED for all entries in the feature
\ No newline at end of file
mailinglists35 commented 7 years ago

what I am doing wrong?

ironMann commented 7 years ago

@mailinglists35 Not sure what is the cause. I opened a PR #5425 with the patch.

mailinglists35 commented 7 years ago

thank you, it applied cleanly. running now with both pr 5425 and 5412, still hangs. new perf report -g output:

Samples: 238K of event 'cycles:pp', Event count (approx.): 177373524530
  Children      Self  Command          Shared Object            Symbol                                                                                      ◆
+   43.81%     0.00%  init             [kernel.kallsyms]        [k] zfs_write                                                                               ▒
+   43.81%     0.00%  init             [kernel.kallsyms]        [k] zpl_write_common_iovec                                                                  ▒
+   43.81%     0.00%  init             [kernel.kallsyms]        [k] zpl_write                                                                               ▒
+   43.81%     0.00%  init             [kernel.kallsyms]        [k] __vfs_write                                                                             ▒
+   43.81%     0.00%  init             [kernel.kallsyms]        [k] vfs_write                                                                               ▒
+   43.81%     0.00%  init             [kernel.kallsyms]        [k] sys_write                                                                               ▒
+   43.81%     0.00%  init             [kernel.kallsyms]        [k] entry_SYSCALL_64_fastpath                                                               ▒
+   43.26%     0.92%  init             [kernel.kallsyms]        [k] dmu_tx_assign                                                                           ▒
+   36.19%     0.14%  init             [kernel.kallsyms]        [k] dmu_tx_wait                                                                             ▒
+   34.95%     0.22%  init             [kernel.kallsyms]        [k] txg_wait_open                                                                           ▒
+   22.41%     0.02%  init             [kernel.kallsyms]        [k] __cv_wait                                                                               ▒
+   22.34%     1.54%  init             [kernel.kallsyms]        [k] mutex_lock                                                                              ▒
+   22.16%     0.69%  init             [kernel.kallsyms]        [k] cv_wait_common                                                                          ▒
+   20.81%     0.09%  init             [kernel.kallsyms]        [k] __mutex_lock_slowpath                                                                   ▒
+   20.51%     0.83%  init             [kernel.kallsyms]        [k] mutex_optimistic_spin                                                                   ▒
+   19.88%     0.00%  txg_sync         [kernel.kallsyms]        [k] thread_generic_wrapper                                                                  ▒
+   19.88%     0.00%  txg_sync         [kernel.kallsyms]        [k] kthread                                                                                 ▒
+   19.88%     0.00%  txg_sync         [kernel.kallsyms]        [k] ret_from_fork                                                                           ▒
+   19.70%     0.60%  txg_sync         [kernel.kallsyms]        [k] txg_sync_thread                                                                         ▒
+   18.28%     0.00%  txg_quiesce      [kernel.kallsyms]        [k] thread_generic_wrapper                                                                  ▒
+   18.28%     0.00%  txg_quiesce      [kernel.kallsyms]        [k] kthread                                                                                 ▒
+   18.28%     0.00%  txg_quiesce      [kernel.kallsyms]        [k] ret_from_fork                                                                           ▒
+   16.56%     1.43%  txg_quiesce      [kernel.kallsyms]        [k] txg_quiesce_thread                                                                      ▒
+   14.35%     0.33%  txg_sync         [kernel.kallsyms]        [k] spa_sync                                                                                ▒
+   11.42%    11.42%  init             [kernel.kallsyms]        [k] mutex_spin_on_owner.isra.3                                                              ▒
+   10.75%     1.69%  init             [kernel.kallsyms]        [k] __schedule                                                                              ▒
+   10.33%     0.09%  init             [kernel.kallsyms]        [k] schedule                                                                                ▒
+    8.93%     0.00%  kswapd0          [kernel.kallsyms]        [k] kthread                                                                                 ▒
+    8.93%     0.00%  kswapd0          [kernel.kallsyms]        [k] ret_from_fork                                                                           ▒
+    8.91%     0.01%  kswapd0          [kernel.kallsyms]        [k] kswapd                                                                                  ▒
+    8.84%     0.14%  txg_quiesce      [kernel.kallsyms]        [k] __cv_broadcast                                                                          ▒
+    8.79%     0.02%  txg_quiesce      [kernel.kallsyms]        [k] __wake_up                                                                               ▒
+    8.54%     0.21%  txg_quiesce      [kernel.kallsyms]        [k] __wake_up_common                                                                        ▒
+    8.33%     0.11%  txg_quiesce      [kernel.kallsyms]        [k] autoremove_wake_function                                                                ▒
+    8.22%     0.01%  txg_quiesce      [kernel.kallsyms]        [k] default_wake_function                                                                   ▒
+    8.18%     0.64%  txg_quiesce      [kernel.kallsyms]        [k] try_to_wake_up                                                                          ▒
+    8.11%     0.20%  kswapd0          [kernel.kallsyms]        [k] shrink_zone                                                                             ▒
+    7.77%     7.77%  init             [kernel.kallsyms]        [k] osq_lock                                                                                ▒
+    7.49%     0.12%  swapper          [kernel.kallsyms]        [k] cpu_startup_entry                                                                       ▒
+    5.25%     0.06%  txg_quiesce      [kernel.kallsyms]        [k] ttwu_do_activate.constprop.90                                                           ▒
+    4.92%     2.06%  kswapd0          [kernel.kallsyms]        [k] shrink_slab.part.40                                                                     ▒
+    4.90%     0.11%  init             [kernel.kallsyms]        [k] deactivate_task                                                                         ▒
+    4.86%     1.67%  txg_sync         [kernel.kallsyms]        [k] mutex_lock                                                                              ▒
+    4.59%     0.22%  txg_quiesce      [kernel.kallsyms]        [k] activate_task                                                                           ▒
+    4.55%     0.03%  txg_sync         [kernel.kallsyms]        [k] zap_contains                                                                            ▒
+    4.53%     0.41%  init             [kernel.kallsyms]        [k] dequeue_task_fair                                                                       ▒
+    4.51%     0.03%  txg_sync         [kernel.kallsyms]        [k] zap_lookup_norm                                                                         ▒
+    4.50%     0.00%  swapper          [kernel.kallsyms]        [k] start_secondary                                                                         ▒
+    4.16%     0.36%  txg_quiesce      [kernel.kallsyms]        [k] enqueue_task_fair                                                                       ▒
+    3.86%     0.83%  init             [kernel.kallsyms]        [k] dequeue_entity                                                                          ▒
+    3.36%     0.82%  init             [kernel.kallsyms]        [k] pick_next_task_fair                                                                     ▒
+    3.32%     0.23%  txg_sync         [kernel.kallsyms]        [k] zio_wait                                                                                ▒
+    3.30%     0.97%  txg_quiesce      [kernel.kallsyms]        [k] enqueue_entity                                                                          ▒
+    3.19%     0.01%  txg_sync         [kernel.kallsyms]        [k] __mutex_lock_slowpath                                                                   ▒
+    3.17%     0.13%  txg_sync         [kernel.kallsyms]        [k] mutex_optimistic_spin                                                                   ▒
+    3.06%     0.00%  swapper          [kernel.kallsyms]        [k] rest_init                                                                               ▒
+    3.06%     0.00%  swapper          [kernel.kallsyms]        [k] start_kernel                                                                            ▒
+    3.06%     0.00%  swapper          [kernel.kallsyms]        [k] x86_64_start_reservations                                                               ▒
+    3.06%     0.00%  swapper          [kernel.kallsyms]        [k] x86_64_start_kernel                                                                     ▒
mailinglists35 commented 7 years ago
Samples: 238K of event 'cycles:pp', Event count (approx.): 177373524530
  Children      Self  Command          Shared Object            Symbol                                                                                      ◆
-   43.81%     0.00%  init             [kernel.kallsyms]        [k] zfs_write                                                                               ▒
   - zfs_write                                                                                                                                              ▒
      - 43.26% dmu_tx_assign                                                                                                                                ▒
         + 36.18% dmu_tx_wait                                                                                                                               ▒
         + 1.69% txg_hold_open                                                                                                                              ▒
         + 1.43% dsl_dir_tempreserve_space                                                                                                                  ▒
         + 0.48% mutex_lock                                                                                                                                 ▒
           0.42% mutex_unlock                                                                                                                               ▒
         + 0.41% txg_rele_to_sync                                                                                                                           ▒
         + 0.39% dsl_pool_need_dirty_delay                                                                                                                  ▒
         + 0.36% dsl_dataset_prev_snap_txg                                                                                                                  ▒
           0.36% _raw_spin_lock                                                                                                                             ▒
           0.17% __cv_broadcast                                                                                                                             ▒
         + 0.15% txg_rele_to_quiesce                                                                                                                        ▒
           0.08% spa_last_synced_txg                                                                                                                        ▒
           0.07% arc_tempreserve_space                                                                                                                      ▒
           0.04% _cond_resched                                                                                                                              ▒
           0.03% spl_kmem_alloc                                                                                                                             ▒
           0.02% spa_tx_assign_add_nsecs                                                                                                                    ▒
         + 0.02% apic_timer_interrupt                                                                                                                       ▒
           0.02% getrawmonotonic64                                                                                                                          ▒
           0.02% txg_wait_open                                                                                                                              ▒
           0.01% spa_get_asize                                                                                                                              ▒
           0.01% spa_suspended                                                                                                                              ▒
           0.01% dsl_dir_tempreserve_clear                                                                                                                  ▒
        0.37% mutex_unlock                                                                                                                                  ▒
        0.03% spa_suspended                                                                                                                                 ▒
        0.02% mutex_lock                                                                                                                                    ▒
        0.02% dsl_dir_tempreserve_space                                                                                                                     ▒
        0.02% _raw_spin_lock                                                                                                                                ▒
        0.02% dsl_dataset_prev_snap_txg                                                                                                                     ▒
        0.02% dsl_pool_need_dirty_delay                                                                                                                     ▒
      + 0.01% apic_timer_interrupt                                                                                                                          ▒
        0.01% spa_get_asize                                                                                                                                 ▒
        0.01% txg_hold_open                                                                                                                                 ▒
        0.01% txg_rele_to_quiesce                                                                                                                           ▒
        0.01% dmu_tx_wait                                                                                                                                   ▒
        0.00% __cv_broadcast                                                                                                                                ▒
        0.00% txg_rele_to_sync                                                                                                                              ▒
mailinglists35 commented 7 years ago
-   43.26%     0.92%  init             [kernel.kallsyms]        [k] dmu_tx_assign                                                                           ▒
   - 42.34% dmu_tx_assign                                                                                                                                   ▒
      - 36.18% dmu_tx_wait                                                                                                                                  ▒
         - 34.94% txg_wait_open                                                                                                                             ▒
            - 22.39% __cv_wait                                                                                                                              ▒
               - 22.12% cv_wait_common                                                                                                                      ▒
                  - 10.28% mutex_lock                                                                                                                       ▒
                     - 10.07% __mutex_lock_slowpath                                                                                                         ▒
                        - 9.97% mutex_optimistic_spin                                                                                                       ▒
                             5.30% mutex_spin_on_owner.isra.3                                                                                               ▒
                             4.13% osq_lock                                                                                                                 ▒
                             0.09% osq_unlock                                                                                                               ▒
                           + 0.08% apic_timer_interrupt                                                                                                     ▒
                           + 0.05% schedule_preempt_disabled                                                                                                ▒
                           + 0.00% ret_from_intr                                                                                                            ▒
                             0.00% schedule                                                                                                                 ▒
                        + 0.05% schedule_preempt_disabled                                                                                                   ▒
                          0.01% mutex_spin_on_owner.isra.3                                                                                                  ▒
                          0.01% osq_lock                                                                                                                    ▒
                          0.01% osq_unlock                                                                                                                  ▒
                        + 0.01% _raw_spin_lock                                                                                                              ▒
                        + 0.00% apic_timer_interrupt                                                                                                        ▒
                          0.00% smp_reschedule_interrupt                                                                                                    ▒
                       0.00% mutex_optimistic_spin                                                                                                          ▒
                       0.00% _raw_spin_lock                                                                                                                 ▒
                  - 10.05% schedule                                                                                                                         ▒
                     - 9.90% __schedule                                                                                                                     ▒
                        - 4.82% deactivate_task                                                                                                             ▒
                           - 4.45% dequeue_task_fair                                                                                                        ▒
                              + 3.77% dequeue_entity                                                                                                        ▒
                                0.09% update_cfs_shares                                                                                                     ▒
                                0.07% account_entity_dequeue                                                                                                ▒
                                0.04% update_curr                                                                                                           ▒
                                0.03% clear_buddies                                                                                                         ▒
                                0.03% set_next_buddy                                                                                                        ▒
                                0.02% update_min_vruntime                                                                                                   ▒
                                0.00% __compute_runnable_contrib                                                                                            ▒
                           + 0.20% update_rq_clock.part.80                                                                                                  ▒
                             0.03% dequeue_entity                                                                                                           ▒
                             0.01% sched_clock_cpu                                                                                                          ▒
                             0.01% update_cfs_shares                                                                                                        ▒
                             0.01% set_next_buddy                                                                                                           ▒
                             0.01% hrtick_update                                                                                                            ▒
                        - 3.23% pick_next_task_fair                                                                                                         ▒
                           + 1.70% set_next_entity                                                                                                          ▒
                           + 0.30% pick_next_entity                                                                                                         ▒
mailinglists35 commented 7 years ago

also now i see txg_quiesce going up to 90%

-   18.28%     0.00%  txg_quiesce      [kernel.kallsyms]        [k] thread_generic_wrapper                                                                  ▒
   - thread_generic_wrapper                                                                                                                                 ▒
      - 16.56% txg_quiesce_thread                                                                                                                           ▒
         + 8.84% __cv_broadcast                                                                                                                             ▒
         + 2.03% __cv_wait_sig                                                                                                                              ▒
         + 1.87% mutex_lock                                                                                                                                 ▒
         + 1.12% _raw_spin_lock                                                                                                                             ▒
           0.87% __bitmap_weight                                                                                                                            ▒

Samples: 238K of event 'cycles:pp', Event count (approx.): 177373524530
  Children      Self  Command          Shared Object            Symbol                                                                                      ▒
-   18.28%     0.00%  txg_quiesce      [kernel.kallsyms]        [k] thread_generic_wrapper                                                                  ◆
   - thread_generic_wrapper                                                                                                                                 ▒
      - 16.56% txg_quiesce_thread                                                                                                                           ▒
         - 8.84% __cv_broadcast                                                                                                                             ▒
            - 8.70% __wake_up                                                                                                                               ▒
               - 8.53% __wake_up_common                                                                                                                     ▒
                  - 8.32% autoremove_wake_function                                                                                                          ▒
                     - 8.21% default_wake_function                                                                                                          ▒
                        - 8.11% try_to_wake_up                                                                                                              ▒
                           - 5.23% ttwu_do_activate.constprop.90                                                                                            ▒
                              - 4.56% activate_task                                                                                                         ▒
                                 - 4.14% enqueue_task_fair                                                                                                  ▒
                                    - 3.25% enqueue_entity                                                                                                  ▒
                                       - 1.22% update_cfs_shares                                                                                            ▒
                                          - 0.33% update_curr                                                                                               ▒
                                               0.16% __calc_delta                                                                                           ▒
                                               0.02% update_min_vruntime                                                                                    ▒
                                            0.03% account_entity_dequeue                                                                                    ▒
                                            0.03% account_entity_enqueue                                                                                    ▒
                                            0.02% __calc_delta                                                                                              ▒
                                            0.01% update_min_vruntime                                                                                       ▒
                                       + 0.46% update_curr                                                                                                  ▒
                                       + 0.23% __enqueue_entity             

Samples: 238K of event 'cycles:pp', Event count (approx.): 177373524530
  Children      Self  Command          Shared Object            Symbol                                                                                      ◆
-   18.28%     0.00%  txg_quiesce      [kernel.kallsyms]        [k] thread_generic_wrapper                                                                  ▒
   - thread_generic_wrapper                                                                                                                                 ▒
      - 16.56% txg_quiesce_thread                                                                                                                           ▒
         + 8.84% __cv_broadcast                                                                                                                             ▒
         - 2.03% __cv_wait_sig                                                                                                                              ▒
            - 1.96% cv_wait_common                                                                                                                          ▒
               - 0.92% schedule                                                                                                                             ▒
                  + 0.87% __schedule                                                                                                                        ▒
                    0.01% pick_next_task_idle                                                                                                               ▒
                    0.01% _raw_spin_lock_irq                                                                                                                ▒
                    0.00% pick_next_task_fair                                                                                                               ▒
                    0.00% deactivate_task                                                                                                                   ▒
                    0.00% rcu_note_context_switch                                                                                                           ▒
                    0.00% finish_task_switch                                                                                                                ▒
                  + 0.00% apic_timer_interrupt                                                                                                              ▒
               + 0.43% mutex_lock                                                                                                                           ▒
               + 0.15% mutex_unlock                                                                                                                         ▒
               + 0.09% prepare_to_wait_exclusive                                                                                                            ▒
               + 0.09% __wake_up                                                                                                                            ▒
               + 0.04% _raw_spin_lock                                                                                                                       ▒
                 0.01% _raw_spin_unlock_irqrestore                                                                                                          ▒
               + 0.01% finish_wait                                                                                                                          ▒
                 0.01% _raw_spin_lock_irqsave                                                                                                               ▒
                 0.01% _cond_resched                                                                                                                        ▒
                 0.00% __mutex_unlock_slowpath                                                                                                              ▒
                 0.00% __mutex_lock_slowpath                                                                                                                ▒
                 0.00% __schedule                                                                                                                           ▒
                 0.00% __wake_up_common                                                                                                                     ▒
               + 0.00% apic_timer_interrupt                                                                                                                 ▒
              0.05% mutex_unlock                                                                                                                            ▒
              0.01% mutex_lock                                                                                                                              ▒
              0.00% schedule                                                                                                                                ▒
              0.00% finish_wait                                                                                                                             ▒
              0.00% prepare_to_wait_exclusive                                                                                                               ▒
              0.00% __wake_up                                                                                                                               ▒
            + 0.00% apic_timer_interrupt                                                                                                                    ▒
              0.00% _raw_spin_lock                                                                                                                          ▒
         - 1.87% mutex_lock                                                                                                                                 ▒
            - 0.39% __mutex_lock_slowpath                                                                                                                   ▒
               - 0.38% mutex_optimistic_spin                                                                                                                ▒
                    0.21% mutex_spin_on_owner.isra.3                                                                                                        ▒
                    0.05% osq_lock                                                                                                                          ▒
                    0.02% osq_unlock                                                                                                                        ▒
                  - 0.01% apic_timer_interrupt                                                                                                              ▒
                     - smp_apic_timer_interrupt                                                                                                             ▒
                        - 0.01% irq_exit                                                                                                                    ▒
                             __do_softirq                                                                                                                   ▒
                             run_rebalance_domains                                                                                                          ▒
                             rebalance_domains                                                                                                              ▒
                             update_blocked_averages                                                                                                        ▒
                        - 0.00% local_apic_timer_interrupt                                                                                                  ▒
                           - hrtimer_interrupt                                                                                                              ▒
                              - ktime_get_update_offsets_now                                                                                                ▒
                                read_tsc                                    
mailinglists35 commented 7 years ago
# cat /proc/spl/kstat/zfs/arcstats
11 1 0x01 92 4416 6608467357140 10345217200879
name                            type data
hits                            4    1526265
misses                          4    1841784
demand_data_hits                4    468546
demand_data_misses              4    88614
demand_metadata_hits            4    357372
demand_metadata_misses          4    1599485
prefetch_data_hits              4    720
prefetch_data_misses            4    175
prefetch_metadata_hits          4    699627
prefetch_metadata_misses        4    153510
mru_hits                        4    102644
mru_ghost_hits                  4    48898
mfu_hits                        4    725001
mfu_ghost_hits                  4    68485
deleted                         4    217533
mutex_miss                      4    1132345
evict_skip                      4    205204542
evict_not_enough                4    2195789
evict_l2_cached                 4    0
evict_l2_eligible               4    5206740992
evict_l2_ineligible             4    1047908352
evict_l2_skip                   4    0
hash_elements                   4    29114
hash_elements_max               4    75820
hash_collisions                 4    17787
hash_chains                     4    404
hash_chain_max                  4    3
p                               4    1556864
c                               4    33554432
c_min                           4    33554432
c_max                           4    4111763456
size                            4    1154707936
compressed_size                 4    332038656
uncompressed_size               4    751435776
overhead_size                   4    582264832
hdr_size                        4    12791456
data_size                       4    94294016
metadata_size                   4    820009472
dbuf_size                       4    50510512
dnode_size                      4    133581840
bonus_size                      4    43520640
anon_size                       4    311985152
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    468071424
mru_evictable_data              4    0
mru_evictable_metadata          4    0
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    134246912
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_size                  4    33021440
mfu_ghost_evictable_data        4    262144
mfu_ghost_evictable_metadata    4    32759296
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    2689
memory_indirect_count           4    3485
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    1284666
arc_meta_used                   4    1060413920
arc_meta_limit                  4    3083822592
arc_dnode_limit                 4    308382259
arc_meta_max                    4    1450769152
arc_meta_min                    4    16777216
sync_wait_for_async             4    7222
demand_hit_predictive_prefetch  4    34318
arc_need_free                   4    0
arc_sys_free                    4    128491520
mailinglists35 commented 7 years ago

i can always trigger this by simultaneously starting strongswan inside 600 identical lxd containers (all appear to be zfs clones). inside a container there is ubuntu trusty, doing a normal init (i've disabled as many boot services as i could, and after a normal startup it shows only init, dhclient and systemd-udev processes)

ironMann commented 7 years ago

@mailinglists35 With #5425 you don't see ddt_get_dedup_dspace and friends in your traces, so the patch is worth keeping. But, due to nature of your workload you continue to experience a lot of lock contention. If you're inclined to do more testing, I have #5428 and zfsonlinux/spl#587 (They must go together) which should decrease mutex overhead, in my limited testing by about 10 - 14%.

mailinglists35 commented 7 years ago

@ironMann thank you. I have applied pr 5428 and 587 on top of existing 5412+5425. there is no change in cpu consumption. perf record -ag sleep 10 pasted below

  PID  THR   SYSCPU  USRCPU   VGROW   RGROW  RDDSK   WRDSK ST  EXC  S CPUNR   CPU  CMD        1/2
15023    1    0.98s   0.00s      0K      0K     0K      0K --    -  R     1  100%  txg_sync
15022    1    0.89s   0.00s      0K      0K     0K      0K --    -  R     3   92%  txg_quiesce

Samples: 119K of event 'cycles:pp', Event count (approx.): 88912982833
  Children      Self  Command          Shared Object       Symbol
+   44.93%     0.00%  init             [kernel.kallsyms]   [k] zfs_write
+   44.93%     0.00%  init             [kernel.kallsyms]   [k] zpl_write_common_iovec
+   44.93%     0.00%  init             [kernel.kallsyms]   [k] zpl_write
+   44.93%     0.00%  init             [kernel.kallsyms]   [k] __vfs_write
+   44.93%     0.00%  init             [kernel.kallsyms]   [k] vfs_write
+   44.93%     0.00%  init             [kernel.kallsyms]   [k] sys_write
+   44.93%     0.00%  init             [kernel.kallsyms]   [k] entry_SYSCALL_64_fastpath
+   44.42%     1.30%  init             [kernel.kallsyms]   [k] dmu_tx_assign
+   37.59%     0.19%  init             [kernel.kallsyms]   [k] dmu_tx_wait
+   36.50%     0.48%  init             [kernel.kallsyms]   [k] txg_wait_open
+   25.95%     0.00%  txg_sync         [kernel.kallsyms]   [k] thread_generic_wrapper
+   25.95%     0.00%  txg_sync         [kernel.kallsyms]   [k] kthread
+   25.95%     0.00%  txg_sync         [kernel.kallsyms]   [k] ret_from_fork
+   25.75%     0.68%  txg_sync         [kernel.kallsyms]   [k] txg_sync_thread
+   24.38%     1.42%  init             [kernel.kallsyms]   [k] mutex_lock
+   22.98%     0.09%  init             [kernel.kallsyms]   [k] __mutex_lock_slowpath
+   22.71%     0.71%  init             [kernel.kallsyms]   [k] mutex_optimistic_spin
+   21.74%     0.03%  init             [kernel.kallsyms]   [k] __cv_wait
+   21.40%     0.83%  init             [kernel.kallsyms]   [k] cv_wait_common
+   18.52%     0.00%  txg_quiesce      [kernel.kallsyms]   [k] thread_generic_wrapper
+   18.52%     0.00%  txg_quiesce      [kernel.kallsyms]   [k] kthread
+   18.52%     0.00%  txg_quiesce      [kernel.kallsyms]   [k] ret_from_fork
+   17.58%     0.35%  txg_sync         [kernel.kallsyms]   [k] spa_sync
+   16.88%     3.09%  txg_quiesce      [kernel.kallsyms]   [k] txg_quiesce_thread
+   12.22%    12.22%  init             [kernel.kallsyms]   [k] mutex_spin_on_owner.isra.3
+    9.97%     1.49%  init             [kernel.kallsyms]   [k] __schedule
+    9.74%     0.23%  init             [kernel.kallsyms]   [k] schedule
+    9.39%     9.39%  init             [kernel.kallsyms]   [k] osq_lock
+    8.92%     0.11%  txg_quiesce      [kernel.kallsyms]   [k] __cv_broadcast
+    8.89%     0.03%  txg_quiesce      [kernel.kallsyms]   [k] __wake_up
+    8.54%     0.22%  txg_quiesce      [kernel.kallsyms]   [k] __wake_up_common
+    8.31%     0.12%  txg_quiesce      [kernel.kallsyms]   [k] autoremove_wake_function
+    8.20%     0.02%  txg_quiesce      [kernel.kallsyms]   [k] default_wake_function
+    8.16%     0.64%  txg_quiesce      [kernel.kallsyms]   [k] try_to_wake_up
+    7.63%     2.02%  txg_sync         [kernel.kallsyms]   [k] mutex_lock
+    7.58%     0.12%  swapper          [kernel.kallsyms]   [k] cpu_startup_entry
+    5.61%     0.01%  txg_sync         [kernel.kallsyms]   [k] __mutex_lock_slowpath
+    5.58%     0.18%  txg_sync         [kernel.kallsyms]   [k] mutex_optimistic_spin
+    5.49%     0.02%  txg_sync         [kernel.kallsyms]   [k] zap_contains
+    5.46%     0.03%  txg_sync         [kernel.kallsyms]   [k] zap_lookup_norm
+    5.28%     0.06%  txg_quiesce      [kernel.kallsyms]   [k] ttwu_do_activate.constprop.90
+    4.76%     0.00%  swapper          [kernel.kallsyms]   [k] start_secondary
+    4.70%     4.70%  txg_sync         [kernel.kallsyms]   [k] osq_lock
+    4.62%     0.22%  txg_quiesce      [kernel.kallsyms]   [k] activate_task
+    4.59%     0.12%  init             [kernel.kallsyms]   [k] deactivate_task
+    4.24%     0.32%  init             [kernel.kallsyms]   [k] dequeue_task_fair
+    4.18%     0.35%  txg_sync         [kernel.kallsyms]   [k] zio_wait
+    4.18%     0.34%  txg_quiesce      [kernel.kallsyms]   [k] enqueue_task_fair
+    3.75%     0.11%  txg_sync         [kernel.kallsyms]   [k] dsl_pool_sync
+    3.68%     0.76%  init             [kernel.kallsyms]   [k] dequeue_entity
+    3.34%     0.98%  txg_quiesce      [kernel.kallsyms]   [k] enqueue_entity
+    3.17%     0.77%  init             [kernel.kallsyms]   [k] pick_next_task_fair
+    3.08%     0.14%  txg_sync         [kernel.kallsyms]   [k] ddt_sync
+    3.01%     1.17%  swapper          [kernel.kallsyms]   [k] hrtimer_start_range_ns
+    2.89%     0.00%  swapper          [kernel.kallsyms]   [k] rest_init
+    2.89%     0.00%  swapper          [kernel.kallsyms]   [k] start_kernel
+    2.89%     0.00%  swapper          [kernel.kallsyms]   [k] x86_64_start_reservations
+    2.89%     0.00%  swapper          [kernel.kallsyms]   [k] x86_64_start_kernel
+    2.70%     0.03%  txg_sync         [kernel.kallsyms]   [k] zap_lockdir
+    2.65%     0.03%  swapper          [kernel.kallsyms]   [k] tick_nohz_idle_enter
+    2.56%     0.13%  swapper          [kernel.kallsyms]   [k] __tick_nohz_idle_enter
+    2.37%     0.03%  txg_sync         [kernel.kallsyms]   [k] dmu_buf_hold
+    2.34%     0.68%  init             [kernel.kallsyms]   [k] __cv_broadcast
+    2.31%     0.02%  txg_sync         [kernel.kallsyms]   [k] zap_lookup_impl
+    2.30%     1.43%  txg_quiesce      [kernel.kallsyms]   [k] mutex_lock
+    2.23%     0.07%  swapper          [kernel.kallsyms]   [k] tick_nohz_idle_exit
+    2.21%     0.01%  txg_quiesce      [kernel.kallsyms]   [k] __cv_wait_sig
+    2.14%     0.04%  txg_sync         [kernel.kallsyms]   [k] dmu_buf_hold_noread
+    2.14%     0.28%  txg_quiesce      [kernel.kallsyms]   [k] cv_wait_common
+    2.00%     0.47%  txg_sync         [kernel.kallsyms]   [k] zio_done
+    1.97%     0.02%  swapper          [kernel.kallsyms]   [k] tick_nohz_restart
+    1.91%     1.82%  txg_sync         [kernel.kallsyms]   [k] mutex_unlock
+    1.78%     0.00%  init             libc-2.19.so        [.] 0xffff80b517d9d710
+    1.77%     0.03%  txg_sync         [kernel.kallsyms]   [k] fzap_lookup
+    1.69%     0.00%  init             libc-2.19.so        [.] 0xffff8065c7dcc710
+    1.68%     0.88%  init             [kernel.kallsyms]   [k] update_cfs_shares
+    1.67%     0.32%  init             [kernel.kallsyms]   [k] set_next_entity
+    1.66%     0.04%  init             [kernel.kallsyms]   [k] __wake_up
+    1.66%     0.00%  init             [unknown]           [k] 0x00005642200e6170
For a higher level overview, try: perf report --sort comm,dso

Samples: 119K of event 'cycles:pp', Event count (approx.): 88912982833
  Children      Self  Command          Shared Object       Symbol                                             ◆
-   44.93%     0.00%  init             [kernel.kallsyms]   [k] zfs_write                                      ▒
   - zfs_write                                                                                                ▒
      - 44.42% dmu_tx_assign                                                                                  ▒
         - 37.58% dmu_tx_wait                                                                                 ▒
            - 36.46% txg_wait_open                                                                            ▒
               - 21.71% __cv_wait                                                                             ▒
                  - 21.39% cv_wait_common                                                                     ▒
                     - 10.38% mutex_lock                                                                      ▒
                        - 10.19% __mutex_lock_slowpath                                                        ▒
                           - 10.10% mutex_optimistic_spin                                                     ▒
                                4.86% mutex_spin_on_owner.isra.3                                              ▒
                                4.79% osq_lock                                                                ▒
                                0.08% osq_unlock                                                              ▒
                              + 0.05% schedule_preempt_disabled                                               ▒
                              + 0.04% apic_timer_interrupt                                                    ▒
                                0.00% ret_from_intr                                                           ▒
                           + 0.04% schedule_preempt_disabled                                                  ▒
                           + 0.01% _raw_spin_lock                                                             ▒
                             0.01% mutex_spin_on_owner.isra.3                                                 ▒
                             0.01% osq_unlock                                                                 ▒
                             0.00% osq_lock                                                                   ▒
                           + 0.00% reschedule_interrupt                                                       ▒
                           + 0.00% apic_timer_interrupt                                                       ▒
                          0.00% mutex_optimistic_spin                                                         ▒
                     - 9.38% schedule                                                                         ▒
                        - 9.23% __schedule                                                                    ▒
                           - 4.55% deactivate_task                                                            ▒
                              - 4.19% dequeue_task_fair                                                       ▒
                                 - 3.61% dequeue_entity                                                       ▒
                                    - 1.54% update_cfs_shares                                                 ▒
                                       - 0.66% update_curr                                                    ▒
                                            0.20% __calc_delta                                                ▒
                                            0.13% update_min_vruntime                                         ▒
                                         0.05% account_entity_enqueue                                         ▒
                                         0.04% account_entity_dequeue                                         ▒
                                         0.02% __calc_delta                                                   ▒
                                         0.02% update_min_vruntime                                            ▒
                                    + 0.73% update_curr                                                       ▒
                                      0.29% account_entity_dequeue                                            ▒
                                      0.22% cpuacct_charge                                                    ▒
                                      0.06% update_min_vruntime                                               ▒
                                      0.03% account_entity_enqueue                                            ▒
                                      0.02% clear_buddies                                                     ▒
                                   0.09% update_cfs_shares                                                    ▒
                                   0.07% account_entity_dequeue                                               ▒
                                   0.04% set_next_buddy                                                       ▒
                                   0.03% update_curr                                                          ▒
                                   0.02% update_min_vruntime                                                  ▒
                                   0.02% clear_buddies                                                        ▒
                                   0.00% __compute_runnable_contrib                                           ▒
                              + 0.18% update_rq_clock.part.80                                                 ▒
                                0.03% dequeue_entity                                                          ▒
                                0.01% sched_clock_cpu                                                         ▒
                                0.01% update_cfs_shares                                                       ▒
                                0.01% set_next_buddy                                                          ▒
                                0.00% hrtick_update                                                           ▒
                           + 3.06% pick_next_task_fair                                                        ▒
                           + 0.08% _raw_spin_lock_irq                                                         ▒
                             0.07% finish_task_switch                                                         ▒
                           + 0.05% apic_timer_interrupt                                                       ▒
                             0.03% pick_next_entity                                                           ▒
                             0.03% put_prev_entity                                                            ▒
                             0.02% check_cfs_rq_runtime                                                       ▒
                             0.02% set_next_entity                                                            ▒
                             0.01% rcu_note_context_switch                                                    ▒
                             0.01% dequeue_task_fair                                                          ▒
                             0.01% update_curr                                                                ▒
                             0.01% update_rq_clock.part.80                                                    ▒
                           + 0.00% pick_next_task_idle                                                        ▒
                           + 0.00% reschedule_interrupt                                                       ▒
                             0.00% __msecs_to_jiffies                                                         ▒
                             0.00% smp_reschedule_interrupt                                                   ▒
                             0.00% ret_from_intr                                                              ▒
                          0.01% rcu_note_context_switch                                                       ▒
                          0.01% _raw_spin_lock_irq                                                            ▒
                          0.01% pick_next_task_fair                                                           ▒
                          0.01% finish_task_switch                                                            ▒
                          0.01% deactivate_task                                                               ▒
                          0.00% pick_next_task_idle                                                           ▒
                     + 0.48% prepare_to_wait_exclusive                                                        ▒
                     + 0.22% mutex_unlock                                                                     ▒
                       0.02% __schedule                                                                       ▒
                     + 0.02% finish_wait                                                                      ▒
                       0.02% _raw_spin_unlock_irqrestore                                                      ▒
                       0.01% __mutex_lock_slowpath                                                            ▒
                       0.01% _cond_resched                                                                    ▒
                       0.00% _raw_spin_lock_irqsave                                                           ▒
                       0.00% __mutex_unlock_slowpath                                                          ▒
                     + 0.00% apic_timer_interrupt                                                             ▒
                    0.18% mutex_unlock                                                                        ▒
                    0.12% schedule                                                                            ▒
                    0.01% prepare_to_wait_exclusive                                                           ▒
                    0.01% finish_wait                                                                         ▒
                    0.01% mutex_lock                                                                          ▒
                  + 0.01% apic_timer_interrupt                                                                ▒
               - 12.11% mutex_lock                                                                            ▒
                  - 11.86% __mutex_lock_slowpath                                                              ▒
                     - 11.73% mutex_optimistic_spin                                                           ▒
                          6.61% mutex_spin_on_owner.isra.3                                                    ▒
                          4.57% osq_lock                                                                      ▒
                          0.10% osq_unlock                                                                    ▒
                        + 0.07% schedule_preempt_disabled                                                     ▒
                        + 0.05% apic_timer_interrupt                                                          ▒
                     + 0.07% schedule_preempt_disabled                                                        ▒
                     + 0.02% _raw_spin_lock                                                                   ▒
                       0.01% osq_lock                                                                         ▒
                       0.01% mutex_spin_on_owner.isra.3                                                       ▒
                       0.00% osq_unlock                                                                       ▒
mailinglists35 commented 7 years ago

and expanded call path for txg_quiesce

-   18.52%     0.00%  txg_quiesce      [kernel.kallsyms]   [k] thread_generic_wrapper                         ▒
   - thread_generic_wrapper                                                                                   ▒
      - 16.88% txg_quiesce_thread                                                                             ▒
         - 8.91% __cv_broadcast                                                                               ▒
            - 8.80% __wake_up                                                                                 ▒
               - 8.53% __wake_up_common                                                                       ◆
                  - 8.31% autoremove_wake_function                                                            ▒
                     - 8.19% default_wake_function                                                            ▒
                        - 8.10% try_to_wake_up                                                                ▒
                           - 5.24% ttwu_do_activate.constprop.90                                              ▒
                              + 4.59% activate_task                                                           ▒
                              - 0.58% ttwu_do_wakeup                                                          ▒
                                 - 0.52% check_preempt_curr                                                   ▒
                                    - 0.36% check_preempt_wakeup                                              ▒
                                       - 0.10% wakeup_preempt_entity.isra.55                                  ▒
                                            __calc_delta                                                      ▒
                                       - 0.04% update_curr                                                    ▒
                                            cpuacct_charge                                                    ▒
                                       - 0.02% resched_curr                                                   ▒
                                          - 0.01% native_smp_send_reschedule                                  ▒
                                             - physflat_send_IPI_mask                                         ▒
                                             - default_send_IPI_mask_sequence_phys                            ▒
                                                - find_next_bit                                               ▒
                                                  _find_next_bit.part.0                                       ▒
                                            0.00% physflat_send_IPI_mask                                      ▒
                                         0.01% __calc_delta                                                   ▒
                                         0.00% set_next_buddy                                                 ▒
                                      0.01% update_curr                                                       ▒
                                      0.01% wakeup_preempt_entity.isra.55                                     ▒
                                      0.00% resched_curr                                                      ▒
                                   0.01% resched_curr                                                         ▒
                                   0.01% check_preempt_wakeup                                                 ▒
                                0.01% enqueue_task_fair                                                       ▒
                                0.01% check_preempt_curr                                                      ▒
                                0.01% update_rq_clock.part.80                                                 ▒
                           - 1.20% select_task_rq_fair                                                        ▒
                              - 0.92% select_idle_sibling                                                     ▒
                                   0.15% idle_cpu                                                             ▒
                                 - 0.11% find_next_bit                                                        ▒
                                      _find_next_bit.part.0                                                   ▒
                                   0.07% __bitmap_intersects                                                  ▒
                                   0.07% _find_next_bit.part.0                                                ▒
                                   0.00% cpumask_next_and                                                     ▒
                                0.06% find_next_bit                                                           ▒
                                0.05% idle_cpu                                                                ▒
                                0.05% __bitmap_intersects                                                     ▒
                           - 0.55% _raw_spin_lock                                                             ▒
                                native_queued_spin_lock_slowpath                                              ▒
                             0.20% _raw_spin_lock_irqsave                                                     ▒
                             0.14% ttwu_stat                                                                  ▒
                             0.05% task_waking_fair                                                           ▒
                             0.04% _raw_spin_unlock_irqrestore                                                ▒
                           - 0.01% ttwu_do_wakeup                                                             ▒
                              - check_preempt_curr                                                            ▒
                                check_preempt_wakeup                                                          ▒
                             0.01% select_idle_sibling                                                        ▒
                             0.01% activate_task                                                              ▒
                             0.00% native_queued_spin_lock_slowpath                                           ▒
                           - 0.00% set_task_cpu                                                               ▒
                              - migrate_task_rq_fair                                                          ▒
                                remove_entity_load_avg                                                        ▒
                          0.02% ttwu_stat                                                                     ▒
                          0.02% _raw_spin_unlock_irqrestore                                                   ▒
                          0.01% task_waking_fair                                                              ▒
                          0.01% ttwu_do_activate.constprop.90                                                 ▒
                          0.01% _raw_spin_lock                                                                ▒
                          0.01% select_task_rq_fair                                                           ▒
                          0.00% _raw_spin_lock_irqsave                                                        ▒
                          0.00% set_task_cpu                                                                  ▒
                       0.00% try_to_wake_up                                                                   ▒
                    0.01% default_wake_function                                                               ▒
               - 0.20% apic_timer_interrupt                                                                   ▒
                  - smp_apic_timer_interrupt                                                                  ▒
                     - 0.20% irq_exit                                                                         ▒
                        - __do_softirq                                                                        ▒
                           - 0.19% run_rebalance_domains                                                      ▒
                              - rebalance_domains                                                             ▒
                                   0.19% update_blocked_averages                                              ▒
                                   0.00% __compute_runnable_contrib                                           ▒
                                 - 0.00% load_balance                                                         ▒
                                      deactivate_task                                                         ▒
                                      dequeue_task_fair                                                       ▒
                                      dequeue_entity                                                          ▒
                             0.00% cpu_needs_another_gp                                                       ▒
                     - 0.00% local_apic_timer_interrupt                                                       ▒
                        - hrtimer_interrupt                                                                   ▒
                           - __hrtimer_run_queues                                                             ▒
                              - 0.00% tick_sched_timer                                                        ▒
                                   tick_sched_handle.isra.14                                                  ▒
                                 - update_process_times                                                       ▒
                                      0.00% rcu_check_callbacks                                               ▒
                                      0.00% _raw_spin_lock                                                    ▒
                              - 0.00% hrtimer_wakeup                                                          ▒
                                   wake_up_process                                                            ▒
                                   try_to_wake_up                                                             ▒
                                   _raw_spin_lock_irqsave                                                     ▒
                 0.05% _raw_spin_lock_irqsave                                                                 ▒
                 0.01% _raw_spin_unlock_irqrestore                                                            ▒
                 0.01% autoremove_wake_function                                                               ▒
               + 0.01% ret_from_intr                                                                          ▒
              0.00% __wake_up_common                                                                          ▒
              0.00% _raw_spin_unlock_irqrestore                                                               ▒
         - 2.20% __cv_wait_sig                                                                                ▒
            - 2.14% cv_wait_common                                                                            ▒
               - 0.88% schedule                                                                               ▒
                  - 0.85% __schedule                                                                          ▒
                     - 0.39% deactivate_task                                                                  ▒
                        - 0.31% dequeue_task_fair                                                             ▒
                           - 0.25% dequeue_entity                                                             ▒
                              - 0.07% update_curr                                                             ▒
                                   0.02% cpuacct_charge                                                       ▒
                                   0.01% update_min_vruntime                                                  ▒
                                0.03% account_entity_dequeue                                                  ▒
                                0.01% update_cfs_shares                                                       ▒
                                0.01% cpuacct_charge                                                          ▒
                                0.00% clear_buddies                                                           ▒
                                0.00% update_min_vruntime                                                     ▒
                             0.02% account_entity_dequeue                                                     ▒
                             0.01% clear_buddies                                                              ▒
                             0.00% update_min_vruntime                                                        ▒
                             0.00% update_curr                                                                ▒
                             0.00% update_cfs_shares                                                          ▒
                        - 0.05% update_rq_clock.part.80                                                       ▒
                           - sched_clock_cpu                                                                  ▒
                              - 0.01% sched_clock                                                             ▒
                                   native_sched_clock                                                         ▒
                                0.00% native_sched_clock                                                      ▒
                          0.00% dequeue_entity                                                                ▒
                          0.00% hrtick_update                                                                 ▒
                          0.00% sched_clock_cpu                                                               ▒
                     + 0.08% pick_next_task_fair                                                              ▒
                       0.06% finish_task_switch                                                               ▒
                     + 0.03% _raw_spin_lock_irq                                                               ▒
                     + 0.03% pick_next_task_idle                                                              ▒
                       0.01% rcu_note_context_switch                                                          ▒
                       0.01% dequeue_task_fair                                                                ▒
                       0.01% __msecs_to_jiffies                                                               ▒
                       0.00% update_rq_clock.part.80                                                          ▒
                       0.00% put_prev_task_fair                                                               ▒
                     + 0.00% ret_from_intr                                                                    ▒
                     + 0.00% apic_timer_interrupt                                                             ▒
                       0.00% update_curr                                                                      ▒
                       0.00% check_cfs_rq_runtime                                                             ▒
                     + 0.00% reschedule_interrupt                                                             ▒
                    0.01% _raw_spin_lock_irq                                                                  ▒
                    0.01% deactivate_task                                                                     ▒
                    0.00% pick_next_task_idle                                                                 ▒
                    0.00% pick_next_task_fair                                                                 ▒
                    0.00% finish_task_switch                                                                  ▒
                    0.00% rcu_note_context_switch                                                             ▒
                  + 0.00% apic_timer_interrupt                                                                ▒
                    0.00% update_rq_clock.part.80                                                             ▒
               + 0.62% mutex_lock                                                                             ▒
               + 0.15% mutex_unlock                                                                           ▒
               + 0.09% prepare_to_wait_exclusive                                                              ▒
               + 0.08% __wake_up                                                                              ▒
               + 0.01% finish_wait                                                                            ▒
                 0.01% _raw_spin_unlock_irqrestore                                                            ▒
                 0.01% __mutex_lock_slowpath                                                                  ▒
                 0.00% __schedule                                                                             ▒
                 0.00% __mutex_unlock_slowpath                                                                ▒
                 0.00% _cond_resched                                                                          ▒
                 0.00% _raw_spin_lock_irqsave                                                                 ▒
                 0.00% __wake_up_common                                                                       ▒
               + 0.00% ret_from_intr                                                                          ▒
              0.04% mutex_unlock                                                                              ▒
              0.01% mutex_lock                                                                                ▒
            + 0.01% apic_timer_interrupt                                                                      ▒
              0.00% schedule                                                                                  ▒
              0.00% prepare_to_wait_exclusive                                                                 ▒
              0.00% __wake_up                                                                                 ▒
              0.00% finish_wait                                                                               ▒
         - 1.57% mutex_lock                                                                                   ▒
            - 0.31% __mutex_lock_slowpath                                                                     ▒
               - 0.30% mutex_optimistic_spin                                                                  ▒
                    0.15% mutex_spin_on_owner.isra.3                                                          ▒
                    0.05% osq_lock                                                                            ▒
                  - 0.01% apic_timer_interrupt                                                                ▒
                     - smp_apic_timer_interrupt                                                               ▒
                       irq_exit                                                                               ▒
                       __do_softirq                                                                           ▒
                       run_rebalance_domains                                                                  ▒
                       rebalance_domains                                                                      ▒
                       update_blocked_averages                                                                ▒
                    0.01% osq_unlock                                                                          ▒
                 0.00% mutex_spin_on_owner.isra.3                                                             ▒
                 0.00% osq_lock                                                                               ▒
                 0.00% osq_unlock                                                                             ▒
               + 0.00% schedule_preempt_disabled                                                              ▒
              0.00% mutex_optimistic_spin                                                                     ▒
richardelling commented 7 years ago

On Nov 29, 2016, at 8:10 AM, mailinglists35 notifications@github.com wrote:

@ironMann https://github.com/ironMann thank you. I have applied pr 5428 and 587 on top of existing 5412+5425. there is no change in cpu consumption. perf record -ag sleep 10 pasted below

Are you sure you’re not bumping up against quota? One way to check is to see how often the txgs close and the time taken. To do this:

echo 20 >> /sys/module/zfs/parameters/zfs_txg_history cat /proc/spl/kstat/zfs/POOLNAME/txgs

The "stime" column shows the sync time in nanoseconds and the "otime" column shows the open time, which is nominally 5 seconds, by default.

If otime is very short, then it might be busy trying to keep quotas under control or perhaps something is calling sync() in a tight loop (or a few other, less common conditions) — richard

PID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2 15023 1 0.98s 0.00s 0K 0K 0K 0K -- - R 1 100% txg_sync 15022 1 0.89s 0.00s 0K 0K 0K 0K -- - R 3 92% txg_quiesce

Samples: 119K of event 'cycles:pp', Event count (approx.): 88912982833 Children Self Command Shared Object Symbol

  • 44.93% 0.00% init [kernel.kallsyms] [k] zfs_write
  • 44.93% 0.00% init [kernel.kallsyms] [k] zpl_write_common_iovec
  • 44.93% 0.00% init [kernel.kallsyms] [k] zpl_write
  • 44.93% 0.00% init [kernel.kallsyms] [k] __vfs_write
  • 44.93% 0.00% init [kernel.kallsyms] [k] vfs_write
  • 44.93% 0.00% init [kernel.kallsyms] [k] sys_write
  • 44.93% 0.00% init [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath
  • 44.42% 1.30% init [kernel.kallsyms] [k] dmu_tx_assign
  • 37.59% 0.19% init [kernel.kallsyms] [k] dmu_tx_wait
  • 36.50% 0.48% init [kernel.kallsyms] [k] txg_wait_open
  • 25.95% 0.00% txg_sync [kernel.kallsyms] [k] thread_generic_wrapper
  • 25.95% 0.00% txg_sync [kernel.kallsyms] [k] kthread
  • 25.95% 0.00% txg_sync [kernel.kallsyms] [k] ret_from_fork
  • 25.75% 0.68% txg_sync [kernel.kallsyms] [k] txg_sync_thread
  • 24.38% 1.42% init [kernel.kallsyms] [k] mutex_lock
  • 22.98% 0.09% init [kernel.kallsyms] [k] __mutex_lock_slowpath
  • 22.71% 0.71% init [kernel.kallsyms] [k] mutex_optimistic_spin
  • 21.74% 0.03% init [kernel.kallsyms] [k] __cv_wait
  • 21.40% 0.83% init [kernel.kallsyms] [k] cv_wait_common
  • 18.52% 0.00% txg_quiesce [kernel.kallsyms] [k] thread_generic_wrapper
  • 18.52% 0.00% txg_quiesce [kernel.kallsyms] [k] kthread
  • 18.52% 0.00% txg_quiesce [kernel.kallsyms] [k] ret_from_fork
  • 17.58% 0.35% txg_sync [kernel.kallsyms] [k] spa_sync
  • 16.88% 3.09% txg_quiesce [kernel.kallsyms] [k] txg_quiesce_thread
  • 12.22% 12.22% init [kernel.kallsyms] [k] mutex_spin_on_owner.isra.3
  • 9.97% 1.49% init [kernel.kallsyms] [k] __schedule
  • 9.74% 0.23% init [kernel.kallsyms] [k] schedule
  • 9.39% 9.39% init [kernel.kallsyms] [k] osq_lock
  • 8.92% 0.11% txg_quiesce [kernel.kallsyms] [k] __cv_broadcast
  • 8.89% 0.03% txg_quiesce [kernel.kallsyms] [k] __wake_up
  • 8.54% 0.22% txg_quiesce [kernel.kallsyms] [k] __wake_up_common
  • 8.31% 0.12% txg_quiesce [kernel.kallsyms] [k] autoremove_wake_function
  • 8.20% 0.02% txg_quiesce [kernel.kallsyms] [k] default_wake_function
  • 8.16% 0.64% txg_quiesce [kernel.kallsyms] [k] try_to_wake_up
  • 7.63% 2.02% txg_sync [kernel.kallsyms] [k] mutex_lock
  • 7.58% 0.12% swapper [kernel.kallsyms] [k] cpu_startup_entry
  • 5.61% 0.01% txg_sync [kernel.kallsyms] [k] __mutex_lock_slowpath
  • 5.58% 0.18% txg_sync [kernel.kallsyms] [k] mutex_optimistic_spin
  • 5.49% 0.02% txg_sync [kernel.kallsyms] [k] zap_contains
  • 5.46% 0.03% txg_sync [kernel.kallsyms] [k] zap_lookup_norm
  • 5.28% 0.06% txg_quiesce [kernel.kallsyms] [k] ttwu_do_activate.constprop.90
  • 4.76% 0.00% swapper [kernel.kallsyms] [k] start_secondary
  • 4.70% 4.70% txg_sync [kernel.kallsyms] [k] osq_lock
  • 4.62% 0.22% txg_quiesce [kernel.kallsyms] [k] activate_task
  • 4.59% 0.12% init [kernel.kallsyms] [k] deactivate_task
  • 4.24% 0.32% init [kernel.kallsyms] [k] dequeue_task_fair
  • 4.18% 0.35% txg_sync [kernel.kallsyms] [k] zio_wait
  • 4.18% 0.34% txg_quiesce [kernel.kallsyms] [k] enqueue_task_fair
  • 3.75% 0.11% txg_sync [kernel.kallsyms] [k] dsl_pool_sync
  • 3.68% 0.76% init [kernel.kallsyms] [k] dequeue_entity
  • 3.34% 0.98% txg_quiesce [kernel.kallsyms] [k] enqueue_entity
  • 3.17% 0.77% init [kernel.kallsyms] [k] pick_next_task_fair
  • 3.08% 0.14% txg_sync [kernel.kallsyms] [k] ddt_sync
  • 3.01% 1.17% swapper [kernel.kallsyms] [k] hrtimer_start_range_ns
  • 2.89% 0.00% swapper [kernel.kallsyms] [k] rest_init
  • 2.89% 0.00% swapper [kernel.kallsyms] [k] start_kernel
  • 2.89% 0.00% swapper [kernel.kallsyms] [k] x86_64_start_reservations
  • 2.89% 0.00% swapper [kernel.kallsyms] [k] x86_64_start_kernel
  • 2.70% 0.03% txg_sync [kernel.kallsyms] [k] zap_lockdir
  • 2.65% 0.03% swapper [kernel.kallsyms] [k] tick_nohz_idle_enter
  • 2.56% 0.13% swapper [kernel.kallsyms] [k] __tick_nohz_idle_enter
  • 2.37% 0.03% txg_sync [kernel.kallsyms] [k] dmu_buf_hold
  • 2.34% 0.68% init [kernel.kallsyms] [k] __cv_broadcast
  • 2.31% 0.02% txg_sync [kernel.kallsyms] [k] zap_lookup_impl
  • 2.30% 1.43% txg_quiesce [kernel.kallsyms] [k] mutex_lock
  • 2.23% 0.07% swapper [kernel.kallsyms] [k] tick_nohz_idle_exit
  • 2.21% 0.01% txg_quiesce [kernel.kallsyms] [k] __cv_wait_sig
  • 2.14% 0.04% txg_sync [kernel.kallsyms] [k] dmu_buf_hold_noread
  • 2.14% 0.28% txg_quiesce [kernel.kallsyms] [k] cv_wait_common
  • 2.00% 0.47% txg_sync [kernel.kallsyms] [k] zio_done
  • 1.97% 0.02% swapper [kernel.kallsyms] [k] tick_nohz_restart
  • 1.91% 1.82% txg_sync [kernel.kallsyms] [k] mutex_unlock
  • 1.78% 0.00% init libc-2.19.so [.] 0xffff80b517d9d710
  • 1.77% 0.03% txg_sync [kernel.kallsyms] [k] fzap_lookup
  • 1.69% 0.00% init libc-2.19.so [.] 0xffff8065c7dcc710
  • 1.68% 0.88% init [kernel.kallsyms] [k] update_cfs_shares
  • 1.67% 0.32% init [kernel.kallsyms] [k] set_next_entity
  • 1.66% 0.04% init [kernel.kallsyms] [k] __wake_up
  • 1.66% 0.00% init [unknown] [k] 0x00005642200e6170 For a higher level overview, try: perf report --sort comm,dso

Samples: 119K of event 'cycles:pp', Event count (approx.): 88912982833 Children Self Command Shared Object Symbol ◆

  • 44.93% 0.00% init [kernel.kallsyms] [k] zfs_write ▒
    • zfs_write ▒
      • 44.42% dmu_tx_assign ▒
        • 37.58% dmu_tx_wait ▒
        • 36.46% txg_wait_open ▒
          • 21.71% __cv_wait ▒
            • 21.39% cv_wait_common ▒
              • 10.38% mutex_lock ▒
              • 10.19% __mutex_lock_slowpath ▒
                • 10.10% mutex_optimistic_spin ▒ 4.86% mutex_spin_on_owner.isra.3 ▒ 4.79% osq_lock ▒ 0.08% osq_unlock ▒
                  • 0.05% schedule_preempt_disabled ▒
                  • 0.04% apic_timer_interrupt ▒ 0.00% ret_from_intr ▒
                • 0.04% schedule_preempt_disabled ▒
                • 0.01% _raw_spin_lock ▒ 0.01% mutex_spin_on_owner.isra.3 ▒ 0.01% osq_unlock ▒ 0.00% osq_lock ▒
                • 0.00% reschedule_interrupt ▒
                • 0.00% apic_timer_interrupt ▒ 0.00% mutex_optimistic_spin ▒
              • 9.38% schedule ▒
              • 9.23% __schedule ▒
                • 4.55% deactivate_task ▒
                  • 4.19% dequeue_task_fair ▒
                    • 3.61% dequeue_entity ▒
                    • 1.54% update_cfs_shares ▒
                      • 0.66% update_curr ▒ 0.20% __calc_delta ▒ 0.13% update_min_vruntime ▒ 0.05% account_entity_enqueue ▒ 0.04% account_entity_dequeue ▒ 0.02% __calc_delta ▒ 0.02% update_min_vruntime ▒
                    • 0.73% update_curr ▒ 0.29% account_entity_dequeue ▒ 0.22% cpuacct_charge ▒ 0.06% update_min_vruntime ▒ 0.03% account_entity_enqueue ▒ 0.02% clear_buddies ▒ 0.09% update_cfs_shares ▒ 0.07% account_entity_dequeue ▒ 0.04% set_next_buddy ▒ 0.03% update_curr ▒ 0.02% update_min_vruntime ▒ 0.02% clear_buddies ▒ 0.00% __compute_runnable_contrib ▒
                  • 0.18% update_rq_clock.part.80 ▒ 0.03% dequeue_entity ▒ 0.01% sched_clock_cpu ▒ 0.01% update_cfs_shares ▒ 0.01% set_next_buddy ▒ 0.00% hrtick_update ▒
                • 3.06% pick_next_task_fair ▒
                • 0.08% _raw_spin_lock_irq ▒ 0.07% finish_task_switch ▒
                • 0.05% apic_timer_interrupt ▒ 0.03% pick_next_entity ▒ 0.03% put_prev_entity ▒ 0.02% check_cfs_rq_runtime ▒ 0.02% set_next_entity ▒ 0.01% rcu_note_context_switch ▒ 0.01% dequeue_task_fair ▒ 0.01% update_curr ▒ 0.01% update_rq_clock.part.80 ▒
                • 0.00% pick_next_task_idle ▒
                • 0.00% reschedule_interrupt ▒ 0.00% __msecs_to_jiffies ▒ 0.00% smp_reschedule_interrupt ▒ 0.00% ret_from_intr ▒ 0.01% rcu_note_context_switch ▒ 0.01% _raw_spin_lock_irq ▒ 0.01% pick_next_task_fair ▒ 0.01% finish_task_switch ▒ 0.01% deactivate_task ▒ 0.00% pick_next_task_idle ▒
              • 0.48% prepare_to_wait_exclusive ▒
              • 0.22% mutex_unlock ▒ 0.02% __schedule ▒
              • 0.02% finish_wait ▒ 0.02% _raw_spin_unlock_irqrestore ▒ 0.01% mutex_lock_slowpath ▒ 0.01% _cond_resched ▒ 0.00% _raw_spin_lock_irqsave ▒ 0.00% mutex_unlock_slowpath ▒
              • 0.00% apic_timer_interrupt ▒ 0.18% mutex_unlock ▒ 0.12% schedule ▒ 0.01% prepare_to_wait_exclusive ▒ 0.01% finish_wait ▒ 0.01% mutex_lock ▒
            • 0.01% apic_timer_interrupt ▒
          • 12.11% mutex_lock ▒
mailinglists35 commented 7 years ago

@richardelling thank you for the hint. indeed otime is vert short. what would you recommend to do next?

this is with pool idle, then with pool hung:

41 0 0x01 20 2240 355956345204422 592620693180043
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
390335741 592524089526612  C     0            0            0            0        0        4999813463   9558         10102        34374
390335742 592529089340075  C     0            0            0            0        0        5000066598   7923         80330        74050
390335743 592534089406673  C     0            0            0            0        0        4999952299   6910         43929        52993
390335744 592539089358972  C     0            0            0            0        0        4999737267   7072         8194         54133
390335745 592544089096239  C     0            4608         0            2        0        4999978265   7553         8449         4681435
390335746 592549089074504  C     0            0            0            0        0        4999966827   5928         11332        57286
390335747 592554089041331  C     0            0            0            0        0        4999991506   5522         12925        55682
390335748 592559089032837  C     0            0            0            0        0        4999964554   6597         14889        51977
390335749 592564088997391  C     0            0            0            0        0        4999972149   9605         19025        74637
390335750 592569088969540  C     0            0            0            0        0        4999905451   7589         19669        70838
390335751 592574088874991  C     0            0            0            0        0        4999905973   7486         21393        75209
390335752 592579088780964  C     0            0            0            0        0        4999947644   7569         20411        73451
390335753 592584088728608  C     0            0            0            0        0        4999932733   7519         21381        73751
390335754 592589088661341  C     0            0            0            0        0        4999927671   6868         18351        66735
390335755 592594088589012  C     0            0            0            0        0        4999959805   8398         21063        74540
390335756 592599088548817  C     0            0            0            0        0        4999940195   7516         21099        74230
390335757 592604088489012  C     0            0            0            0        0        4999936205   9029         21370        73839
390335758 592609088425217  C     0            0            0            0        0        4999946138   8542         21363        73451
390335759 592614088371355  C     0            0            0            0        0        4999930517   9123         21093        72391
390335760 592619088301872  O     0            0            0            0        0        0            0            0            0

41 0 0x01 20 2240 355956345204422 593059687794602
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
394798944 593059687507669  C     0            0            0            0        0        15431        3042         10984        15352
394798945 593059687523100  C     0            0            0            0        0        18348        3407         9931         12753
394798946 593059687541448  C     0            0            0            0        0        14757        2928         11041        11982
394798947 593059687556205  C     0            0            0            0        0        15195        3195         9883         11183
394798948 593059687571400  C     0            0            0            0        0        14734        3007         8949         12170
394798949 593059687586134  C     0            0            0            0        0        13584        3234         9789         12375
394798950 593059687599718  C     0            0            0            0        0        14194        2946         10668        11913
394798951 593059687613912  C     0            0            0            0        0        14842        3078         9861         11320
394798952 593059687628754  C     0            0            0            0        0        14130        3099         9529         11692
394798953 593059687642884  C     0            0            0            0        0        14128        3305         9119         11686
394798954 593059687657012  C     0            0            0            0        0        13594        2922         9818         11811
394798955 593059687670606  C     0            0            0            0        0        13931        3091         9930         12656
394798956 593059687684537  C     0            0            0            0        0        14249        3182         10783        13954
394798957 593059687698786  C     0            0            0            0        0        18395        3444         8221         11817
394798958 593059687717181  C     0            0            0            0        0        15273        2930         7713         11697
394798959 593059687732454  C     0            0            0            0        0        11932        3311         9513         11913
394798960 593059687744386  C     0            0            0            0        0        13893        3217         9898         11206
394798961 593059687758279  S     0            0            0            0        0        14526        3284         8772         0
394798962 593059687772805  W     0            0            0            0        0        16397        3030         0            0
394798963 593059687789202  O     0            0            0            0        0        0            0            0            0
richardelling commented 7 years ago

zfs list -o space is a good way to see if you're near a quota limit.

mailinglists35 commented 7 years ago

I am away from that pool over the weekend so I won't be able to paste the output, but I know I have not setup quota on any dataset. The pool is using about 10% of its capacity. I will ask on lxd github if they setup quota by default on their datasets, but I highly doubt quota has anything to do with this, because if I start the containers one by one I don't experience the hang. I must reiterate what I've said before, that this only happens when I give lxd many simultaneous tasks and at the same time there is little free ram.

mailinglists35 commented 7 years ago

I have also managed to produce similar txg_sync and txg_quiesce cpu load by running many 'sync' commands in parallel, but in that case the pool is still usable and the tasks do not stay constantly at 100% and 90% like in my case.

mailinglists35 commented 7 years ago

@richardelling snip from zfs get quota and zfs list -o space; full output on pastebin if you want (all datasets have value 'none' with source 'default' for quota and refreserv attributes)

# zfs get -r quota pool|grep -v -- zfs-auto-snap
NAME                                                                                 PROPERTY  VALUE  SOURCE
pool/lxd                                                                          quota     none   default
pool/lxd/containers                                                               quota     none   default
pool/lxd/containers/gitlab                                                        quota     none   default

# zfs get refreserv -t filesystem pool/lxd -r
NAME                                                                                         PROPERTY        VALUE      SOURCE
pool/lxd                                                                                  refreservation  none       default
pool/lxd/containers                                                                       refreservation  none       default
pool/lxd/containers/gitlab                                                                refreservation  none       default
pool/lxd/containers/hang                                                                  refreservation  none       default
pool/lxd/containers/ipsec-client                                                          refreservation  none       default

# zfs list -o space -r pool/lxd
NAME                                                                                         AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
pool/lxd                                                                                   875G  30.9G        9K     19K              0      30.9G
pool/lxd/containers                                                                        875G  27.7G         0     19K              0      27.7G
pool/lxd/containers/gitlab                                                                 875G  1010M      482M    528M              0          0
pool/lxd/containers/hang                                                                   875G    47K         0     47K              0          0
pool/lxd/containers/ipsec-client                                                           875G   582M      477M    105M              0          0

# zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
pool  1008G   101G   907G         -     6%    10%  1.00x  ONLINE  -
cwedgwood commented 6 years ago

@behlendorf @richardelling

so i hit this today with fairly recent zfs and kernel (4.9.20+) txg_sync & txg_quiesce were pegged at 100% cpu as were some qemu processes that were (transiently) stuck in D and unable to terminate

i ended up removing l2arc ssd's from the pool and everything came unstuck...

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.