Closed: mailinglists35 closed this issue 3 years ago.
zpool get all | grep -v default
NAME PROPERTY VALUE SOURCE
pool size 1008G -
pool capacity 9% -
pool health ONLINE -
pool guid 7367461532570993000 -
pool failmode continue local
pool dedupratio 1.00x -
pool free 911G -
pool allocated 96.7G -
pool readonly off -
pool comment local
pool expandsize - -
pool freeing 0 -
pool fragmentation 5% -
pool leaked 0 -
pool feature@async_destroy enabled local
pool feature@empty_bpobj active local
pool feature@lz4_compress active local
pool feature@spacemap_histogram active local
pool feature@enabled_txg active local
pool feature@hole_birth active local
pool feature@extensible_dataset enabled local
pool feature@embedded_data active local
pool feature@bookmarks enabled local
pool feature@filesystem_limits enabled local
pool feature@large_blocks enabled local
pool feature@large_dnode disabled local
pool feature@sha512 disabled local
pool feature@skein disabled local
pool feature@edonr disabled local
pool feature@userobj_accounting disabled local
pool: pool
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(5) for details.
scan: scrub repaired 0 in 0h6m with 0 errors on Mon Nov 7 16:24:43 2016
config:
NAME STATE READ WRITE CKSUM
pool ONLINE 0 0 0
scsi-3600508e000000000aa5b6d24a4b4ad04-part8 ONLINE 0 0 0
errors: No known data errors
On the pool I have 600 small containers running https://github.com/lxc/lxd master (each container is a zfs clone of a dataset); it looks like this:
zfs get origin
[...]
pool/lxd/containers/terminal99 origin pool/lxd/images/d0e4b74306ae904d45efc81c972deee4b862c5af6d753ed99cfb98fee95a5fba@readonly -
[...600 identical from 1 to 600...]
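For context, this is roughly how each container filesystem relates to its image (a hedged sketch, not LXD's actual code path; dataset names taken from the zfs get origin output above): one @readonly snapshot per image, cloned once per container.
# Sketch only: LXD manages these datasets itself. One snapshot per image,
# then one cheap copy-on-write clone per container.
zfs snapshot pool/lxd/images/d0e4b74306ae904d45efc81c972deee4b862c5af6d753ed99cfb98fee95a5fba@readonly
zfs clone \
    pool/lxd/images/d0e4b74306ae904d45efc81c972deee4b862c5af6d753ed99cfb98fee95a5fba@readonly \
    pool/lxd/containers/terminal99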
Additionally, the zfs-auto-snapshot script is active on the system.
Please let me know if you need any other information from the system.
@mailinglists35 this could be explained by a process, or many processes, repeatedly fetching the pool statistics, e.g. zpool status 0.01 or some such. The high CPU usage of vdev_get_stats_ex_impl is what makes me suspect something like this.
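For example, a hypothetical stray monitoring loop like the following (not something known to be running on this system) would hit that path continuously:
# Hypothetical reproducer: each 'zpool status' invocation walks the vdev
# tree and collects per-vdev stats (the vdev_get_stats_ex_impl path).
while true; do
    zpool status >/dev/null
done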
@behlendorf zpool status 0.001 does allow me to use the pool and I don't see txg_sync spinning at 100% (in fact I don't see it among the top CPU consumers at all), whereas in the case above I can't use the pool (file I/O hangs) and txg_sync stays at 100%.
Here is perf top -p $(pidof txg_sync) while zpool status 0.001 is running:
Overhead Shared Object Symbol
65.01% [kernel] [k] ddt_get_dedup_histogram
26.45% [kernel] [k] zio_done
7.60% [kernel] [k] mutex_lock
0.81% [kernel] [k] ioremap_page_range
0.12% [kernel] [k] nmi_cpu_backtrace
0.01% [kernel] [k] native_write_msr_safe
Instead, zpool is the top CPU consumer here:
PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD 1/1
28770 0.18s 0.04s 0K 0K 0K 0K -- - S 23% zpool
@behlendorf to rule out the case of someone repeatedly fetching pool statistics, would it be simple to add rate limiting to statistics fetching, something like one per second? In that case, should the pool remain usable even if someone repeatedly attempts to fetch statistics?
Update: I noticed this happens if I launch many (~200) LXD containers at once. There is enough free and cached memory. The containers are running Debian trusty with no daemons (just init and dhclient).
I have seen this before in my profiling but was not paying much attention to it.
@mailinglists35 Can you collect a call trace for ddt_get_dedup_histogram() (perf top -ag)?
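For reference, a call-graph capture can also be taken non-interactively (standard perf usage, nothing specific to this issue):
# Record call graphs system-wide for 10 seconds, then inspect the
# callers of ddt_get_dedup_histogram() in the expandable tree view.
perf record -ag -- sleep 10
perf report -g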
This is one possibility:
txg_sync()
spa_sync()
spa_update_dspace()
ddt_get_dedup_dspace()
ddt_get_dedup_stats()
ddt_get_dedup_histogram()
A call trace would be very helpful. Another possibility, which also makes sense, is the following; I think it may be more likely because I see vdev_get_stats_ex_impl consuming a lot of time in your perf results. Stalling the txg sync thread would also lead to the hangs you're describing.
txg_sync_thread()
vdev_get_stats()
vdev_get_stats_ex()
vdev_get_stats_ex_impl()
Edit: could you apply the #5412 PR and see if it resolves the issue?
Here is a small patch for ddt_get_dedup_dspace(). The method goes through 168 KiB of DDT histograms on each invocation. The return value of ddt_get_dedup_dspace() can be memoized if there were no changes in ddt_histogram_cache.
diff --git a/include/sys/spa_impl.h b/include/sys/spa_impl.h
index 88bde98..89ca524 100644
--- a/include/sys/spa_impl.h
+++ b/include/sys/spa_impl.h
@@ -239,6 +239,7 @@ struct spa {
uint64_t spa_autoexpand; /* lun expansion on/off */
ddt_t *spa_ddt[ZIO_CHECKSUM_FUNCTIONS]; /* in-core DDTs */
uint64_t spa_ddt_stat_object; /* DDT statistics */
+ uint64_t spa_dedup_dspace; /* Memoize get_dedup_dspace() */
uint64_t spa_dedup_ditto; /* dedup ditto threshold */
uint64_t spa_dedup_checksum; /* default dedup checksum */
uint64_t spa_dspace; /* dspace in normal class */
diff --git a/module/zfs/ddt.c b/module/zfs/ddt.c
index 09a3536..6335e00 100644
--- a/module/zfs/ddt.c
+++ b/module/zfs/ddt.c
@@ -528,10 +528,17 @@ ddt_get_dedup_stats(spa_t *spa, ddt_stat_t *dds_total)
uint64_t
ddt_get_dedup_dspace(spa_t *spa)
{
- ddt_stat_t dds_total = { 0 };
+ ddt_stat_t dds_total;
+
+ if (spa->spa_dedup_dspace != ~0ULL)
+ return (spa->spa_dedup_dspace);
+
+ /* Calculate and memoize the stats */
+ bzero(&dds_total, sizeof (ddt_stat_t));
ddt_get_dedup_stats(spa, &dds_total);
- return (dds_total.dds_ref_dsize - dds_total.dds_dsize);
+ spa->spa_dedup_dspace = dds_total.dds_ref_dsize - dds_total.dds_dsize;
+ return (spa->spa_dedup_dspace);
}
uint64_t
@@ -915,6 +922,7 @@ ddt_load(spa_t *spa)
*/
bcopy(ddt->ddt_histogram, &ddt->ddt_histogram_cache,
sizeof (ddt->ddt_histogram));
+ spa->spa_dedup_dspace = ~0ULL;
}
return (0);
@@ -1182,6 +1190,7 @@ ddt_sync_table(ddt_t *ddt, dmu_tx_t *tx, uint64_t txg)
bcopy(ddt->ddt_histogram, &ddt->ddt_histogram_cache,
sizeof (ddt->ddt_histogram));
+ spa->spa_dedup_dspace = ~0ULL;
}
void
diff --git a/module/zfs/spa_misc.c b/module/zfs/spa_misc.c
index 6ec0521..1857b9a 100644
--- a/module/zfs/spa_misc.c
+++ b/module/zfs/spa_misc.c
@@ -635,6 +635,9 @@ spa_add(const char *name, nvlist_t *config, const char *altroot)
spa->spa_min_ashift = INT_MAX;
spa->spa_max_ashift = 0;
+ /* Reset memoized value */
+ spa->spa_dedup_dspace = ~0ULL;
+
/*
* As a pool is being created, treat all features as disabled by
* setting SPA_FEATURE_DISABLED for all entries in the feature
Thank you for the quick replies. I am away from that system over the weekend and will test the patches on Monday.
I've opened PR #5412 with a cleaner version of the same change. Please use that when testing. You'll also want to make sure that zfs_txg_history=0; this is the default value.
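If unsure, the parameter can be checked and set at runtime; a minimal sketch, assuming the sysfs module-parameter path used elsewhere in this thread:
# 0 disables per-txg history collection (the default).
cat /sys/module/zfs/parameters/zfs_txg_history
echo 0 > /sys/module/zfs/parameters/zfs_txg_history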
@behlendorf I have applied PR #5412 but there is no change. The perf -ag output shows the call path suggested by @ironMann. @ironMann, I will try your patch now (the output below is with just PR #5412 applied).
Samples: 115K of event 'cycles:pp', Event count (approx.): 85749218893
Children Self Command Shared Object Symbol ◆
+ 36.18% 0.00% txg_sync [kernel.kallsyms] [k] thread_generic_wrapper ▒
+ 36.18% 0.00% txg_sync [kernel.kallsyms] [k] kthread ▒
+ 36.18% 0.00% txg_sync [kernel.kallsyms] [k] ret_from_fork ▒
+ 36.07% 0.41% txg_sync [kernel.kallsyms] [k] txg_sync_thread ▒
+ 33.93% 0.29% txg_sync [kernel.kallsyms] [k] spa_sync ▒
+ 24.88% 0.00% lxd [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath ▒
+ 24.31% 0.01% txg_sync [kernel.kallsyms] [k] spa_update_dspace ▒
+ 24.28% 0.02% txg_sync [kernel.kallsyms] [k] ddt_get_dedup_dspace ▒
+ 24.24% 0.02% txg_sync [kernel.kallsyms] [k] ddt_get_dedup_stats ▒
- 23.15% 23.15% txg_sync [kernel.kallsyms] [k] ddt_get_dedup_histogram ▒
ret_from_fork ▒
kthread ▒
thread_generic_wrapper ▒
txg_sync_thread ▒
spa_sync ▒
spa_update_dspace ▒
- ddt_get_dedup_dspace ▒
- 23.15% ddt_get_dedup_stats ▒
ddt_get_dedup_histogram ▒
0.00% ddt_get_dedup_histogram ▒
Samples: 115K of event 'cycles:pp', Event count (approx.): 85749218893
Children Self Command Shared Object Symbol ◆
- 36.18% 0.00% txg_sync [kernel.kallsyms] [k] thread_generic_wrapper ▒
- thread_generic_wrapper ▒
- 36.07% txg_sync_thread ▒
- 33.92% spa_sync ▒
- 24.31% spa_update_dspace ▒
- 24.28% ddt_get_dedup_dspace ▒
- 24.24% ddt_get_dedup_stats ▒
23.15% ddt_get_dedup_histogram ▒
0.63% ddt_histogram_stat
--
Samples: 115K of event 'cycles:pp', Event count (approx.): 85749218893
Children Self Command Shared Object Symbol ◆
+ 36.18% 0.00% txg_sync [kernel.kallsyms] [k] thread_generic_wrapper ▒
- 36.18% 0.00% txg_sync [kernel.kallsyms] [k] kthread ▒
kthread ▒
- thread_generic_wrapper ▒
- 36.07% txg_sync_thread ▒
- 33.92% spa_sync ▒
- 24.31% spa_update_dspace ▒
- 24.28% ddt_get_dedup_dspace ▒
- 24.24% ddt_get_dedup_stats ▒
23.15% ddt_get_dedup_histogram ▒
Samples: 115K of event 'cycles:pp', Event count (approx.): 85749218893
Children Self Command Shared Object Symbol ◆
+ 36.18% 0.00% txg_sync [kernel.kallsyms] [k] thread_generic_wrapper ▒
+ 36.18% 0.00% txg_sync [kernel.kallsyms] [k] kthread ▒
- 36.18% 0.00% txg_sync [kernel.kallsyms] [k] ret_from_fork ▒
ret_from_fork ▒
kthread ▒
- thread_generic_wrapper ▒
- 36.07% txg_sync_thread ▒
- 33.92% spa_sync ▒
- 24.31% spa_update_dspace ▒
- 24.28% ddt_get_dedup_dspace ▒
- 24.24% ddt_get_dedup_stats ▒
23.15% ddt_get_dedup_histogram ▒
Samples: 115K of event 'cycles:pp', Event count (approx.): 85749218893
Children Self Command Shared Object Symbol ◆
+ 36.18% 0.00% txg_sync [kernel.kallsyms] [k] thread_generic_wrapper ▒
+ 36.18% 0.00% txg_sync [kernel.kallsyms] [k] kthread ▒
+ 36.18% 0.00% txg_sync [kernel.kallsyms] [k] ret_from_fork ▒
+ 36.07% 0.41% txg_sync [kernel.kallsyms] [k] txg_sync_thread ▒
+ 33.93% 0.29% txg_sync [kernel.kallsyms] [k] spa_sync ▒
+ 24.88% 0.00% lxd [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath ▒
+ 24.31% 0.01% txg_sync [kernel.kallsyms] [k] spa_update_dspace ▒
+ 24.28% 0.02% txg_sync [kernel.kallsyms] [k] ddt_get_dedup_dspace ▒
+ 24.24% 0.02% txg_sync [kernel.kallsyms] [k] ddt_get_dedup_stats ▒
+ 23.15% 23.15% txg_sync [kernel.kallsyms] [k] ddt_get_dedup_histogram ▒
+ 20.93% 0.00% lxd libc-2.23.so [.] fchownat ▒
+ 20.26% 0.00% lxd [kernel.kallsyms] [k] zfs_setattr ▒
+ 20.26% 0.00% lxd [kernel.kallsyms] [k] zpl_setattr ▒
+ 20.26% 0.00% lxd [kernel.kallsyms] [k] notify_change ▒
+ 20.26% 0.00% lxd [kernel.kallsyms] [k] chown_common ▒
+ 20.26% 0.00% lxd [kernel.kallsyms] [k] sys_fchownat ▒
+ 20.24% 0.40% lxd [kernel.kallsyms] [k] dmu_tx_assign ▒
+ 19.23% 0.09% lxd [kernel.kallsyms] [k] dmu_tx_wait ▒
+ 18.57% 0.11% lxd [kernel.kallsyms] [k] txg_wait_open ▒
+ 17.27% 0.00% init [kernel.kallsyms] [k] zfs_write ▒
+ 17.27% 0.00% init [kernel.kallsyms] [k] zpl_write_common_iovec ▒
+ 17.27% 0.00% init [kernel.kallsyms] [k] zpl_write ▒
+ 17.27% 0.00% init [kernel.kallsyms] [k] __vfs_write ▒
+ 17.27% 0.00% init [kernel.kallsyms] [k] vfs_write ▒
+ 17.27% 0.00% init [kernel.kallsyms] [k] sys_write ▒
+ 17.27% 0.00% init [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath ▒
+ 17.08% 0.39% init [kernel.kallsyms] [k] dmu_tx_assign ▒
+ 14.70% 0.08% init [kernel.kallsyms] [k] dmu_tx_wait ▒
+ 14.22% 0.07% init [kernel.kallsyms] [k] txg_wait_open ▒
+ 10.87% 0.01% lxd [kernel.kallsyms] [k] __cv_wait ▒
+ 10.76% 0.88% lxd [kernel.kallsyms] [k] mutex_lock ▒
+ 10.74% 0.47% lxd [kernel.kallsyms] [k] cv_wait_common ▒
+ 9.89% 0.04% lxd [kernel.kallsyms] [k] __mutex_lock_slowpath ▒
+ 9.80% 0.35% lxd [kernel.kallsyms] [k] mutex_optimistic_spin ▒
+ 9.77% 0.01% init [kernel.kallsyms] [k] __cv_wait ▒
+ 9.70% 0.28% init [kernel.kallsyms] [k] cv_wait_common ▒
+ 8.13% 0.57% init [kernel.kallsyms] [k] mutex_lock ▒
+ 7.57% 0.03% init [kernel.kallsyms] [k] __mutex_lock_slowpath ▒
+ 7.46% 0.28% init [kernel.kallsyms] [k] mutex_optimistic_spin ▒
+ 7.41% 0.00% txg_quiesce [kernel.kallsyms] [k] thread_generic_wrapper ▒
+ 7.41% 0.00% txg_quiesce [kernel.kallsyms] [k] kthread ▒
+ 7.41% 0.00% txg_quiesce [kernel.kallsyms] [k] ret_from_fork ▒
+ 7.14% 0.28% txg_quiesce [kernel.kallsyms] [k] txg_quiesce_thread ▒
+ 5.58% 5.58% lxd [kernel.kallsyms] [k] mutex_spin_on_owner.isra.3 ▒
+ 5.09% 0.70% lxd [kernel.kallsyms] [k] __schedule ▒
+ 4.80% 0.07% lxd [kernel.kallsyms] [k] schedule ▒
+ 4.68% 0.02% txg_quiesce [kernel.kallsyms] [k] __wake_up ▒
+ 4.61% 0.02% txg_quiesce [kernel.kallsyms] [k] __cv_broadcast ▒
+ 4.58% 0.79% init [kernel.kallsyms] [k] __schedule ▒
+ 4.53% 0.14% txg_quiesce [kernel.kallsyms] [k] __wake_up_common ▒
+ 4.42% 0.08% init [kernel.kallsyms] [k] schedule ▒
+ 4.39% 0.08% txg_quiesce [kernel.kallsyms] [k] autoremove_wake_function ▒
+ 4.39% 0.00% swapper [kernel.kallsyms] [k] rest_init ▒
+ 4.39% 0.00% swapper [kernel.kallsyms] [k] start_kernel ▒
+ 4.39% 0.00% swapper [kernel.kallsyms] [k] x86_64_start_reservations ▒
+ 4.39% 0.00% swapper [kernel.kallsyms] [k] x86_64_start_kernel ▒
+ 4.35% 0.11% swapper [kernel.kallsyms] [k] cpu_startup_entry ▒
+ 4.32% 0.00% lxd libpthread-2.23.so [.] __GI___libc_open ▒
+ 4.30% 0.01% txg_quiesce [kernel.kallsyms] [k] default_wake_function ▒
@ironMann it will take me another week to learn how to apply your patch. I finally managed to apply PR #5412 with git apply --check / git apply "downloaded file <URL of behlendorf zfs fork with issue 5400>.patch", but for yours I tried copy/pasting the code as file.patch and running patch -p1 < /path/to/file.patch inside the zfs-dkms src dir, and it fails with the output below even after reverting PR #5412 (git apply fails as well):
# patch --dry-run -p1 < /usr/local/src/zfs_ddt.patch
checking file include/sys/spa_impl.h
Hunk #1 FAILED at 239.
1 out of 1 hunk FAILED
checking file module/zfs/ddt.c
Hunk #1 FAILED at 528.
Hunk #2 FAILED at 915.
Hunk #3 FAILED at 1182.
3 out of 3 hunks FAILED
checking file module/zfs/spa_misc.c
Hunk #1 FAILED at 635.
1 out of 1 hunk FAILED
@mailinglists35 check out a clean copy of rc2 and git apply should finish without errors.
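A clean sequence would look roughly like this (a sketch; the tag name is inferred from the rc2 tarball used below):
# Fresh checkout of the rc2 tag, then a dry run before applying for real.
git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git checkout zfs-0.7.0-rc2
git apply --check /path/to/file.patch   # prints nothing on success
git apply /path/to/file.patch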
@ironMann these are the steps I'm doing:
select the text from your comment using Google Chrome on Windows, and paste it to pastebin as http://pastebin.com/raw/W7znZ5X7
open PuTTY, connect to the server
vi patch.txt, paste the selected text
wget --content-disposition http://pastebin.com/raw/W7znZ5X7
dos2unix W7znZ5X7
wget --content-disposition https://github.com/zfsonlinux/zfs/releases/download/zfs-0.7.0-rc2/zfs-0.7.0-rc2.tar.gz
tar -xzvf zfs-0.7.0-rc2.tar.gz
cd zfs-0.7.0
# git apply --check ../patch.txt
error: patch failed: include/sys/spa_impl.h:239
error: include/sys/spa_impl.h: patch does not apply
error: patch failed: module/zfs/ddt.c:528
error: module/zfs/ddt.c: patch does not apply
error: patch failed: module/zfs/spa_misc.c:635
error: module/zfs/spa_misc.c: patch does not apply
# git apply --check ../W7znZ5X7
fatal: corrupt patch at line 66
Oddly, diff reports the two files as nearly identical, with the edited one having a newline at end of file after I pasted:
# diff patch.txt W7znZ5X7
66,67c66
< * setting SPA_FEATURE_DISABLED for all entries in the feature
<
---
> * setting SPA_FEATURE_DISABLED for all entries in the feature
\ No newline at end of file
What am I doing wrong?
@mailinglists35 Not sure what the cause is. I opened PR #5425 with the patch.
Thank you, it applied cleanly. Running now with both PR #5425 and PR #5412; it still hangs. New perf report -g output:
Samples: 238K of event 'cycles:pp', Event count (approx.): 177373524530
Children Self Command Shared Object Symbol ◆
+ 43.81% 0.00% init [kernel.kallsyms] [k] zfs_write ▒
+ 43.81% 0.00% init [kernel.kallsyms] [k] zpl_write_common_iovec ▒
+ 43.81% 0.00% init [kernel.kallsyms] [k] zpl_write ▒
+ 43.81% 0.00% init [kernel.kallsyms] [k] __vfs_write ▒
+ 43.81% 0.00% init [kernel.kallsyms] [k] vfs_write ▒
+ 43.81% 0.00% init [kernel.kallsyms] [k] sys_write ▒
+ 43.81% 0.00% init [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath ▒
+ 43.26% 0.92% init [kernel.kallsyms] [k] dmu_tx_assign ▒
+ 36.19% 0.14% init [kernel.kallsyms] [k] dmu_tx_wait ▒
+ 34.95% 0.22% init [kernel.kallsyms] [k] txg_wait_open ▒
+ 22.41% 0.02% init [kernel.kallsyms] [k] __cv_wait ▒
+ 22.34% 1.54% init [kernel.kallsyms] [k] mutex_lock ▒
+ 22.16% 0.69% init [kernel.kallsyms] [k] cv_wait_common ▒
+ 20.81% 0.09% init [kernel.kallsyms] [k] __mutex_lock_slowpath ▒
+ 20.51% 0.83% init [kernel.kallsyms] [k] mutex_optimistic_spin ▒
+ 19.88% 0.00% txg_sync [kernel.kallsyms] [k] thread_generic_wrapper ▒
+ 19.88% 0.00% txg_sync [kernel.kallsyms] [k] kthread ▒
+ 19.88% 0.00% txg_sync [kernel.kallsyms] [k] ret_from_fork ▒
+ 19.70% 0.60% txg_sync [kernel.kallsyms] [k] txg_sync_thread ▒
+ 18.28% 0.00% txg_quiesce [kernel.kallsyms] [k] thread_generic_wrapper ▒
+ 18.28% 0.00% txg_quiesce [kernel.kallsyms] [k] kthread ▒
+ 18.28% 0.00% txg_quiesce [kernel.kallsyms] [k] ret_from_fork ▒
+ 16.56% 1.43% txg_quiesce [kernel.kallsyms] [k] txg_quiesce_thread ▒
+ 14.35% 0.33% txg_sync [kernel.kallsyms] [k] spa_sync ▒
+ 11.42% 11.42% init [kernel.kallsyms] [k] mutex_spin_on_owner.isra.3 ▒
+ 10.75% 1.69% init [kernel.kallsyms] [k] __schedule ▒
+ 10.33% 0.09% init [kernel.kallsyms] [k] schedule ▒
+ 8.93% 0.00% kswapd0 [kernel.kallsyms] [k] kthread ▒
+ 8.93% 0.00% kswapd0 [kernel.kallsyms] [k] ret_from_fork ▒
+ 8.91% 0.01% kswapd0 [kernel.kallsyms] [k] kswapd ▒
+ 8.84% 0.14% txg_quiesce [kernel.kallsyms] [k] __cv_broadcast ▒
+ 8.79% 0.02% txg_quiesce [kernel.kallsyms] [k] __wake_up ▒
+ 8.54% 0.21% txg_quiesce [kernel.kallsyms] [k] __wake_up_common ▒
+ 8.33% 0.11% txg_quiesce [kernel.kallsyms] [k] autoremove_wake_function ▒
+ 8.22% 0.01% txg_quiesce [kernel.kallsyms] [k] default_wake_function ▒
+ 8.18% 0.64% txg_quiesce [kernel.kallsyms] [k] try_to_wake_up ▒
+ 8.11% 0.20% kswapd0 [kernel.kallsyms] [k] shrink_zone ▒
+ 7.77% 7.77% init [kernel.kallsyms] [k] osq_lock ▒
+ 7.49% 0.12% swapper [kernel.kallsyms] [k] cpu_startup_entry ▒
+ 5.25% 0.06% txg_quiesce [kernel.kallsyms] [k] ttwu_do_activate.constprop.90 ▒
+ 4.92% 2.06% kswapd0 [kernel.kallsyms] [k] shrink_slab.part.40 ▒
+ 4.90% 0.11% init [kernel.kallsyms] [k] deactivate_task ▒
+ 4.86% 1.67% txg_sync [kernel.kallsyms] [k] mutex_lock ▒
+ 4.59% 0.22% txg_quiesce [kernel.kallsyms] [k] activate_task ▒
+ 4.55% 0.03% txg_sync [kernel.kallsyms] [k] zap_contains ▒
+ 4.53% 0.41% init [kernel.kallsyms] [k] dequeue_task_fair ▒
+ 4.51% 0.03% txg_sync [kernel.kallsyms] [k] zap_lookup_norm ▒
+ 4.50% 0.00% swapper [kernel.kallsyms] [k] start_secondary ▒
+ 4.16% 0.36% txg_quiesce [kernel.kallsyms] [k] enqueue_task_fair ▒
+ 3.86% 0.83% init [kernel.kallsyms] [k] dequeue_entity ▒
+ 3.36% 0.82% init [kernel.kallsyms] [k] pick_next_task_fair ▒
+ 3.32% 0.23% txg_sync [kernel.kallsyms] [k] zio_wait ▒
+ 3.30% 0.97% txg_quiesce [kernel.kallsyms] [k] enqueue_entity ▒
+ 3.19% 0.01% txg_sync [kernel.kallsyms] [k] __mutex_lock_slowpath ▒
+ 3.17% 0.13% txg_sync [kernel.kallsyms] [k] mutex_optimistic_spin ▒
+ 3.06% 0.00% swapper [kernel.kallsyms] [k] rest_init ▒
+ 3.06% 0.00% swapper [kernel.kallsyms] [k] start_kernel ▒
+ 3.06% 0.00% swapper [kernel.kallsyms] [k] x86_64_start_reservations ▒
+ 3.06% 0.00% swapper [kernel.kallsyms] [k] x86_64_start_kernel ▒
Samples: 238K of event 'cycles:pp', Event count (approx.): 177373524530
Children Self Command Shared Object Symbol ◆
- 43.81% 0.00% init [kernel.kallsyms] [k] zfs_write ▒
- zfs_write ▒
- 43.26% dmu_tx_assign ▒
+ 36.18% dmu_tx_wait ▒
+ 1.69% txg_hold_open ▒
+ 1.43% dsl_dir_tempreserve_space ▒
+ 0.48% mutex_lock ▒
0.42% mutex_unlock ▒
+ 0.41% txg_rele_to_sync ▒
+ 0.39% dsl_pool_need_dirty_delay ▒
+ 0.36% dsl_dataset_prev_snap_txg ▒
0.36% _raw_spin_lock ▒
0.17% __cv_broadcast ▒
+ 0.15% txg_rele_to_quiesce ▒
0.08% spa_last_synced_txg ▒
0.07% arc_tempreserve_space ▒
0.04% _cond_resched ▒
0.03% spl_kmem_alloc ▒
0.02% spa_tx_assign_add_nsecs ▒
+ 0.02% apic_timer_interrupt ▒
0.02% getrawmonotonic64 ▒
0.02% txg_wait_open ▒
0.01% spa_get_asize ▒
0.01% spa_suspended ▒
0.01% dsl_dir_tempreserve_clear ▒
0.37% mutex_unlock ▒
0.03% spa_suspended ▒
0.02% mutex_lock ▒
0.02% dsl_dir_tempreserve_space ▒
0.02% _raw_spin_lock ▒
0.02% dsl_dataset_prev_snap_txg ▒
0.02% dsl_pool_need_dirty_delay ▒
+ 0.01% apic_timer_interrupt ▒
0.01% spa_get_asize ▒
0.01% txg_hold_open ▒
0.01% txg_rele_to_quiesce ▒
0.01% dmu_tx_wait ▒
0.00% __cv_broadcast ▒
0.00% txg_rele_to_sync ▒
- 43.26% 0.92% init [kernel.kallsyms] [k] dmu_tx_assign ▒
- 42.34% dmu_tx_assign ▒
- 36.18% dmu_tx_wait ▒
- 34.94% txg_wait_open ▒
- 22.39% __cv_wait ▒
- 22.12% cv_wait_common ▒
- 10.28% mutex_lock ▒
- 10.07% __mutex_lock_slowpath ▒
- 9.97% mutex_optimistic_spin ▒
5.30% mutex_spin_on_owner.isra.3 ▒
4.13% osq_lock ▒
0.09% osq_unlock ▒
+ 0.08% apic_timer_interrupt ▒
+ 0.05% schedule_preempt_disabled ▒
+ 0.00% ret_from_intr ▒
0.00% schedule ▒
+ 0.05% schedule_preempt_disabled ▒
0.01% mutex_spin_on_owner.isra.3 ▒
0.01% osq_lock ▒
0.01% osq_unlock ▒
+ 0.01% _raw_spin_lock ▒
+ 0.00% apic_timer_interrupt ▒
0.00% smp_reschedule_interrupt ▒
0.00% mutex_optimistic_spin ▒
0.00% _raw_spin_lock ▒
- 10.05% schedule ▒
- 9.90% __schedule ▒
- 4.82% deactivate_task ▒
- 4.45% dequeue_task_fair ▒
+ 3.77% dequeue_entity ▒
0.09% update_cfs_shares ▒
0.07% account_entity_dequeue ▒
0.04% update_curr ▒
0.03% clear_buddies ▒
0.03% set_next_buddy ▒
0.02% update_min_vruntime ▒
0.00% __compute_runnable_contrib ▒
+ 0.20% update_rq_clock.part.80 ▒
0.03% dequeue_entity ▒
0.01% sched_clock_cpu ▒
0.01% update_cfs_shares ▒
0.01% set_next_buddy ▒
0.01% hrtick_update ▒
- 3.23% pick_next_task_fair ▒
+ 1.70% set_next_entity ▒
+ 0.30% pick_next_entity ▒
Also, I now see txg_quiesce going up to 90%:
- 18.28% 0.00% txg_quiesce [kernel.kallsyms] [k] thread_generic_wrapper ▒
- thread_generic_wrapper ▒
- 16.56% txg_quiesce_thread ▒
+ 8.84% __cv_broadcast ▒
+ 2.03% __cv_wait_sig ▒
+ 1.87% mutex_lock ▒
+ 1.12% _raw_spin_lock ▒
0.87% __bitmap_weight ▒
Samples: 238K of event 'cycles:pp', Event count (approx.): 177373524530
Children Self Command Shared Object Symbol ▒
- 18.28% 0.00% txg_quiesce [kernel.kallsyms] [k] thread_generic_wrapper ◆
- thread_generic_wrapper ▒
- 16.56% txg_quiesce_thread ▒
- 8.84% __cv_broadcast ▒
- 8.70% __wake_up ▒
- 8.53% __wake_up_common ▒
- 8.32% autoremove_wake_function ▒
- 8.21% default_wake_function ▒
- 8.11% try_to_wake_up ▒
- 5.23% ttwu_do_activate.constprop.90 ▒
- 4.56% activate_task ▒
- 4.14% enqueue_task_fair ▒
- 3.25% enqueue_entity ▒
- 1.22% update_cfs_shares ▒
- 0.33% update_curr ▒
0.16% __calc_delta ▒
0.02% update_min_vruntime ▒
0.03% account_entity_dequeue ▒
0.03% account_entity_enqueue ▒
0.02% __calc_delta ▒
0.01% update_min_vruntime ▒
+ 0.46% update_curr ▒
+ 0.23% __enqueue_entity
Samples: 238K of event 'cycles:pp', Event count (approx.): 177373524530
Children Self Command Shared Object Symbol ◆
- 18.28% 0.00% txg_quiesce [kernel.kallsyms] [k] thread_generic_wrapper ▒
- thread_generic_wrapper ▒
- 16.56% txg_quiesce_thread ▒
+ 8.84% __cv_broadcast ▒
- 2.03% __cv_wait_sig ▒
- 1.96% cv_wait_common ▒
- 0.92% schedule ▒
+ 0.87% __schedule ▒
0.01% pick_next_task_idle ▒
0.01% _raw_spin_lock_irq ▒
0.00% pick_next_task_fair ▒
0.00% deactivate_task ▒
0.00% rcu_note_context_switch ▒
0.00% finish_task_switch ▒
+ 0.00% apic_timer_interrupt ▒
+ 0.43% mutex_lock ▒
+ 0.15% mutex_unlock ▒
+ 0.09% prepare_to_wait_exclusive ▒
+ 0.09% __wake_up ▒
+ 0.04% _raw_spin_lock ▒
0.01% _raw_spin_unlock_irqrestore ▒
+ 0.01% finish_wait ▒
0.01% _raw_spin_lock_irqsave ▒
0.01% _cond_resched ▒
0.00% __mutex_unlock_slowpath ▒
0.00% __mutex_lock_slowpath ▒
0.00% __schedule ▒
0.00% __wake_up_common ▒
+ 0.00% apic_timer_interrupt ▒
0.05% mutex_unlock ▒
0.01% mutex_lock ▒
0.00% schedule ▒
0.00% finish_wait ▒
0.00% prepare_to_wait_exclusive ▒
0.00% __wake_up ▒
+ 0.00% apic_timer_interrupt ▒
0.00% _raw_spin_lock ▒
- 1.87% mutex_lock ▒
- 0.39% __mutex_lock_slowpath ▒
- 0.38% mutex_optimistic_spin ▒
0.21% mutex_spin_on_owner.isra.3 ▒
0.05% osq_lock ▒
0.02% osq_unlock ▒
- 0.01% apic_timer_interrupt ▒
- smp_apic_timer_interrupt ▒
- 0.01% irq_exit ▒
__do_softirq ▒
run_rebalance_domains ▒
rebalance_domains ▒
update_blocked_averages ▒
- 0.00% local_apic_timer_interrupt ▒
- hrtimer_interrupt ▒
- ktime_get_update_offsets_now ▒
read_tsc
# cat /proc/spl/kstat/zfs/arcstats
11 1 0x01 92 4416 6608467357140 10345217200879
name type data
hits 4 1526265
misses 4 1841784
demand_data_hits 4 468546
demand_data_misses 4 88614
demand_metadata_hits 4 357372
demand_metadata_misses 4 1599485
prefetch_data_hits 4 720
prefetch_data_misses 4 175
prefetch_metadata_hits 4 699627
prefetch_metadata_misses 4 153510
mru_hits 4 102644
mru_ghost_hits 4 48898
mfu_hits 4 725001
mfu_ghost_hits 4 68485
deleted 4 217533
mutex_miss 4 1132345
evict_skip 4 205204542
evict_not_enough 4 2195789
evict_l2_cached 4 0
evict_l2_eligible 4 5206740992
evict_l2_ineligible 4 1047908352
evict_l2_skip 4 0
hash_elements 4 29114
hash_elements_max 4 75820
hash_collisions 4 17787
hash_chains 4 404
hash_chain_max 4 3
p 4 1556864
c 4 33554432
c_min 4 33554432
c_max 4 4111763456
size 4 1154707936
compressed_size 4 332038656
uncompressed_size 4 751435776
overhead_size 4 582264832
hdr_size 4 12791456
data_size 4 94294016
metadata_size 4 820009472
dbuf_size 4 50510512
dnode_size 4 133581840
bonus_size 4 43520640
anon_size 4 311985152
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 468071424
mru_evictable_data 4 0
mru_evictable_metadata 4 0
mru_ghost_size 4 0
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_size 4 134246912
mfu_evictable_data 4 0
mfu_evictable_metadata 4 0
mfu_ghost_size 4 33021440
mfu_ghost_evictable_data 4 262144
mfu_ghost_evictable_metadata 4 32759296
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
memory_throttle_count 4 0
memory_direct_count 4 2689
memory_indirect_count 4 3485
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 1284666
arc_meta_used 4 1060413920
arc_meta_limit 4 3083822592
arc_dnode_limit 4 308382259
arc_meta_max 4 1450769152
arc_meta_min 4 16777216
sync_wait_for_async 4 7222
demand_hit_predictive_prefetch 4 34318
arc_need_free 4 0
arc_sys_free 4 128491520
I can always trigger this by simultaneously starting strongswan inside 600 identical LXD containers (all appear to be zfs clones). Inside each container there is Ubuntu trusty doing a normal init (I've disabled as many boot services as I could; after a normal startup it shows only init, dhclient and systemd-udev processes).
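The parallel start is the trigger; it is equivalent to roughly the following (container names hypothetical, modeled on the terminalNN naming above):
# Starting all 600 clones at once makes every container's init issue
# writes concurrently, unlike the sequential start that works fine.
for i in $(seq 1 600); do
    lxc start "terminal$i" &
done
wait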
@mailinglists35 With #5425 you don't see ddt_get_dedup_dspace and friends in your traces, so the patch is worth keeping. But due to the nature of your workload you continue to experience a lot of lock contention.
If you're inclined to do more testing, I have #5428 and zfsonlinux/spl#587 (they must go together), which should decrease mutex overhead, in my limited testing by about 10-14%.
@ironMann thank you. I have applied PRs #5428 and #587 on top of the existing #5412 + #5425. There is no change in CPU consumption. perf record -ag sleep 10 output pasted below.
PID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2
15023 1 0.98s 0.00s 0K 0K 0K 0K -- - R 1 100% txg_sync
15022 1 0.89s 0.00s 0K 0K 0K 0K -- - R 3 92% txg_quiesce
Samples: 119K of event 'cycles:pp', Event count (approx.): 88912982833
Children Self Command Shared Object Symbol
+ 44.93% 0.00% init [kernel.kallsyms] [k] zfs_write
+ 44.93% 0.00% init [kernel.kallsyms] [k] zpl_write_common_iovec
+ 44.93% 0.00% init [kernel.kallsyms] [k] zpl_write
+ 44.93% 0.00% init [kernel.kallsyms] [k] __vfs_write
+ 44.93% 0.00% init [kernel.kallsyms] [k] vfs_write
+ 44.93% 0.00% init [kernel.kallsyms] [k] sys_write
+ 44.93% 0.00% init [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath
+ 44.42% 1.30% init [kernel.kallsyms] [k] dmu_tx_assign
+ 37.59% 0.19% init [kernel.kallsyms] [k] dmu_tx_wait
+ 36.50% 0.48% init [kernel.kallsyms] [k] txg_wait_open
+ 25.95% 0.00% txg_sync [kernel.kallsyms] [k] thread_generic_wrapper
+ 25.95% 0.00% txg_sync [kernel.kallsyms] [k] kthread
+ 25.95% 0.00% txg_sync [kernel.kallsyms] [k] ret_from_fork
+ 25.75% 0.68% txg_sync [kernel.kallsyms] [k] txg_sync_thread
+ 24.38% 1.42% init [kernel.kallsyms] [k] mutex_lock
+ 22.98% 0.09% init [kernel.kallsyms] [k] __mutex_lock_slowpath
+ 22.71% 0.71% init [kernel.kallsyms] [k] mutex_optimistic_spin
+ 21.74% 0.03% init [kernel.kallsyms] [k] __cv_wait
+ 21.40% 0.83% init [kernel.kallsyms] [k] cv_wait_common
+ 18.52% 0.00% txg_quiesce [kernel.kallsyms] [k] thread_generic_wrapper
+ 18.52% 0.00% txg_quiesce [kernel.kallsyms] [k] kthread
+ 18.52% 0.00% txg_quiesce [kernel.kallsyms] [k] ret_from_fork
+ 17.58% 0.35% txg_sync [kernel.kallsyms] [k] spa_sync
+ 16.88% 3.09% txg_quiesce [kernel.kallsyms] [k] txg_quiesce_thread
+ 12.22% 12.22% init [kernel.kallsyms] [k] mutex_spin_on_owner.isra.3
+ 9.97% 1.49% init [kernel.kallsyms] [k] __schedule
+ 9.74% 0.23% init [kernel.kallsyms] [k] schedule
+ 9.39% 9.39% init [kernel.kallsyms] [k] osq_lock
+ 8.92% 0.11% txg_quiesce [kernel.kallsyms] [k] __cv_broadcast
+ 8.89% 0.03% txg_quiesce [kernel.kallsyms] [k] __wake_up
+ 8.54% 0.22% txg_quiesce [kernel.kallsyms] [k] __wake_up_common
+ 8.31% 0.12% txg_quiesce [kernel.kallsyms] [k] autoremove_wake_function
+ 8.20% 0.02% txg_quiesce [kernel.kallsyms] [k] default_wake_function
+ 8.16% 0.64% txg_quiesce [kernel.kallsyms] [k] try_to_wake_up
+ 7.63% 2.02% txg_sync [kernel.kallsyms] [k] mutex_lock
+ 7.58% 0.12% swapper [kernel.kallsyms] [k] cpu_startup_entry
+ 5.61% 0.01% txg_sync [kernel.kallsyms] [k] __mutex_lock_slowpath
+ 5.58% 0.18% txg_sync [kernel.kallsyms] [k] mutex_optimistic_spin
+ 5.49% 0.02% txg_sync [kernel.kallsyms] [k] zap_contains
+ 5.46% 0.03% txg_sync [kernel.kallsyms] [k] zap_lookup_norm
+ 5.28% 0.06% txg_quiesce [kernel.kallsyms] [k] ttwu_do_activate.constprop.90
+ 4.76% 0.00% swapper [kernel.kallsyms] [k] start_secondary
+ 4.70% 4.70% txg_sync [kernel.kallsyms] [k] osq_lock
+ 4.62% 0.22% txg_quiesce [kernel.kallsyms] [k] activate_task
+ 4.59% 0.12% init [kernel.kallsyms] [k] deactivate_task
+ 4.24% 0.32% init [kernel.kallsyms] [k] dequeue_task_fair
+ 4.18% 0.35% txg_sync [kernel.kallsyms] [k] zio_wait
+ 4.18% 0.34% txg_quiesce [kernel.kallsyms] [k] enqueue_task_fair
+ 3.75% 0.11% txg_sync [kernel.kallsyms] [k] dsl_pool_sync
+ 3.68% 0.76% init [kernel.kallsyms] [k] dequeue_entity
+ 3.34% 0.98% txg_quiesce [kernel.kallsyms] [k] enqueue_entity
+ 3.17% 0.77% init [kernel.kallsyms] [k] pick_next_task_fair
+ 3.08% 0.14% txg_sync [kernel.kallsyms] [k] ddt_sync
+ 3.01% 1.17% swapper [kernel.kallsyms] [k] hrtimer_start_range_ns
+ 2.89% 0.00% swapper [kernel.kallsyms] [k] rest_init
+ 2.89% 0.00% swapper [kernel.kallsyms] [k] start_kernel
+ 2.89% 0.00% swapper [kernel.kallsyms] [k] x86_64_start_reservations
+ 2.89% 0.00% swapper [kernel.kallsyms] [k] x86_64_start_kernel
+ 2.70% 0.03% txg_sync [kernel.kallsyms] [k] zap_lockdir
+ 2.65% 0.03% swapper [kernel.kallsyms] [k] tick_nohz_idle_enter
+ 2.56% 0.13% swapper [kernel.kallsyms] [k] __tick_nohz_idle_enter
+ 2.37% 0.03% txg_sync [kernel.kallsyms] [k] dmu_buf_hold
+ 2.34% 0.68% init [kernel.kallsyms] [k] __cv_broadcast
+ 2.31% 0.02% txg_sync [kernel.kallsyms] [k] zap_lookup_impl
+ 2.30% 1.43% txg_quiesce [kernel.kallsyms] [k] mutex_lock
+ 2.23% 0.07% swapper [kernel.kallsyms] [k] tick_nohz_idle_exit
+ 2.21% 0.01% txg_quiesce [kernel.kallsyms] [k] __cv_wait_sig
+ 2.14% 0.04% txg_sync [kernel.kallsyms] [k] dmu_buf_hold_noread
+ 2.14% 0.28% txg_quiesce [kernel.kallsyms] [k] cv_wait_common
+ 2.00% 0.47% txg_sync [kernel.kallsyms] [k] zio_done
+ 1.97% 0.02% swapper [kernel.kallsyms] [k] tick_nohz_restart
+ 1.91% 1.82% txg_sync [kernel.kallsyms] [k] mutex_unlock
+ 1.78% 0.00% init libc-2.19.so [.] 0xffff80b517d9d710
+ 1.77% 0.03% txg_sync [kernel.kallsyms] [k] fzap_lookup
+ 1.69% 0.00% init libc-2.19.so [.] 0xffff8065c7dcc710
+ 1.68% 0.88% init [kernel.kallsyms] [k] update_cfs_shares
+ 1.67% 0.32% init [kernel.kallsyms] [k] set_next_entity
+ 1.66% 0.04% init [kernel.kallsyms] [k] __wake_up
+ 1.66% 0.00% init [unknown] [k] 0x00005642200e6170
For a higher level overview, try: perf report --sort comm,dso
Samples: 119K of event 'cycles:pp', Event count (approx.): 88912982833
Children Self Command Shared Object Symbol ◆
- 44.93% 0.00% init [kernel.kallsyms] [k] zfs_write ▒
- zfs_write ▒
- 44.42% dmu_tx_assign ▒
- 37.58% dmu_tx_wait ▒
- 36.46% txg_wait_open ▒
- 21.71% __cv_wait ▒
- 21.39% cv_wait_common ▒
- 10.38% mutex_lock ▒
- 10.19% __mutex_lock_slowpath ▒
- 10.10% mutex_optimistic_spin ▒
4.86% mutex_spin_on_owner.isra.3 ▒
4.79% osq_lock ▒
0.08% osq_unlock ▒
+ 0.05% schedule_preempt_disabled ▒
+ 0.04% apic_timer_interrupt ▒
0.00% ret_from_intr ▒
+ 0.04% schedule_preempt_disabled ▒
+ 0.01% _raw_spin_lock ▒
0.01% mutex_spin_on_owner.isra.3 ▒
0.01% osq_unlock ▒
0.00% osq_lock ▒
+ 0.00% reschedule_interrupt ▒
+ 0.00% apic_timer_interrupt ▒
0.00% mutex_optimistic_spin ▒
- 9.38% schedule ▒
- 9.23% __schedule ▒
- 4.55% deactivate_task ▒
- 4.19% dequeue_task_fair ▒
- 3.61% dequeue_entity ▒
- 1.54% update_cfs_shares ▒
- 0.66% update_curr ▒
0.20% __calc_delta ▒
0.13% update_min_vruntime ▒
0.05% account_entity_enqueue ▒
0.04% account_entity_dequeue ▒
0.02% __calc_delta ▒
0.02% update_min_vruntime ▒
+ 0.73% update_curr ▒
0.29% account_entity_dequeue ▒
0.22% cpuacct_charge ▒
0.06% update_min_vruntime ▒
0.03% account_entity_enqueue ▒
0.02% clear_buddies ▒
0.09% update_cfs_shares ▒
0.07% account_entity_dequeue ▒
0.04% set_next_buddy ▒
0.03% update_curr ▒
0.02% update_min_vruntime ▒
0.02% clear_buddies ▒
0.00% __compute_runnable_contrib ▒
+ 0.18% update_rq_clock.part.80 ▒
0.03% dequeue_entity ▒
0.01% sched_clock_cpu ▒
0.01% update_cfs_shares ▒
0.01% set_next_buddy ▒
0.00% hrtick_update ▒
+ 3.06% pick_next_task_fair ▒
+ 0.08% _raw_spin_lock_irq ▒
0.07% finish_task_switch ▒
+ 0.05% apic_timer_interrupt ▒
0.03% pick_next_entity ▒
0.03% put_prev_entity ▒
0.02% check_cfs_rq_runtime ▒
0.02% set_next_entity ▒
0.01% rcu_note_context_switch ▒
0.01% dequeue_task_fair ▒
0.01% update_curr ▒
0.01% update_rq_clock.part.80 ▒
+ 0.00% pick_next_task_idle ▒
+ 0.00% reschedule_interrupt ▒
0.00% __msecs_to_jiffies ▒
0.00% smp_reschedule_interrupt ▒
0.00% ret_from_intr ▒
0.01% rcu_note_context_switch ▒
0.01% _raw_spin_lock_irq ▒
0.01% pick_next_task_fair ▒
0.01% finish_task_switch ▒
0.01% deactivate_task ▒
0.00% pick_next_task_idle ▒
+ 0.48% prepare_to_wait_exclusive ▒
+ 0.22% mutex_unlock ▒
0.02% __schedule ▒
+ 0.02% finish_wait ▒
0.02% _raw_spin_unlock_irqrestore ▒
0.01% __mutex_lock_slowpath ▒
0.01% _cond_resched ▒
0.00% _raw_spin_lock_irqsave ▒
0.00% __mutex_unlock_slowpath ▒
+ 0.00% apic_timer_interrupt ▒
0.18% mutex_unlock ▒
0.12% schedule ▒
0.01% prepare_to_wait_exclusive ▒
0.01% finish_wait ▒
0.01% mutex_lock ▒
+ 0.01% apic_timer_interrupt ▒
- 12.11% mutex_lock ▒
- 11.86% __mutex_lock_slowpath ▒
- 11.73% mutex_optimistic_spin ▒
6.61% mutex_spin_on_owner.isra.3 ▒
4.57% osq_lock ▒
0.10% osq_unlock ▒
+ 0.07% schedule_preempt_disabled ▒
+ 0.05% apic_timer_interrupt ▒
+ 0.07% schedule_preempt_disabled ▒
+ 0.02% _raw_spin_lock ▒
0.01% osq_lock ▒
0.01% mutex_spin_on_owner.isra.3 ▒
0.00% osq_unlock ▒
And the expanded call path for txg_quiesce:
- 18.52% 0.00% txg_quiesce [kernel.kallsyms] [k] thread_generic_wrapper ▒
- thread_generic_wrapper ▒
- 16.88% txg_quiesce_thread ▒
- 8.91% __cv_broadcast ▒
- 8.80% __wake_up ▒
- 8.53% __wake_up_common ◆
- 8.31% autoremove_wake_function ▒
- 8.19% default_wake_function ▒
- 8.10% try_to_wake_up ▒
- 5.24% ttwu_do_activate.constprop.90 ▒
+ 4.59% activate_task ▒
- 0.58% ttwu_do_wakeup ▒
- 0.52% check_preempt_curr ▒
- 0.36% check_preempt_wakeup ▒
- 0.10% wakeup_preempt_entity.isra.55 ▒
__calc_delta ▒
- 0.04% update_curr ▒
cpuacct_charge ▒
- 0.02% resched_curr ▒
- 0.01% native_smp_send_reschedule ▒
- physflat_send_IPI_mask ▒
- default_send_IPI_mask_sequence_phys ▒
- find_next_bit ▒
_find_next_bit.part.0 ▒
0.00% physflat_send_IPI_mask ▒
0.01% __calc_delta ▒
0.00% set_next_buddy ▒
0.01% update_curr ▒
0.01% wakeup_preempt_entity.isra.55 ▒
0.00% resched_curr ▒
0.01% resched_curr ▒
0.01% check_preempt_wakeup ▒
0.01% enqueue_task_fair ▒
0.01% check_preempt_curr ▒
0.01% update_rq_clock.part.80 ▒
- 1.20% select_task_rq_fair ▒
- 0.92% select_idle_sibling ▒
0.15% idle_cpu ▒
- 0.11% find_next_bit ▒
_find_next_bit.part.0 ▒
0.07% __bitmap_intersects ▒
0.07% _find_next_bit.part.0 ▒
0.00% cpumask_next_and ▒
0.06% find_next_bit ▒
0.05% idle_cpu ▒
0.05% __bitmap_intersects ▒
- 0.55% _raw_spin_lock ▒
native_queued_spin_lock_slowpath ▒
0.20% _raw_spin_lock_irqsave ▒
0.14% ttwu_stat ▒
0.05% task_waking_fair ▒
0.04% _raw_spin_unlock_irqrestore ▒
- 0.01% ttwu_do_wakeup ▒
- check_preempt_curr ▒
check_preempt_wakeup ▒
0.01% select_idle_sibling ▒
0.01% activate_task ▒
0.00% native_queued_spin_lock_slowpath ▒
- 0.00% set_task_cpu ▒
- migrate_task_rq_fair ▒
remove_entity_load_avg ▒
0.02% ttwu_stat ▒
0.02% _raw_spin_unlock_irqrestore ▒
0.01% task_waking_fair ▒
0.01% ttwu_do_activate.constprop.90 ▒
0.01% _raw_spin_lock ▒
0.01% select_task_rq_fair ▒
0.00% _raw_spin_lock_irqsave ▒
0.00% set_task_cpu ▒
0.00% try_to_wake_up ▒
0.01% default_wake_function ▒
- 0.20% apic_timer_interrupt ▒
- smp_apic_timer_interrupt ▒
- 0.20% irq_exit ▒
- __do_softirq ▒
- 0.19% run_rebalance_domains ▒
- rebalance_domains ▒
0.19% update_blocked_averages ▒
0.00% __compute_runnable_contrib ▒
- 0.00% load_balance ▒
deactivate_task ▒
dequeue_task_fair ▒
dequeue_entity ▒
0.00% cpu_needs_another_gp ▒
- 0.00% local_apic_timer_interrupt ▒
- hrtimer_interrupt ▒
- __hrtimer_run_queues ▒
- 0.00% tick_sched_timer ▒
tick_sched_handle.isra.14 ▒
- update_process_times ▒
0.00% rcu_check_callbacks ▒
0.00% _raw_spin_lock ▒
- 0.00% hrtimer_wakeup ▒
wake_up_process ▒
try_to_wake_up ▒
_raw_spin_lock_irqsave ▒
0.05% _raw_spin_lock_irqsave ▒
0.01% _raw_spin_unlock_irqrestore ▒
0.01% autoremove_wake_function ▒
+ 0.01% ret_from_intr ▒
0.00% __wake_up_common ▒
0.00% _raw_spin_unlock_irqrestore ▒
- 2.20% __cv_wait_sig ▒
- 2.14% cv_wait_common ▒
- 0.88% schedule ▒
- 0.85% __schedule ▒
- 0.39% deactivate_task ▒
- 0.31% dequeue_task_fair ▒
- 0.25% dequeue_entity ▒
- 0.07% update_curr ▒
0.02% cpuacct_charge ▒
0.01% update_min_vruntime ▒
0.03% account_entity_dequeue ▒
0.01% update_cfs_shares ▒
0.01% cpuacct_charge ▒
0.00% clear_buddies ▒
0.00% update_min_vruntime ▒
0.02% account_entity_dequeue ▒
0.01% clear_buddies ▒
0.00% update_min_vruntime ▒
0.00% update_curr ▒
0.00% update_cfs_shares ▒
- 0.05% update_rq_clock.part.80 ▒
- sched_clock_cpu ▒
- 0.01% sched_clock ▒
native_sched_clock ▒
0.00% native_sched_clock ▒
0.00% dequeue_entity ▒
0.00% hrtick_update ▒
0.00% sched_clock_cpu ▒
+ 0.08% pick_next_task_fair ▒
0.06% finish_task_switch ▒
+ 0.03% _raw_spin_lock_irq ▒
+ 0.03% pick_next_task_idle ▒
0.01% rcu_note_context_switch ▒
0.01% dequeue_task_fair ▒
0.01% __msecs_to_jiffies ▒
0.00% update_rq_clock.part.80 ▒
0.00% put_prev_task_fair ▒
+ 0.00% ret_from_intr ▒
+ 0.00% apic_timer_interrupt ▒
0.00% update_curr ▒
0.00% check_cfs_rq_runtime ▒
+ 0.00% reschedule_interrupt ▒
0.01% _raw_spin_lock_irq ▒
0.01% deactivate_task ▒
0.00% pick_next_task_idle ▒
0.00% pick_next_task_fair ▒
0.00% finish_task_switch ▒
0.00% rcu_note_context_switch ▒
+ 0.00% apic_timer_interrupt ▒
0.00% update_rq_clock.part.80 ▒
+ 0.62% mutex_lock ▒
+ 0.15% mutex_unlock ▒
+ 0.09% prepare_to_wait_exclusive ▒
+ 0.08% __wake_up ▒
+ 0.01% finish_wait ▒
0.01% _raw_spin_unlock_irqrestore ▒
0.01% __mutex_lock_slowpath ▒
0.00% __schedule ▒
0.00% __mutex_unlock_slowpath ▒
0.00% _cond_resched ▒
0.00% _raw_spin_lock_irqsave ▒
0.00% __wake_up_common ▒
+ 0.00% ret_from_intr ▒
0.04% mutex_unlock ▒
0.01% mutex_lock ▒
+ 0.01% apic_timer_interrupt ▒
0.00% schedule ▒
0.00% prepare_to_wait_exclusive ▒
0.00% __wake_up ▒
0.00% finish_wait ▒
- 1.57% mutex_lock ▒
- 0.31% __mutex_lock_slowpath ▒
- 0.30% mutex_optimistic_spin ▒
0.15% mutex_spin_on_owner.isra.3 ▒
0.05% osq_lock ▒
- 0.01% apic_timer_interrupt ▒
- smp_apic_timer_interrupt ▒
irq_exit ▒
__do_softirq ▒
run_rebalance_domains ▒
rebalance_domains ▒
update_blocked_averages ▒
0.01% osq_unlock ▒
0.00% mutex_spin_on_owner.isra.3 ▒
0.00% osq_lock ▒
0.00% osq_unlock ▒
+ 0.00% schedule_preempt_disabled ▒
0.00% mutex_optimistic_spin ▒
Are you sure you're not bumping up against quota? One way to check is to see how often the txgs close and the time taken. To do this:
echo 20 >> /sys/module/zfs/parameters/zfs_txg_history
cat /proc/spl/kstat/zfs/POOLNAME/txgs
The "stime" column shows the sync time in nanoseconds and the "otime" column shows the open time, which is nominally 5 seconds by default.
If otime is very short, then it might be busy trying to keep quotas under control, or perhaps something is calling sync() in a tight loop (or a few other, less common conditions). -- richard
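A quick way to summarize otime from that kstat (a hedged awk sketch; otime is the 9th column per the header shown in the output below, and POOLNAME is a placeholder):
# Average open time in seconds across committed ("C") txgs.
awk '$3 == "C" { sum += $9; n++ } END { if (n) printf "%.3f s\n", sum / n / 1e9 }' \
    /proc/spl/kstat/zfs/POOLNAME/txgs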
@richardelling thank you for the hint. Indeed, otime is very short. What would you recommend to do next?
This is with the pool idle, then with the pool hung:
41 0 0x01 20 2240 355956345204422 592620693180043
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
390335741 592524089526612 C 0 0 0 0 0 4999813463 9558 10102 34374
390335742 592529089340075 C 0 0 0 0 0 5000066598 7923 80330 74050
390335743 592534089406673 C 0 0 0 0 0 4999952299 6910 43929 52993
390335744 592539089358972 C 0 0 0 0 0 4999737267 7072 8194 54133
390335745 592544089096239 C 0 4608 0 2 0 4999978265 7553 8449 4681435
390335746 592549089074504 C 0 0 0 0 0 4999966827 5928 11332 57286
390335747 592554089041331 C 0 0 0 0 0 4999991506 5522 12925 55682
390335748 592559089032837 C 0 0 0 0 0 4999964554 6597 14889 51977
390335749 592564088997391 C 0 0 0 0 0 4999972149 9605 19025 74637
390335750 592569088969540 C 0 0 0 0 0 4999905451 7589 19669 70838
390335751 592574088874991 C 0 0 0 0 0 4999905973 7486 21393 75209
390335752 592579088780964 C 0 0 0 0 0 4999947644 7569 20411 73451
390335753 592584088728608 C 0 0 0 0 0 4999932733 7519 21381 73751
390335754 592589088661341 C 0 0 0 0 0 4999927671 6868 18351 66735
390335755 592594088589012 C 0 0 0 0 0 4999959805 8398 21063 74540
390335756 592599088548817 C 0 0 0 0 0 4999940195 7516 21099 74230
390335757 592604088489012 C 0 0 0 0 0 4999936205 9029 21370 73839
390335758 592609088425217 C 0 0 0 0 0 4999946138 8542 21363 73451
390335759 592614088371355 C 0 0 0 0 0 4999930517 9123 21093 72391
390335760 592619088301872 O 0 0 0 0 0 0 0 0 0
41 0 0x01 20 2240 355956345204422 593059687794602
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
394798944 593059687507669 C 0 0 0 0 0 15431 3042 10984 15352
394798945 593059687523100 C 0 0 0 0 0 18348 3407 9931 12753
394798946 593059687541448 C 0 0 0 0 0 14757 2928 11041 11982
394798947 593059687556205 C 0 0 0 0 0 15195 3195 9883 11183
394798948 593059687571400 C 0 0 0 0 0 14734 3007 8949 12170
394798949 593059687586134 C 0 0 0 0 0 13584 3234 9789 12375
394798950 593059687599718 C 0 0 0 0 0 14194 2946 10668 11913
394798951 593059687613912 C 0 0 0 0 0 14842 3078 9861 11320
394798952 593059687628754 C 0 0 0 0 0 14130 3099 9529 11692
394798953 593059687642884 C 0 0 0 0 0 14128 3305 9119 11686
394798954 593059687657012 C 0 0 0 0 0 13594 2922 9818 11811
394798955 593059687670606 C 0 0 0 0 0 13931 3091 9930 12656
394798956 593059687684537 C 0 0 0 0 0 14249 3182 10783 13954
394798957 593059687698786 C 0 0 0 0 0 18395 3444 8221 11817
394798958 593059687717181 C 0 0 0 0 0 15273 2930 7713 11697
394798959 593059687732454 C 0 0 0 0 0 11932 3311 9513 11913
394798960 593059687744386 C 0 0 0 0 0 13893 3217 9898 11206
394798961 593059687758279 S 0 0 0 0 0 14526 3284 8772 0
394798962 593059687772805 W 0 0 0 0 0 16397 3030 0 0
394798963 593059687789202 O 0 0 0 0 0 0 0 0 0
zfs list -o space is a good way to see if you're near a quota limit.
I am away from that pool over the weekend so I won't be able to paste the output, but I know I have not set up a quota on any dataset. The pool is using about 10% of its capacity. I will ask on the LXD GitHub whether they set up quotas by default on their datasets, but I highly doubt quota has anything to do with this, because if I start the containers one by one I don't experience the hang. I must reiterate what I've said before: this only happens when I give LXD many simultaneous tasks while there is little free RAM.
I have also managed to produce similar txg_sync and txg_quiesce CPU load by running many 'sync' commands in parallel, but in that case the pool is still usable and the tasks do not stay constantly at 100% and 90% as in my case.
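That load was roughly the following (iteration count arbitrary); each sync(1) call ends up waiting for the current txg to sync, so many of them in flight keep the sync and quiesce threads busy:
# Many concurrent sync(1) calls; unlike the container storm, the pool
# stays usable and the txg threads do not pin at 100%/90% permanently.
for i in $(seq 1 100); do
    sync &
done
wait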
@richardelling snip from zfs get quota and zfs list -o space below; full output on pastebin if you want (all datasets have value 'none' with source 'default' for the quota and refreserv attributes):
# zfs get -r quota pool|grep -v -- zfs-auto-snap
NAME PROPERTY VALUE SOURCE
pool/lxd quota none default
pool/lxd/containers quota none default
pool/lxd/containers/gitlab quota none default
# zfs get refreserv -t filesystem pool/lxd -r
NAME PROPERTY VALUE SOURCE
pool/lxd refreservation none default
pool/lxd/containers refreservation none default
pool/lxd/containers/gitlab refreservation none default
pool/lxd/containers/hang refreservation none default
pool/lxd/containers/ipsec-client refreservation none default
# zfs list -o space -r pool/lxd
NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD
pool/lxd 875G 30.9G 9K 19K 0 30.9G
pool/lxd/containers 875G 27.7G 0 19K 0 27.7G
pool/lxd/containers/gitlab 875G 1010M 482M 528M 0 0
pool/lxd/containers/hang 875G 47K 0 47K 0 0
pool/lxd/containers/ipsec-client 875G 582M 477M 105M 0 0
# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
pool 1008G 101G 907G - 6% 10% 1.00x ONLINE -
@behlendorf @richardelling
So I hit this today with fairly recent zfs and kernel (4.9.20+). txg_sync and txg_quiesce were pegged at 100% CPU, as were some qemu processes that were (transiently) stuck in D state and unable to terminate.
I ended up removing the L2ARC SSDs from the pool and everything came unstuck...
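For reference, cache devices can be detached from a live pool (device name hypothetical):
# L2ARC devices hold no irreplaceable data, so removal is safe online.
zpool remove pool sdX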
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
100% CPU wouldn't bother me, but when it occurs I cannot read/write data on the pool; only zpool/zfs get commands work. perf top -p $(pidof txg_sync) shows this.
Sometimes it goes away after echo 3 > /proc/sys/vm/drop_caches (or maybe that's just a coincidence and it went away by itself); most other times I must reboot.