Write performance breakdown to 0.2 % of initial speed, I/O intensive processes blocking for up to 1 hour

I've regularrily experienced a breakdown of write performance to ~0.2 % of the initial speed for different vdevs (sparse files on ext4 filesystem as well as zpool on entire disk) after the device performed incredibly well at the beginning. The symptoms are identical for all vdevs:

no matter whether a large file (~1TB) or a bunch of small files (~1000 à 100 MB) are transfered the initial performance is great, then comes the breakdown (< 1 MB/s for all devices I used to far, all performing with > 120 MB/s over USB 3.0). Transfers of a single small file seem to be less vunerable to the issue than large files or bunches of small files.
same behavior for mv, cp, rsync and nautilus used for file transfer
during the transfer the processes rarely leave the state "uninterruptible"
aborting the transfer (sending SIGINT to mv, cp, rsync or pressing abort button in nautilus) causes the processes to stay in state "uninterruptible" and blocking for up to 1 hour finishing the write (even for 1 GB of data transfered in 5 seconds in the best case scenario I experienced). The same happens from file to file after a transfer almost has completed (rsync and nautilus show 100 % completion (might be 99.999%)). Right before the process returns (successfully or with returncode != 0 depending on whether it has been aborted or not) the write performance is the usual, i.e. the last 200 MB are usually written in a burst/peak after this 1 hour of starving at < 1MB/s read/write).

After initializing the pool I copied 400 GB of a 1TB file with 200 MB/s (high CPU and memory usage, but incredibly great performance), then a breakdown occured.

Further information

# zpool list
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
data      390G   310G  80,3G    79%  1.07x  ONLINE  -
rpool1    195G  90,2G   105G    46%  1.32x  ONLINE  -
samsung   928G   178G   750G    19%  1.19x  ONLINE  -
wd2      4,97T   367G  4,61T     7%  1.00x  ONLINE  -

data and rpool1 are on partitions on the same disk, samsung is on a SAMSUNG HD103SI 1TB HDD connected via USB 3.0 and wd2 is a 5TB sparse file on a ext4 filesystem on a WD WD50EFRX 5TB HDD connected via USB 3.0 as well. The pool on the sparse file might cause issues, but it isn't involved directly as transfer source or target (see #2880 for issues directly related).

I'm using 0.6.3 on Ubuntu 14.10 amd64 with Linux 3.16.7 and tested with 3.16.0-24-generic as well.

@krichter722 see my comment in https://github.com/zfsonlinux/zfs/issues/2880, the problem is most likely the same as there.

Sorry, I just mixed up my own issues. #2880 is related to the sparse files on ext4 and double defragmentation and the overhead you've described there. This issue is unrelated and I tested transfers on the same pool in the same dataset, on the same pool between different datasets and between pools in combination with compression property set to off, lz4 or gzip and dedup set to on and off. lz4_compress is enabled on all pools. I run zpool export <name of sparse file pool> and rebooted, but it doesn't have an effect, the write breakdown still occurs.

Could you please supply the contents of /proc/cpuinfo /proc/meminfo /proc/spl/kstat/zfs/arcstats and the output of iostat 10 3 while the system is in the state you have a problem with?

Yes, of course, thanks for your interest! The following data corresponds to the state during copying ~ 1000 video files (to give you a clue about compression ration which can be expected) of 100 GB size from an 5TB ext4 partition (70 % full, covering the whole device connected via USB 3.0) to the samsung pool (details below). Average performance is 4 MB/s with (~400 KB/s over minutes and then bursts between 80 and 170 MB/s). Copying run for 10 minutes and proceeded < 5 %.

$ cat  /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 2201.000
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips    : 4389.79
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 1200.000
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips    : 4389.79
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 1600.000
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 1
cpu cores   : 4
apicid      : 2
initial apicid  : 2
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips    : 4389.79
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 1200.000
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 1
cpu cores   : 4
apicid      : 3
initial apicid  : 3
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips    : 4389.79
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 4
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 2201.000
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 2
cpu cores   : 4
apicid      : 4
initial apicid  : 4
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips    : 4389.79
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 5
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 1500.000
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 2
cpu cores   : 4
apicid      : 5
initial apicid  : 5
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips    : 4389.79
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 6
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 1200.000
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 6
initial apicid  : 6
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips    : 4389.79
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping    : 9
microcode   : 0x17
cpu MHz     : 2201.000
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 7
initial apicid  : 7
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips    : 4389.79
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

$ cat  /proc/meminfo
MemTotal:       16286288 kB
MemFree:          210544 kB
Buffers:            9568 kB
Cached:          5795852 kB
SwapCached:            0 kB
Active:          1424244 kB
Inactive:        5393172 kB
Active(anon):    1014096 kB
Inactive(anon):   323556 kB
Active(file):     410148 kB
Inactive(file):  5069616 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       8041468 kB
SwapFree:        8041468 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:       1011548 kB
Mapped:           277056 kB
Shmem:            325664 kB
Slab:             572056 kB
SReclaimable:     147304 kB
SUnreclaim:       424752 kB
KernelStack:        7056 kB
PageTables:        26744 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    16184612 kB
Committed_AS:    3362756 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     7459012 kB
VmallocChunk:   34350412096 kB
HardwareCorrupted:     0 kB
AnonHugePages:    186368 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      239608 kB
DirectMap2M:    16390144 kB

$ cat  /proc/spl/kstat/zfs/arcstats
5 1 0x01 85 4080 5687918837 2503840648841
name                            type data
hits                            4    3962920
misses                          4    455145
demand_data_hits                4    1732529
demand_data_misses              4    23009
demand_metadata_hits            4    2079048
demand_metadata_misses          4    197237
prefetch_data_hits              4    644
prefetch_data_misses            4    99135
prefetch_metadata_hits          4    150699
prefetch_metadata_misses        4    135764
mru_hits                        4    1069300
mru_ghost_hits                  4    2211
mfu_hits                        4    2742421
mfu_ghost_hits                  4    849
deleted                         4    90231
recycle_miss                    4    6269
mutex_miss                      4    21
evict_skip                      4    2753315
evict_l2_cached                 4    0
evict_l2_eligible               4    5012334592
evict_l2_ineligible             4    152332288
hash_elements                   4    443194
hash_elements_max               4    444351
hash_collisions                 4    585133
hash_chains                     4    129673
hash_chain_max                  4    10
p                               4    3401697280
c                               4    4459393856
c_min                           4    4194304
c_max                           4    8338579456
size                            4    4452018496
hdr_size                        4    161985360
data_size                       4    1916845568
meta_size                       4    1826650624
other_size                      4    546536944
anon_size                       4    535326208
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    2825533440
mru_evict_data                  4    1366906368
mru_evict_metadata              4    988877312
mru_ghost_size                  4    1629555712
mru_ghost_evict_data            4    1616857600
mru_ghost_evict_metadata        4    12698112
mfu_size                        4    382636544
mfu_evict_data                  4    21542400
mfu_evict_metadata              4    194761728
mfu_ghost_size                  4    2713171456
mfu_ghost_evict_data            4    2667750400
mfu_ghost_evict_metadata        4    45421056
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    5
memory_direct_count             4    83
memory_indirect_count           4    6982
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    2535172928
arc_meta_limit                  4    6253934592
arc_meta_max                    4    2542331328

$ sudo zpool iostat samsung 10 3
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
samsung      164G   764G    172    536   839K  5,07M
samsung      164G   764G     81     24   187K  2,98M
samsung      164G   764G     70     16   165K  2,09M

I added a mirror to the samsung pool (on full device, so no sparse file defragmentation impact) today and the behavior is the exact same:

$ sudo zpool status samsung
  pool: samsung
 state: ONLINE
  scan: resilvered 54,0G in 0h46m with 0 errors on Tue Nov 11 12:12:44 2014
config:

    NAME                                               STATE     READ WRITE CKSUM
    samsung                                            ONLINE       0     0     0
      mirror-0                                         ONLINE       0     0     0
        usb-SAMSUNG_HD103SI_0123456789ABCD0000E9-0:0   ONLINE       0     0     0
        usb-asmedia_ASMT1053_00000000000000000000-0:0  ONLINE       0     0     0

errors: No known data errors

My guess is your disks are busy seeking, iostat (instead of zpool iostat) would give load for the drives also.

Could you also please post the output of zpool history samsung zpool status -D samsung and tell me how you copy the data (what kind of command you use), so i can get a clue what the pool went though, what properties are set and estimate the impact of you using dedup?

I tried both moving and copying with different programs (because I don't know them in detail and hoped to get different results for different mechanisms/usage of library functions, etc.), e.g. mv /path/to/source/ /path/to/target/, cp -a /path/to/source /path/to/target (in order to check bad write performance during deletion in /path/to/source), rsync -a /path/to/source/ /path/to/target/ (both with and without --remove-source-files).

# iostat 10 3
Linux 3.16.0-24-generic (richter-lenovo-IdeaPad-Z500)   12.11.2014  _x86_64_(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5,81    0,02    4,82   28,63    0,00   60,71

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              77,30       938,09       640,37    1616713    1103616
sdc              83,58      5242,48        10,06    9034937      17344
sdd             280,35      1249,21      5211,12    2152896    8980904

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3,02    0,01    4,45   53,68    0,00   38,84

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              31,80        36,00       385,60        360       3856
sdc               0,30         0,00         1,60          0         16
sdd             340,60       935,20     13337,20       9352     133372

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2,68    0,29    4,44   54,15    0,00   38,44

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              44,50       260,00       186,00       2600       1860
sdc               0,00         0,00         0,00          0          0
sdd             338,00       908,40     13840,40       9084     138404

#  zpool history samsung
History for 'samsung':
2014-11-10.07:31:16 zfs set dedup=on samsung
2014-11-10.07:31:25 zfs set compression=lz4 samsung
2014-11-11.06:29:23 zfs create -o recordsize=8K samsung/videos
2014-11-11.08:22:17 zpool scrub samsung
2014-11-11.13:27:59 zpool set feature@lz4_compress=enabled samsung
2014-11-11.13:28:21 zpool set feature@lz4_compress=enabled samsung
2014-11-11.13:30:05 zfs set compression=gzip samsung/videos
2014-11-11.13:54:19 zpool set feature@lz4_compress=enabled samsung
2014-11-11.14:45:25 zfs create samsung/no_compression

I experimented with recordsize != 128K on dataset samsung/videos and compressions different from lz4 in dataset samsung/videos, but that doesn't change (except that gzip consumes way more CPU than lz4 during rare write bursts.

After(!) conducting the above commands and operation generating the provided output, I accidentially pulled the USB cable for the mirror of samsung and it'll probably take 1 day to resilver (200 GB!). I'll post the output nevertheless now, I guess it might be interesting (after 13 minutes of resilvering):

zpool status -D samsung
  pool: samsung
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Nov 12 19:10:13 2014
    798M scanned out of 207G at 1,01M/s, 58h26m to go
    798M resilvered, 0,38% done
config:

    NAME                                    STATE     READ WRITE CKSUM
    samsung                                 ONLINE       0     0     0
      mirror-0                              ONLINE       0     0     0
        ata-SAMSUNG_HD103SI_S1Y5J90Z427433  ONLINE       0     0     5  (resilvering)
        ata-ST1000DM003-1ER162_S4Y048C4     ONLINE       0     0     0

errors: No known data errors

 dedup: DDT entries 4891265, size 660 on disk, 213 in core

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    4,60M    197G    196G    196G    4,60M    197G    196G    196G
     2    61,3K   7,67G   7,63G   7,63G     123K   15,3G   15,3G   15,3G
     4        1    128K     75K     76K        4    512K    300K    304K
    32        1      8K     512      4K       61    488K   30,5K    244K
   128        1    128K      1K      4K      163   20,4M    163K    652K
 Total    4,66M    204G    203G    204G    4,72M    212G    211G    211G

Looking at the DDT i would say that activating dedup on the pool and setting 8K recordsize are the reasons for our problems. Plus USB-Drives have higher latency compared to directly connected drives.

Setting 8K blocksize inflates your metadata by factor of 16 for larger files (compared to default)
Dedup leads to metadata amplification (dedup reference counts have to be maintained), the data for the currently existing entries already need >1GB after being read into RAM (which will take some time to fetch them since they'll be scattered around the drive).

zpool history listing dosn't go back till pool creation, maybe you also have set atime=on and sync=always?

So the issue here is imho not a technical one, please recreate the pool and try with default options to get a baseline for zfs performace so you can compare the effect of 'tunings'.

Bottom line:

For storing compressed files (videos) compression is relatively pointless.
Reducing the blocksize unless you need to tune for databases shouldn't be done.
Dedup (unless you really have a worthy use-case, ideally on pools backed fast SSDs and the DDT completely fit into ARC while leaving enough space for normal operation) is a bad idea.

Thanks for your help. The dataset with 8K blocksize has only be created for testing purposes, the mentioned copy and move operations have been while moving and copying to the samsung pool (i.e. not to the specific subset) which has a recordsize of 128K (no set explicitly and therefore adjusted automatically as far as I know). I didn't change any properties besides dedup and compression on the pool

# zfs get all samsung
NAME     PROPERTY              VALUE                 SOURCE
samsung  type                  filesystem            -
samsung  creation              Mo Nov 10  7:30 2014  -
samsung  used                  215G                  -
samsung  available             706G                  -
samsung  referenced            163G                  -
samsung  compressratio         1.00x                 -
samsung  mounted               yes                   -
samsung  quota                 none                  default
samsung  reservation           none                  default
samsung  recordsize            128K                  default
samsung  mountpoint            /samsung              default
samsung  sharenfs              off                   default
samsung  checksum              on                    default
samsung  compression           lz4                   local
samsung  atime                 on                    default
samsung  devices               on                    default
samsung  exec                  on                    default
samsung  setuid                on                    default
samsung  readonly              off                   default
samsung  zoned                 off                   default
samsung  snapdir               hidden                default
samsung  aclinherit            restricted            default
samsung  canmount              on                    default
samsung  xattr                 on                    default
samsung  copies                1                     default
samsung  version               5                     -
samsung  utf8only              off                   -
samsung  normalization         none                  -
samsung  casesensitivity       sensitive             -
samsung  vscan                 off                   default
samsung  nbmand                off                   default
samsung  sharesmb              off                   default
samsung  refquota              none                  default
samsung  refreservation        none                  default
samsung  primarycache          all                   default
samsung  secondarycache        all                   default
samsung  usedbysnapshots       0                     -
samsung  usedbydataset         163G                  -
samsung  usedbychildren        52,2G                 -
samsung  usedbyrefreservation  0                     -
samsung  logbias               latency               default
samsung  dedup                 on                    local
samsung  mlslabel              none                  default
samsung  sync                  standard              default
samsung  refcompressratio      1.00x                 -
samsung  written               163G                  -
samsung  logicalused           213G                  -
samsung  logicalreferenced     163G                  -
samsung  snapdev               hidden                default
samsung  acltype               off                   default
samsung  context               none                  default
samsung  fscontext             none                  default
samsung  defcontext            none                  default
samsung  rootcontext           none                  default
samsung  relatime              off                   default

I agree with your points on compression of video files. My issue is that the breakdown to 500 KB/s write performance per se and blocking of applications for 1 hour is an extreme consequence of wrong usage of properties which can maybe be improved slightly. I wouldn't have opened it if it was at 20 or 10 % of what I know the disk to perform with ext4 or btrfs and would have investigated the configuration deeper (by posting on the mailing list and probably getting the same good advices from you there).

After no progress with adjustment of properties, I found other data which allows to confirm your points and isolate the issue. I copied 300 GB of a bup directory (> 100 files of 1 GB, the program intends to avoid saving extranous data). There're certain similarities to video files, but I succeeded to avoid the breakdown when I turn off both compression and dedup (which I didn't with the video files).

I think that the problem stems from dedup exclusively, lz4 compression should not have an ill impact (at least neither have i noticed it on my systems, nor heard about it on the mailing list).

Dedup could make sense for example in case you have many near-identical VM disk images and by deduping them you get them (and all associated metadata) to fit (and stay, given the regular OS updates we sad-/gladly see these days) completely in ARC.

But the possibility for dedup to backfire performance wise (like you experienced) is very real so one should carefully test with realworld data&load before deploying. For example: ARC warmup time (and associated I/O) after reboot could lead to performance drops of several magnitudes compared to a hot system. On normal SOHO setups imho it dosn't make sense, given the low TB/$ cost for harddrives - so it's best to stay away from it and never ever activate it.

About the 8K thingy: i thought you used the /video filesystem to copy the videos onto.

Offtopic: Good practice is to not use the base filesystem of a pool, but only use it as a container to set properties on (to inherit them to child datasets).

Glad i could help, have a good time with ZFS on Linux.

I'm also glad you help, thanks again! Here's the thing now: Today I needed to use a 3.2.x kernel (and installed 3.2.64 from ~last week) on the very same system. Guess the write performance with dedup=on and compression=lz4 on the samsung pool? 60 MB/s average and 200 MB/s peaks (which I guess is very great for the small i7 processor)!! What now?! One thing to be mentioned is that the reason for the switch to 3.2.x where problems on 3.16 with the xhci (USB 3.0) driver (kernel panic for a damaged device unrelated to the problem and detached during the described issue), but I don't believe that there's an impact on write performance (the kernel would crash or something fatal would happen, besides that the system writes 180 MB/s to ext4 filesystems on 3.16 over USB 3.0). Output of the same commands like above (except cpuinfo because I didn't touch hardware):

# zpool status -D samsung  pool: samsung
 state: ONLINE
status: One or more devices has been taken offline by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: resilvered 35,2G in 1h23m with 0 errors on Thu Nov 13 22:47:01 2014
config:

    NAME                                            STATE     READ WRITE CKSUM
    samsung                                         ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        ata-SAMSUNG_HD103SI_S1Y5J90Z427433          ONLINE       0     0     0
        ata-ST1000DM003-1ER162_S4Y048C4             ONLINE       0     0     0
    cache
      usb-SanDisk_Extreme_AA010127141448208023-0:0  OFFLINE      0     0     0

errors: No known data errors

 dedup: DDT entries 1737293, size 1859 on disk, 600 in core

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    1,51M    194G    162G    162G    1,51M    194G    162G    162G
     2     146K   18,2G   13,2G   13,3G     301K   37,6G   27,4G   27,6G
     4    1,05K    134M   86,8M   88,1M    4,82K    617M    384M    390M
     8       49   6,12M   1,97M   2,07M      493   61,6M   18,6M   19,6M
    16        7    896K    150K    168K      160     20M   2,54M   2,95M
    32        3    384K      3K     12K      132   16,5M    132K    528K
   128        1    128K      1K      4K      152     19M    152K    608K
   256        1    128K      1K      4K      256     32M    256K      1M
    1K        1    128K      1K      4K    1,24K    159M   1,24M   4,97M
 Total    1,66M    212G    175G    176G    1,81M    232G    189G    190G

# iostat 10 3
Linux 3.2.64-030264-generic (richter-lenovo-IdeaPad-Z500)   13.11.2014  _x86_64_    (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22,68    0,01    6,58   35,68    0,00   35,06

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              20,13       148,82       272,33    2437598    4460575
sdb              40,12        28,83      1809,43     472159   29637539
sdd             335,23      9922,39      5993,75  162524132   98174824
sde             414,71      1777,57         0,00   29115808          0
sdf              37,99      1441,05      1776,46   23603725   29097644
sdh             217,41      7017,52      4803,61  114943688   78680804
sdc               0,03         0,24         0,00       3933          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18,40    0,01    7,92   41,05    0,00   32,61

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               8,10         1,60       158,00         16       1580
sdb               0,00         0,00         0,00          0          0
sdd             448,70     14372,00      2068,80     143720      20688
sde               0,00         0,00         0,00          0          0
sdf               0,00         0,00         0,00          0          0
sdh             123,40      1007,60       556,40      10076       5564
sdc               0,00         0,00         0,00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17,09    0,00    5,57   48,77    0,00   28,57

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               9,80         0,00       259,20          0       2592
sdb               0,00         0,00         0,00          0          0
sdd             267,70     16404,40        25,60     164044        256
sde               0,00         0,00         0,00          0          0
sdf               0,00         0,00         0,00          0          0
sdh              79,00       535,60        25,60       5356        256
sdc               0,00         0,00         0,00          0          0

# cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 85 4080 7779920215 16487537062557
name                            type data
hits                            4    243728600
misses                          4    4822753
demand_data_hits                4    209242437
demand_data_misses              4    40426
demand_metadata_hits            4    33588618
demand_metadata_misses          4    943325
prefetch_data_hits              4    731369
prefetch_data_misses            4    3776559
prefetch_metadata_hits          4    166176
prefetch_metadata_misses        4    62443
mru_hits                        4    55615041
mru_ghost_hits                  4    323668
mfu_hits                        4    187237445
mfu_ghost_hits                  4    240920
deleted                         4    4879459
recycle_miss                    4    89105
mutex_miss                      4    5932
evict_skip                      4    1452505
evict_l2_cached                 4    75647101440
evict_l2_eligible               4    486206358528
evict_l2_ineligible             4    43334856704
hash_elements                   4    686460
hash_elements_max               4    850017
hash_collisions                 4    7553272
hash_chains                     4    192522
hash_chain_max                  4    13
p                               4    5344566772
c                               4    5628564864
c_min                           4    4194304
c_max                           4    8339431424
size                            4    5628486248
hdr_size                        4    128043648
data_size                       4    4760190976
meta_size                       4    531153920
other_size                      4    68732304
anon_size                       4    339646464
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    4897025536
mru_evict_data                  4    4374445568
mru_evict_metadata              4    448935936
mru_ghost_size                  4    623108608
mru_ghost_evict_data            4    318588928
mru_ghost_evict_metadata        4    304519680
mfu_size                        4    54672896
mfu_evict_data                  4    47055360
mfu_evict_metadata              4    519168
mfu_ghost_size                  4    3142035456
mfu_ghost_evict_data            4    2843678720
mfu_ghost_evict_metadata        4    298356736
l2_hits                         4    229041
l2_misses                       4    4539408
l2_feeds                        4    12449
l2_rw_clash                     4    107
l2_read_bytes                   4    436386816
l2_write_bytes                  4    30209695232
l2_writes_sent                  4    8573
l2_writes_done                  4    8573
l2_writes_error                 4    0
l2_writes_hdr_miss              4    66
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    125724
l2_abort_lowmem                 4    147
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    53856799744
l2_asize                        4    29510425600
l2_hdr_size                     4    140365400
l2_compress_successes           4    750806
l2_compress_zeros               4    0
l2_compress_failures            4    121692
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    10
memory_direct_count             4    2685
memory_indirect_count           4    541110
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    868295272
arc_meta_limit                  4    6254573568
arc_meta_max                    4    1914595304

# cat /proc/meminfo 
MemTotal:       16287952 kB
MemFree:          174196 kB
Buffers:            7480 kB
Cached:          6161460 kB
SwapCached:            0 kB
Active:          1401292 kB
Inactive:        5920764 kB
Active(anon):    1157396 kB
Inactive(anon):   368260 kB
Active(file):     243896 kB
Inactive(file):  5552504 kB
Unevictable:        3496 kB
Mlocked:            3496 kB
SwapTotal:       8041468 kB
SwapFree:        8041468 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:       1156612 kB
Mapped:           442508 kB
Shmem:            369996 kB
Slab:             201288 kB
SReclaimable:      65276 kB
SUnreclaim:       136012 kB
KernelStack:        7104 kB
PageTables:        39880 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    16185444 kB
Committed_AS:    4728136 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     8211824 kB
VmallocChunk:   34346241704 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      274432 kB
DirectMap2M:    16410624 kB

The average write is still slow for most of the time, but it is made up by longer and more frequent write bursts with more throughput leading to the much better performance and there's no more blocking of applications for both SIGINT and SIGSTOP. In the meantime I added a (very fast) USB stick as cache without knowing what I'm doing and it didn't lead to any improvements both for the low performance on 3.16 and the good performance on 3.2. The above output has been produced after successful zpool offline samsung <id>.

Some thoughts:

testing performance with different settings has to be made under same circumstances, so should you run one test with a freshly imported pool and the next with the same pool (without ex-/importing the pool first) your result will be skewed because of ARC contents already filled (lowering amount of reads needed to get metadata which leaves more room for writes because the drive(s) are not busy seeking for the reads)
a pool in mirror configuration will perform significantly better (compared to a single-platter one) while fetching metadata (=random read pressure), so it is no wonder that with two drives you have better results
i take it that the mirror consists of sdd and sdh (please provide the mappings to dev/sd* devices next time or import the pool using -d /dev to make the setup more visible) - and you might notice that sdh only processes 1/3 of the (read) tps compared to sdd? sdd 267,70 16404,40 25,60 164044 256 sdh 79,00 535,60 25,60 5356 256 So should sdh be the single drive on which you did your initial tests (without the mirror) then the difference would be explained...

So in case you could run the exactly same test

identical data source (best freshly imported, so cache is cold and reading the data takes the same time each time)
identical target dataset (rollback the target filesystem to the same snapshot)
freshly imported target pool (zpool export/import it, so ARC is empty and in the same state for all tests)

under the different kernel versions and get different results for performance then we might have interesting data points for the devs to look at.

Thanks again for your input. I understand the difficulty to retrieve a test case from my information. Therefore before getting into providing a reproducible scenario I'll check some other things (not as a sort of support request, but just to further isolate the issue from another perspective than the tests you suggested above - before I get started with them), because hardware is hard...

I became curious and changed the setup of the pool as good as I can (afford ;)) and made the slow device the second device of the samsung pool rather than the fast device mirror of the slow (which brought the write speed up to 210 MB/s average without compression and dedup) and bought a 500 GB SSD ata-Samsung_SSD_840_EVO_500GB_S1DKNWAFA00683Y which I added as cache (with default values). Then I recreated the pool by copying data to an ext4 filesystem and then back.

The thing is that now - and that's what I think requires your attention again, @GregorKopka - that I can turn dedup on and compression off and have 210 MB/s speed and dedup off and compression on and have 210 MB/s speed. Only turning on both dedup and compression turned on (compression to gzip and lz4 makes no difference) brings write speed breakdown again to 3 MB/s (the improvement in comparison to < 1 MS/s is due to the SSD cache, I guess). Doesn't that mean that the breakdown is due to the combination of dedup and compression rather than insufficient power of hardware (shouldn't it be rather trivial to compress the data after it has been deduped)?

@krichter722 I tried to reproduce your slowdown with dedup+compression, but failed.

System:

AMD Athlon(tm) II X2 260 Processor 3.2GHz, 16GB RAM 
Linux version 3.7.10-gentoo (root@system) (gcc version 4.6.3 (Gentoo Hardened 4.6.3 p1.11, pie-0.5.2) ) #1 SMP Thu Jul 17 10:18:59 CEST 2014
sys-kernel/spl-0.6.3, sys-fs/zfs-kmod-0.6.3, sys-fs/zfs-0.6.3-r2
Test pool: raidz over 5 2TB 7200RPM drives, no L2ARC.

I setup a fresh dataset, then copied a video file 10 times onto it using $ time for i in {01..10}; do dd bs=1M if=//file.avi of=//$i; done

compression=lz4, dedup=on

733722624 bytes (734 MB) copied, 0,647128 s, 1,1 GB/s
733722624 bytes (734 MB) copied, 3,25016 s, 226 MB/s
733722624 bytes (734 MB) copied, 10,3316 s, 71,0 MB/s
733722624 bytes (734 MB) copied, 8,62427 s, 85,1 MB/s
733722624 bytes (734 MB) copied, 2,84357 s, 258 MB/s
733722624 bytes (734 MB) copied, 7,27558 s, 101 MB/s
733722624 bytes (734 MB) copied, 9,30951 s, 78,8 MB/s
733722624 bytes (734 MB) copied, 4,22566 s, 174 MB/s
733722624 bytes (734 MB) copied, 5,9336 s, 124 MB/s
733722624 bytes (734 MB) copied, 9,01479 s, 81,4 MB/s
real    1m1.494s
user    0m0.010s
sys     0m6.660s

load of 8.5, nearly no I/O on disk after 15 seconds

compression=lz4:

733722624 bytes (734 MB) copied, 0,752108 s, 976 MB/s
733722624 bytes (734 MB) copied, 2,27194 s, 323 MB/s
733722624 bytes (734 MB) copied, 3,96534 s, 185 MB/s
733722624 bytes (734 MB) copied, 4,34413 s, 169 MB/s
733722624 bytes (734 MB) copied, 4,57099 s, 161 MB/s
733722624 bytes (734 MB) copied, 4,61166 s, 159 MB/s
733722624 bytes (734 MB) copied, 4,10418 s, 179 MB/s
733722624 bytes (734 MB) copied, 4,30493 s, 170 MB/s
733722624 bytes (734 MB) copied, 5,4572 s, 134 MB/s
733722624 bytes (734 MB) copied, 4,04915 s, 181 MB/s

real    0m38.586s
user    0m0.010s
sys     0m8.430s

saturated writes (lasting till 6 seconds after completion)

dedup=on:

733722624 bytes (734 MB) copied, 0,786477 s, 933 MB/s
733722624 bytes (734 MB) copied, 2,42359 s, 303 MB/s
733722624 bytes (734 MB) copied, 13,3954 s, 54,8 MB/s
733722624 bytes (734 MB) copied, 4,64372 s, 158 MB/s
733722624 bytes (734 MB) copied, 4,27365 s, 172 MB/s
733722624 bytes (734 MB) copied, 5,809 s, 126 MB/s
733722624 bytes (734 MB) copied, 7,66382 s, 95,7 MB/s
733722624 bytes (734 MB) copied, 8,54749 s, 85,8 MB/s
733722624 bytes (734 MB) copied, 1,98384 s, 370 MB/s
733722624 bytes (734 MB) copied, 7,40234 s, 99,1 MB/s

real    0m56.948s
user    0m0.030s
sys     0m6.680s

nearly no I/O on disks after 15 seconds

Vanilla dataset (no compression/dedup)

733722624 bytes (734 MB) copied, 0,698182 s, 1,1 GB/s
733722624 bytes (734 MB) copied, 2,42061 s, 303 MB/s
733722624 bytes (734 MB) copied, 3,53357 s, 208 MB/s
733722624 bytes (734 MB) copied, 4,48431 s, 164 MB/s
733722624 bytes (734 MB) copied, 4,09832 s, 179 MB/s
733722624 bytes (734 MB) copied, 3,68015 s, 199 MB/s
733722624 bytes (734 MB) copied, 4,13058 s, 178 MB/s
733722624 bytes (734 MB) copied, 4,29526 s, 171 MB/s
733722624 bytes (734 MB) copied, 3,5893 s, 204 MB/s
733722624 bytes (734 MB) copied, 4,37277 s, 168 MB/s

real    0m35.392s
user    0m0.040s
sys     0m8.200s

saturated writes (lasting till 8 seconds after completion)

So real time has been:

vanilla:    0m35.392s
lz4:        0m38.586s
dedup:      0m56.948s
dedup+lz4:  1m01.494s

Looks to me that the overhead of compression and dedup roughly stacks.

Granted: the data i wrote fits into RAM, was completely freshly written, DDT dito (so there was no need for zfs to seek on-disk to read).

Technically afaik dedup stage is after the data has been compressed (so only working on on-disk data).

openzfs / zfs

Write performance breakdown to 0.2 % of initial speed, I/O intensive processes blocking for up to 1 hour #2885