Closed krichter722 closed 5 years ago
@krichter722 see my comment in https://github.com/zfsonlinux/zfs/issues/2880, the problem is most likely the same as there.
Sorry, I just mixed up my own issues. #2880 relates to sparse files on ext4, double defragmentation and the overhead I described there. This issue is unrelated: I tested transfers within the same dataset, between different datasets on the same pool, and between pools, with the `compression` property set to `off`, `lz4` and `gzip`, and with `dedup` set to both `on` and `off`. `lz4_compress` is enabled on all pools. I ran `zpool export <name of sparse file pool>` and rebooted, but it has no effect; the write breakdown still occurs.
Could you please supply the contents of /proc/cpuinfo, /proc/meminfo and /proc/spl/kstat/zfs/arcstats, and the output of `iostat 10 3`, while the system is in the state you have a problem with?
Yes, of course, thanks for your interest! The following data corresponds to the state while copying ~1000 video files (to give you a clue about the compression ratio that can be expected), ~100 GB in total, from a 5TB ext4 partition (70 % full, covering the whole device, connected via USB 3.0) to the samsung pool (details below). Average performance is 4 MB/s (~400 KB/s over minutes, then bursts between 80 and 170 MB/s). The copy ran for 10 minutes and completed < 5 %.
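As a quick plausibility check of those figures (a back-of-the-envelope sketch, not part of the original report), the reported 4 MB/s average is indeed consistent with finishing less than 5 % of the batch in 10 minutes:

```shell
# At ~4 MB/s average, 10 minutes of copying moves about 2.3 GB,
# i.e. roughly 2-3 % of the ~100 GB batch -- consistent with "< 5 %".
awk 'BEGIN { gb = 4 * 600 / 1024; printf "%.1f GB in 10 min = %.1f%% of 100 GB\n", gb, gb }'
```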
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping : 9
microcode : 0x17
cpu MHz : 2201.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 4389.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping : 9
microcode : 0x17
cpu MHz : 1200.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 4389.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping : 9
microcode : 0x17
cpu MHz : 1600.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 1
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 4389.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping : 9
microcode : 0x17
cpu MHz : 1200.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 1
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 4389.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 4
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping : 9
microcode : 0x17
cpu MHz : 2201.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 2
cpu cores : 4
apicid : 4
initial apicid : 4
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 4389.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 5
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping : 9
microcode : 0x17
cpu MHz : 1500.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 2
cpu cores : 4
apicid : 5
initial apicid : 5
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 4389.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping : 9
microcode : 0x17
cpu MHz : 1200.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 4389.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
stepping : 9
microcode : 0x17
cpu MHz : 2201.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 4389.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
$ cat /proc/meminfo
MemTotal: 16286288 kB
MemFree: 210544 kB
Buffers: 9568 kB
Cached: 5795852 kB
SwapCached: 0 kB
Active: 1424244 kB
Inactive: 5393172 kB
Active(anon): 1014096 kB
Inactive(anon): 323556 kB
Active(file): 410148 kB
Inactive(file): 5069616 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 8041468 kB
SwapFree: 8041468 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 1011548 kB
Mapped: 277056 kB
Shmem: 325664 kB
Slab: 572056 kB
SReclaimable: 147304 kB
SUnreclaim: 424752 kB
KernelStack: 7056 kB
PageTables: 26744 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 16184612 kB
Committed_AS: 3362756 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 7459012 kB
VmallocChunk: 34350412096 kB
HardwareCorrupted: 0 kB
AnonHugePages: 186368 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 239608 kB
DirectMap2M: 16390144 kB
$ cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 85 4080 5687918837 2503840648841
name type data
hits 4 3962920
misses 4 455145
demand_data_hits 4 1732529
demand_data_misses 4 23009
demand_metadata_hits 4 2079048
demand_metadata_misses 4 197237
prefetch_data_hits 4 644
prefetch_data_misses 4 99135
prefetch_metadata_hits 4 150699
prefetch_metadata_misses 4 135764
mru_hits 4 1069300
mru_ghost_hits 4 2211
mfu_hits 4 2742421
mfu_ghost_hits 4 849
deleted 4 90231
recycle_miss 4 6269
mutex_miss 4 21
evict_skip 4 2753315
evict_l2_cached 4 0
evict_l2_eligible 4 5012334592
evict_l2_ineligible 4 152332288
hash_elements 4 443194
hash_elements_max 4 444351
hash_collisions 4 585133
hash_chains 4 129673
hash_chain_max 4 10
p 4 3401697280
c 4 4459393856
c_min 4 4194304
c_max 4 8338579456
size 4 4452018496
hdr_size 4 161985360
data_size 4 1916845568
meta_size 4 1826650624
other_size 4 546536944
anon_size 4 535326208
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 2825533440
mru_evict_data 4 1366906368
mru_evict_metadata 4 988877312
mru_ghost_size 4 1629555712
mru_ghost_evict_data 4 1616857600
mru_ghost_evict_metadata 4 12698112
mfu_size 4 382636544
mfu_evict_data 4 21542400
mfu_evict_metadata 4 194761728
mfu_ghost_size 4 2713171456
mfu_ghost_evict_data 4 2667750400
mfu_ghost_evict_metadata 4 45421056
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_hdr_miss 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_compress_successes 4 0
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 5
memory_direct_count 4 83
memory_indirect_count 4 6982
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 2535172928
arc_meta_limit 4 6253934592
arc_meta_max 4 2542331328
$ sudo zpool iostat samsung 10 3
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
samsung 164G 764G 172 536 839K 5,07M
samsung 164G 764G 81 24 187K 2,98M
samsung 164G 764G 70 16 165K 2,09M
I added a mirror to the samsung pool today (on a full device, so no sparse-file defragmentation impact) and the behavior is exactly the same:
$ sudo zpool status samsung
pool: samsung
state: ONLINE
scan: resilvered 54,0G in 0h46m with 0 errors on Tue Nov 11 12:12:44 2014
config:
NAME STATE READ WRITE CKSUM
samsung ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
usb-SAMSUNG_HD103SI_0123456789ABCD0000E9-0:0 ONLINE 0 0 0
usb-asmedia_ASMT1053_00000000000000000000-0:0 ONLINE 0 0 0
errors: No known data errors
My guess is your disks are busy seeking; `iostat` (instead of `zpool iostat`) would also show the per-drive load.
Could you also please post the output of `zpool history samsung` and `zpool status -D samsung`, and tell me how you copy the data (what kind of command you use)? Then I can get a clue what the pool went through, which properties are set, and estimate the impact of your using dedup.
I tried both moving and copying with different programs (because I don't know them in detail and hoped to get different results from different mechanisms, library functions, etc.), e.g. `mv /path/to/source/ /path/to/target/`, `cp -a /path/to/source /path/to/target` (in order to check for bad write performance during deletion in /path/to/source), and `rsync -a /path/to/source/ /path/to/target/` (both with and without `--remove-source-files`).
# iostat 10 3
Linux 3.16.0-24-generic (richter-lenovo-IdeaPad-Z500) 12.11.2014 _x86_64_(8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
5,81 0,02 4,82 28,63 0,00 60,71
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 77,30 938,09 640,37 1616713 1103616
sdc 83,58 5242,48 10,06 9034937 17344
sdd 280,35 1249,21 5211,12 2152896 8980904
avg-cpu: %user %nice %system %iowait %steal %idle
3,02 0,01 4,45 53,68 0,00 38,84
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 31,80 36,00 385,60 360 3856
sdc 0,30 0,00 1,60 0 16
sdd 340,60 935,20 13337,20 9352 133372
avg-cpu: %user %nice %system %iowait %steal %idle
2,68 0,29 4,44 54,15 0,00 38,44
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 44,50 260,00 186,00 2600 1860
sdc 0,00 0,00 0,00 0 0
sdd 338,00 908,40 13840,40 9084 138404
# zpool history samsung
History for 'samsung':
2014-11-10.07:31:16 zfs set dedup=on samsung
2014-11-10.07:31:25 zfs set compression=lz4 samsung
2014-11-11.06:29:23 zfs create -o recordsize=8K samsung/videos
2014-11-11.08:22:17 zpool scrub samsung
2014-11-11.13:27:59 zpool set feature@lz4_compress=enabled samsung
2014-11-11.13:28:21 zpool set feature@lz4_compress=enabled samsung
2014-11-11.13:30:05 zfs set compression=gzip samsung/videos
2014-11-11.13:54:19 zpool set feature@lz4_compress=enabled samsung
2014-11-11.14:45:25 zfs create samsung/no_compression
I experimented with a recordsize != 128K and with compression settings other than lz4 on the dataset samsung/videos, but that doesn't change anything (except that gzip consumes far more CPU than lz4 during the rare write bursts).
After(!) running the above commands and the operation that generated the provided output, I accidentally pulled the USB cable of the mirror of samsung, and it will probably take a day to resilver (200 GB!). I'll post the output now nevertheless; I guess it might be interesting (after 13 minutes of resilvering):
zpool status -D samsung
pool: samsung
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Wed Nov 12 19:10:13 2014
798M scanned out of 207G at 1,01M/s, 58h26m to go
798M resilvered, 0,38% done
config:
NAME STATE READ WRITE CKSUM
samsung ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-SAMSUNG_HD103SI_S1Y5J90Z427433 ONLINE 0 0 5 (resilvering)
ata-ST1000DM003-1ER162_S4Y048C4 ONLINE 0 0 0
errors: No known data errors
dedup: DDT entries 4891265, size 660 on disk, 213 in core
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 4,60M 197G 196G 196G 4,60M 197G 196G 196G
2 61,3K 7,67G 7,63G 7,63G 123K 15,3G 15,3G 15,3G
4 1 128K 75K 76K 4 512K 300K 304K
32 1 8K 512 4K 61 488K 30,5K 244K
128 1 128K 1K 4K 163 20,4M 163K 652K
Total 4,66M 204G 203G 204G 4,72M 212G 211G 211G
Looking at the DDT, I would say that activating dedup on the pool and setting an 8K recordsize are the reasons for your problems. Plus, USB drives have higher latency than directly connected drives.
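For a sense of scale, the dedup table's in-core cost can be estimated from the `zpool status -D` line above ("DDT entries 4891265, size 660 on disk, 213 in core"), assuming the "in core" figure is bytes per entry:

```shell
# Rough in-core DDT footprint: entries x bytes-per-entry, in MiB
# (assumption: "213 in core" means 213 bytes per DDT entry).
entries=4891265
bytes_per_entry=213
echo "$(( entries * bytes_per_entry / 1024 / 1024 )) MiB"
```

If that assumption holds, roughly 1 GB of ARC is consumed by DDT entries alone, which would add noticeable metadata pressure on a 16 GB machine.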
The zpool history listing doesn't go back to pool creation; maybe you have also set atime=on and sync=always?
Bottom line: the issue here is IMHO not a technical one. Please recreate the pool and try with default options to get a baseline for ZFS performance, so you can compare the effect of 'tunings'.
Thanks for your help. The dataset with the 8K recordsize was only created for testing purposes; the mentioned copy and move operations happened while moving and copying to the `samsung` pool itself (i.e. not to that specific dataset), which has a `recordsize` of 128K (not set explicitly, and therefore the default, as far as I know). I didn't change any properties besides `dedup` and `compression` on the pool:
# zfs get all samsung
NAME PROPERTY VALUE SOURCE
samsung type filesystem -
samsung creation Mo Nov 10 7:30 2014 -
samsung used 215G -
samsung available 706G -
samsung referenced 163G -
samsung compressratio 1.00x -
samsung mounted yes -
samsung quota none default
samsung reservation none default
samsung recordsize 128K default
samsung mountpoint /samsung default
samsung sharenfs off default
samsung checksum on default
samsung compression lz4 local
samsung atime on default
samsung devices on default
samsung exec on default
samsung setuid on default
samsung readonly off default
samsung zoned off default
samsung snapdir hidden default
samsung aclinherit restricted default
samsung canmount on default
samsung xattr on default
samsung copies 1 default
samsung version 5 -
samsung utf8only off -
samsung normalization none -
samsung casesensitivity sensitive -
samsung vscan off default
samsung nbmand off default
samsung sharesmb off default
samsung refquota none default
samsung refreservation none default
samsung primarycache all default
samsung secondarycache all default
samsung usedbysnapshots 0 -
samsung usedbydataset 163G -
samsung usedbychildren 52,2G -
samsung usedbyrefreservation 0 -
samsung logbias latency default
samsung dedup on local
samsung mlslabel none default
samsung sync standard default
samsung refcompressratio 1.00x -
samsung written 163G -
samsung logicalused 213G -
samsung logicalreferenced 163G -
samsung snapdev hidden default
samsung acltype off default
samsung context none default
samsung fscontext none default
samsung defcontext none default
samsung rootcontext none default
samsung relatime off default
I agree with your points on the compression of video files. My issue is that the breakdown to 500 KB/s write performance per se, and the blocking of applications for 1 hour, is an extreme consequence of a wrong usage of properties which can maybe be improved slightly. I wouldn't have opened this issue if performance were at 20 or 10 % of what I know the disk delivers with ext4 or btrfs; I would have investigated the configuration more deeply instead (by posting on the mailing list and probably getting the same good advice from you there).
After making no progress by adjusting properties, I found other data which allows me to confirm your points and isolate the issue. I copied 300 GB of a bup directory (> 100 files of 1 GB each; the program tries to avoid saving extraneous data). There are certain similarities to video files, but I succeeded in avoiding the breakdown by turning off both `compression` and `dedup` (which I didn't manage with the video files).
I think the problem stems from dedup exclusively; lz4 compression should not have an ill impact (at least I have neither noticed one on my systems nor heard about one on the mailing list).
Dedup could make sense, for example, when you have many near-identical VM disk images: by deduping them you get them (and all associated metadata) to fit (and stay, given the regular OS updates we sadly/gladly see these days) completely in ARC.
But the possibility of dedup backfiring performance-wise (like you experienced) is very real, so one should carefully test with real-world data and load before deploying. For example: ARC warmup time (and the associated I/O) after a reboot could lead to performance drops of several orders of magnitude compared to a hot system. On normal SOHO setups it IMHO doesn't make sense, given the low cost per TB of hard drives, so it's best to stay away from it and never activate it.
About the 8K thing: I thought you had used the /videos filesystem to copy the videos onto.
Off-topic: good practice is to not use the base filesystem of a pool directly, but only as a container to set properties on (so they are inherited by the child datasets).
Glad I could help; have a good time with ZFS on Linux.
I'm also glad you help, thanks again! Here's the thing now: today I needed to use a 3.2.x kernel (I installed 3.2.64 from ~last week) on the very same system. Guess the write performance with `dedup=on` and `compression=lz4` on the samsung pool? 60 MB/s average and 200 MB/s peaks (which I guess is very good for the small i7 processor)!! What now?! One thing to mention: the reason for the switch to 3.2.x was problems on 3.16 with the xhci (USB 3.0) driver (a kernel panic for a damaged device unrelated to the problem and detached during the described issue), but I don't believe that has an impact on write performance (the kernel would crash or something fatal would happen; besides, the system writes 180 MB/s to ext4 filesystems on 3.16 over USB 3.0). Output of the same commands as above (except `cpuinfo`, because I didn't touch the hardware):
# zpool status -D samsung
pool: samsung
state: ONLINE
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: resilvered 35,2G in 1h23m with 0 errors on Thu Nov 13 22:47:01 2014
config:
NAME STATE READ WRITE CKSUM
samsung ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-SAMSUNG_HD103SI_S1Y5J90Z427433 ONLINE 0 0 0
ata-ST1000DM003-1ER162_S4Y048C4 ONLINE 0 0 0
cache
usb-SanDisk_Extreme_AA010127141448208023-0:0 OFFLINE 0 0 0
errors: No known data errors
dedup: DDT entries 1737293, size 1859 on disk, 600 in core
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 1,51M 194G 162G 162G 1,51M 194G 162G 162G
2 146K 18,2G 13,2G 13,3G 301K 37,6G 27,4G 27,6G
4 1,05K 134M 86,8M 88,1M 4,82K 617M 384M 390M
8 49 6,12M 1,97M 2,07M 493 61,6M 18,6M 19,6M
16 7 896K 150K 168K 160 20M 2,54M 2,95M
32 3 384K 3K 12K 132 16,5M 132K 528K
128 1 128K 1K 4K 152 19M 152K 608K
256 1 128K 1K 4K 256 32M 256K 1M
1K 1 128K 1K 4K 1,24K 159M 1,24M 4,97M
Total 1,66M 212G 175G 176G 1,81M 232G 189G 190G
# iostat 10 3
Linux 3.2.64-030264-generic (richter-lenovo-IdeaPad-Z500) 13.11.2014 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
22,68 0,01 6,58 35,68 0,00 35,06
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 20,13 148,82 272,33 2437598 4460575
sdb 40,12 28,83 1809,43 472159 29637539
sdd 335,23 9922,39 5993,75 162524132 98174824
sde 414,71 1777,57 0,00 29115808 0
sdf 37,99 1441,05 1776,46 23603725 29097644
sdh 217,41 7017,52 4803,61 114943688 78680804
sdc 0,03 0,24 0,00 3933 0
avg-cpu: %user %nice %system %iowait %steal %idle
18,40 0,01 7,92 41,05 0,00 32,61
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 8,10 1,60 158,00 16 1580
sdb 0,00 0,00 0,00 0 0
sdd 448,70 14372,00 2068,80 143720 20688
sde 0,00 0,00 0,00 0 0
sdf 0,00 0,00 0,00 0 0
sdh 123,40 1007,60 556,40 10076 5564
sdc 0,00 0,00 0,00 0 0
avg-cpu: %user %nice %system %iowait %steal %idle
17,09 0,00 5,57 48,77 0,00 28,57
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 9,80 0,00 259,20 0 2592
sdb 0,00 0,00 0,00 0 0
sdd 267,70 16404,40 25,60 164044 256
sde 0,00 0,00 0,00 0 0
sdf 0,00 0,00 0,00 0 0
sdh 79,00 535,60 25,60 5356 256
sdc 0,00 0,00 0,00 0 0
# cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 85 4080 7779920215 16487537062557
name type data
hits 4 243728600
misses 4 4822753
demand_data_hits 4 209242437
demand_data_misses 4 40426
demand_metadata_hits 4 33588618
demand_metadata_misses 4 943325
prefetch_data_hits 4 731369
prefetch_data_misses 4 3776559
prefetch_metadata_hits 4 166176
prefetch_metadata_misses 4 62443
mru_hits 4 55615041
mru_ghost_hits 4 323668
mfu_hits 4 187237445
mfu_ghost_hits 4 240920
deleted 4 4879459
recycle_miss 4 89105
mutex_miss 4 5932
evict_skip 4 1452505
evict_l2_cached 4 75647101440
evict_l2_eligible 4 486206358528
evict_l2_ineligible 4 43334856704
hash_elements 4 686460
hash_elements_max 4 850017
hash_collisions 4 7553272
hash_chains 4 192522
hash_chain_max 4 13
p 4 5344566772
c 4 5628564864
c_min 4 4194304
c_max 4 8339431424
size 4 5628486248
hdr_size 4 128043648
data_size 4 4760190976
meta_size 4 531153920
other_size 4 68732304
anon_size 4 339646464
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 4897025536
mru_evict_data 4 4374445568
mru_evict_metadata 4 448935936
mru_ghost_size 4 623108608
mru_ghost_evict_data 4 318588928
mru_ghost_evict_metadata 4 304519680
mfu_size 4 54672896
mfu_evict_data 4 47055360
mfu_evict_metadata 4 519168
mfu_ghost_size 4 3142035456
mfu_ghost_evict_data 4 2843678720
mfu_ghost_evict_metadata 4 298356736
l2_hits 4 229041
l2_misses 4 4539408
l2_feeds 4 12449
l2_rw_clash 4 107
l2_read_bytes 4 436386816
l2_write_bytes 4 30209695232
l2_writes_sent 4 8573
l2_writes_done 4 8573
l2_writes_error 4 0
l2_writes_hdr_miss 4 66
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 125724
l2_abort_lowmem 4 147
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 53856799744
l2_asize 4 29510425600
l2_hdr_size 4 140365400
l2_compress_successes 4 750806
l2_compress_zeros 4 0
l2_compress_failures 4 121692
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 10
memory_direct_count 4 2685
memory_indirect_count 4 541110
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 868295272
arc_meta_limit 4 6254573568
arc_meta_max 4 1914595304
# cat /proc/meminfo
MemTotal: 16287952 kB
MemFree: 174196 kB
Buffers: 7480 kB
Cached: 6161460 kB
SwapCached: 0 kB
Active: 1401292 kB
Inactive: 5920764 kB
Active(anon): 1157396 kB
Inactive(anon): 368260 kB
Active(file): 243896 kB
Inactive(file): 5552504 kB
Unevictable: 3496 kB
Mlocked: 3496 kB
SwapTotal: 8041468 kB
SwapFree: 8041468 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 1156612 kB
Mapped: 442508 kB
Shmem: 369996 kB
Slab: 201288 kB
SReclaimable: 65276 kB
SUnreclaim: 136012 kB
KernelStack: 7104 kB
PageTables: 39880 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 16185444 kB
Committed_AS: 4728136 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 8211824 kB
VmallocChunk: 34346241704 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 274432 kB
DirectMap2M: 16410624 kB
The average write is still slow most of the time, but it is made up for by longer and more frequent write bursts with higher throughput, leading to the much better performance, and there is no more blocking of applications for either `SIGINT` or `SIGSTOP`. In the meantime I added a (very fast) USB stick as cache, without knowing what I was doing, and it didn't lead to any improvement, neither for the low performance on 3.16 nor for the good performance on 3.2. The above output was produced after a successful `zpool offline samsung <id>`.
Some thoughts:
sdd 267,70 16404,40 25,60 164044 256
sdh 79,00 535,60 25,60 5356 256
So should sdh be the single drive on which you did your initial tests (without the mirror), then the difference would be explained... So if you could run the exact same test under the different kernel versions and get different performance results, we might have interesting data points for the devs to look at.
Thanks again for your input. I understand the difficulty of deriving a test case from my information. Therefore, before providing a reproducible scenario, I'll check some other things (not as a sort of support request, but just to further isolate the issue from another perspective than the tests you suggested above, before I get started with them), because hardware is hard...
I became curious and changed the setup of the pool as far as I can (afford ;)): I made the slow device the second device of the `samsung` pool rather than having the fast device mirror the slow one (which brought the write speed up to 210 MB/s average without compression and dedup), and I bought a 500 GB SSD (ata-Samsung_SSD_840_EVO_500GB_S1DKNWAFA00683Y) which I added as cache (with default values). Then I recreated the pool by copying the data to an ext4 filesystem and back again.
The thing is that now - and that's what I think requires your attention again, @GregorKopka - I can turn dedup on with compression off and get 210 MB/s, or dedup off with compression on and get 210 MB/s. Only turning both dedup and compression on (`gzip` vs. `lz4` makes no difference) brings the write-speed breakdown back, down to 3 MB/s (the improvement compared to < 1 MB/s is due to the SSD cache, I guess). Doesn't that mean that the breakdown is due to the combination of dedup and compression rather than insufficient hardware (shouldn't it be rather trivial to compress the data after it has been deduped)?
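A minimal way to separate the two factors would be a four-way matrix test over fresh datasets. The sketch below only prints the commands instead of running them (the pool name `samsung` and the source path are placeholders), so each combination would write the identical data set:

```shell
#!/bin/sh
# Dry run of a four-way A/B test: every combination of compression and
# dedup on a fresh dataset. Prints the commands rather than executing
# them; pool name and source path are placeholders.
for comp in off lz4; do
  for dedup in off on; do
    ds="samsung/test_${comp}_${dedup}"
    echo "zfs create -o compression=${comp} -o dedup=${dedup} ${ds}"
    echo "time cp -a /path/to/source /${ds}/"
  done
done
```

Comparing the four wall-clock times on the same hardware and kernel should show whether the slowdown needs both features enabled or is dominated by one of them.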
@krichter722 I tried to reproduce your slowdown with dedup+compression, but failed.
System:
AMD Athlon(tm) II X2 260 Processor 3.2GHz, 16GB RAM
Linux version 3.7.10-gentoo (root@system) (gcc version 4.6.3 (Gentoo Hardened 4.6.3 p1.11, pie-0.5.2) ) #1 SMP Thu Jul 17 10:18:59 CEST 2014
sys-kernel/spl-0.6.3, sys-fs/zfs-kmod-0.6.3, sys-fs/zfs-0.6.3-r2
Test pool: raidz over 5 2TB 7200RPM drives, no L2ARC.
I set up a fresh dataset, then copied a video file 10 times onto it using
$ time for i in {01..10}; do dd bs=1M if=/
compression=lz4, dedup=on
733722624 bytes (734 MB) copied, 0,647128 s, 1,1 GB/s
733722624 bytes (734 MB) copied, 3,25016 s, 226 MB/s
733722624 bytes (734 MB) copied, 10,3316 s, 71,0 MB/s
733722624 bytes (734 MB) copied, 8,62427 s, 85,1 MB/s
733722624 bytes (734 MB) copied, 2,84357 s, 258 MB/s
733722624 bytes (734 MB) copied, 7,27558 s, 101 MB/s
733722624 bytes (734 MB) copied, 9,30951 s, 78,8 MB/s
733722624 bytes (734 MB) copied, 4,22566 s, 174 MB/s
733722624 bytes (734 MB) copied, 5,9336 s, 124 MB/s
733722624 bytes (734 MB) copied, 9,01479 s, 81,4 MB/s
real 1m1.494s
user 0m0.010s
sys 0m6.660s
load of 8.5, nearly no I/O on disk after 15 seconds
compression=lz4:
733722624 bytes (734 MB) copied, 0,752108 s, 976 MB/s
733722624 bytes (734 MB) copied, 2,27194 s, 323 MB/s
733722624 bytes (734 MB) copied, 3,96534 s, 185 MB/s
733722624 bytes (734 MB) copied, 4,34413 s, 169 MB/s
733722624 bytes (734 MB) copied, 4,57099 s, 161 MB/s
733722624 bytes (734 MB) copied, 4,61166 s, 159 MB/s
733722624 bytes (734 MB) copied, 4,10418 s, 179 MB/s
733722624 bytes (734 MB) copied, 4,30493 s, 170 MB/s
733722624 bytes (734 MB) copied, 5,4572 s, 134 MB/s
733722624 bytes (734 MB) copied, 4,04915 s, 181 MB/s
real 0m38.586s
user 0m0.010s
sys 0m8.430s
saturated writes (lasting till 6 seconds after completion)
dedup=on:
733722624 bytes (734 MB) copied, 0,786477 s, 933 MB/s
733722624 bytes (734 MB) copied, 2,42359 s, 303 MB/s
733722624 bytes (734 MB) copied, 13,3954 s, 54,8 MB/s
733722624 bytes (734 MB) copied, 4,64372 s, 158 MB/s
733722624 bytes (734 MB) copied, 4,27365 s, 172 MB/s
733722624 bytes (734 MB) copied, 5,809 s, 126 MB/s
733722624 bytes (734 MB) copied, 7,66382 s, 95,7 MB/s
733722624 bytes (734 MB) copied, 8,54749 s, 85,8 MB/s
733722624 bytes (734 MB) copied, 1,98384 s, 370 MB/s
733722624 bytes (734 MB) copied, 7,40234 s, 99,1 MB/s
real 0m56.948s
user 0m0.030s
sys 0m6.680s
nearly no I/O on disks after 15 seconds
Vanilla dataset (no compression/dedup)
733722624 bytes (734 MB) copied, 0,698182 s, 1,1 GB/s
733722624 bytes (734 MB) copied, 2,42061 s, 303 MB/s
733722624 bytes (734 MB) copied, 3,53357 s, 208 MB/s
733722624 bytes (734 MB) copied, 4,48431 s, 164 MB/s
733722624 bytes (734 MB) copied, 4,09832 s, 179 MB/s
733722624 bytes (734 MB) copied, 3,68015 s, 199 MB/s
733722624 bytes (734 MB) copied, 4,13058 s, 178 MB/s
733722624 bytes (734 MB) copied, 4,29526 s, 171 MB/s
733722624 bytes (734 MB) copied, 3,5893 s, 204 MB/s
733722624 bytes (734 MB) copied, 4,37277 s, 168 MB/s
real 0m35.392s
user 0m0.040s
sys 0m8.200s
saturated writes (lasting till 8 seconds after completion)
So real time has been:
vanilla: 0m35.392s
lz4: 0m38.586s
dedup: 0m56.948s
dedup+lz4: 1m01.494s
It looks to me like the overhead of compression and dedup roughly stacks.
Granted: the data I wrote fit into RAM and was completely freshly written, DDT ditto (so there was no need for ZFS to seek on disk for reads).
Technically, AFAIK, the dedup stage comes after the data has been compressed (so it only works on the on-disk representation).
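Plugging the wall-clock times reported above into a quick check shows how closely the overheads stack:

```shell
# Overheads relative to the vanilla run, using the "real" times (seconds)
# from the four test runs above.
awk 'BEGIN {
  vanilla = 35.392; lz4 = 38.586; dedup = 56.948; both = 61.494
  printf "lz4 overhead:   %.2f s\n", lz4 - vanilla
  printf "dedup overhead: %.2f s\n", dedup - vanilla
  printf "sum: %.2f s vs measured combined: %.2f s\n",
         (lz4 - vanilla) + (dedup - vanilla), both - vanilla
}'
```

The summed individual overheads (~24.8 s) come close to the measured combined overhead (~26.1 s), supporting the "roughly stacks" reading.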
I've regularly experienced a breakdown of write performance to ~0.2 % of the initial speed for different vdevs (sparse files on an ext4 filesystem as well as a zpool on an entire disk) after the device performed incredibly well at the beginning. The symptoms are identical for all vdevs:
- `mv`, `cp`, `rsync` and `nautilus` were used for the file transfer.
- Aborting the transfer (`SIGINT` to `mv`, `cp`, `rsync`, or pressing the abort button in `nautilus`) causes the processes to stay in state "uninterruptible", blocking for up to 1 hour while finishing the write (even for 1 GB of data transferred in 5 seconds in the best-case scenario I experienced). The same happens from file to file after a transfer has almost completed (`rsync` and `nautilus` show 100 % completion, which might be 99.999 %). Right before the process returns (successfully or with return code != 0, depending on whether it was aborted), write performance is back to normal, i.e. the last 200 MB are usually written in a burst/peak after this 1 hour of starving at < 1 MB/s read/write.

After initializing the pool I copied 400 GB of a 1TB file at 200 MB/s (high CPU and memory usage, but incredibly great performance); then a breakdown occurred.
Further information
- `data` and `rpool1` are on partitions on the same disk, `samsung` is on a SAMSUNG HD103SI 1TB HDD connected via USB 3.0, and `wd2` is a 5TB sparse file on an ext4 filesystem on a WD WD50EFRX 5TB HDD, connected via USB 3.0 as well. The pool on the sparse file might cause issues, but it isn't involved directly as transfer source or target (see #2880 for directly related issues).
- I'm using 0.6.3 on Ubuntu 14.10 amd64 with Linux 3.16.7, and tested with 3.16.0-24-generic as well.