openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

High zvol utilization - read performance #3888

Closed - zielony360 closed this issue 5 years ago

zielony360 commented 9 years ago

Hello,

I am using zfs 0.6.4.2 on kernel 4.0.4. I share zvols through an FC target. The pool: 5 x raidz2 (10 disks each). Pool space usage: 29%.

# zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
backup   136T  40,5T  95,8T         -    24%    29%  1.00x  ONLINE  -

The problem is read performance, which used to be much higher. Now, while copying a file from one LUN to another, zvol utilization is high and read throughput is low:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
zd0               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
zd16              0,00     0,00  108,00    0,00    54,00     0,00  1024,00     4,60   42,19   42,19    0,00   8,70  94,00
zd32              0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
zd48              0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
zd64              0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
zd80              0,00     0,00    0,00   64,00     0,00    32,00  1024,00     0,05    0,75    0,00    0,75   0,12   0,80

The physical disks (3 TB 7200 RPM) are not overloaded:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0              0,00     0,00   71,00   70,00     1,11     1,54    38,46     0,96    6,84   10,70    2,91   2,21  31,20
dm-1              0,00     0,00   71,00   64,00     1,11     1,54    40,14     0,88    6,52   10,59    2,00   1,87  25,20
dm-2              0,00     0,00   73,00  153,00     1,13     2,14    29,56     1,48    6,53   18,14    0,99   1,59  36,00
dm-3              0,00     0,00   52,00  127,00     0,81     1,68    28,52     0,88    3,66    8,31    1,76   2,59  46,40
dm-4              0,00     0,00   76,00  138,00     1,17     2,14    31,73     0,69    3,21    6,89    1,19   1,29  27,60
dm-5              0,00     0,00   83,00   57,00     1,30     1,89    46,61     0,59    4,20    6,46    0,91   1,60  22,40
dm-6              0,00     0,00   53,00  119,00     0,83     1,68    29,85     0,75    3,93    7,47    2,35   2,37  40,80
dm-7              0,00     0,00   81,00   67,00     1,23     1,89    43,18     0,72    4,84    8,49    0,42   1,81  26,80
dm-8              0,00     0,00   68,00   65,00     1,06     1,54    40,06     0,43    3,25    4,82    1,60   1,47  19,60
dm-9              0,00     0,00   78,00   62,00     1,20     1,89    45,26     0,67    4,80    8,26    0,45   1,80  25,20
dm-10             0,00     0,00   67,00   73,00     1,05     1,54    37,77     0,46    3,26    5,61    1,10   1,14  16,00
dm-11             0,00     0,00   81,00   76,00     1,25     1,89    40,92     0,94    5,96   10,42    1,21   2,32  36,40
dm-12             0,00     0,00   66,00   65,00     1,03     1,54    40,20     0,80    6,14   10,00    2,22   1,92  25,20
(...)
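
(For reference, the per-device numbers above look like extended iostat output; assuming something like the following was used, the same comparison can also be made per vdev with zpool iostat:)

iostat -xm 1               # extended per-device statistics in MB/s, refreshed every second
zpool iostat -v backup 1   # per-vdev throughput for the pool, refreshed every second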

arcstat:

# arcstat.py 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
13:50:53     0     0      0     0    0     0    0     0    0    62G   62G  
13:50:54   573   535     93   535   93     0    0     5   20    62G   62G  
13:50:55   643   610     94   610   94     0    0     4   22    62G   62G  
13:50:56   769   742     96   742   96     0    0     6  100    62G   62G  
13:50:57   894   841     94   841   94     0    0     7   24    62G   62G  
13:50:58   562   513     91   513   91     0    0     3   18    62G   62G  
13:50:59   492   452     91   452   91     0    0     4   14    62G   62G  
13:51:00   845   805     95   805   95     0    0     5   27    62G   62G  
13:51:01   483   449     92   449   92     0    0     1    5    62G   62G  
13:51:02   744   705     94   705   94     0    0     1    6    62G   62G  
13:51:03   420   387     92   387   92     0    0     3   20    62G   62G  
13:51:04   631   597     94   597   94     0    0     5   27    62G   62G  
13:51:05   499   470     94   470   94     0    0     4   22    62G   62G  
13:51:06   536   499     93   499   93     0    0     5   20    62G   62G  
13:51:07   612   582     95   582   95     0    0     6   35    62G   62G  
13:51:08  1.8K   677     36   677   36     0    0     5   26    62G   62G  
13:51:09   666   645     96   645   96     0    0     5  100    62G   62G 

Writes are flushed by NTFS (which sits on the zvols) every 6 seconds.

Can you suggest some parameter tuning, or does the problem lie deeper?

Best regards, Tomasz Charoński

kernelOfTruth commented 9 years ago

Possibly related to https://github.com/zfsonlinux/zfs/issues/3871 (Highly uneven IO size during write on zvol)?

CC: @ryao

@zielony360 are you observing this behavior right from the start, immediately after booting the server?

Any special configuration underneath the zvols or zpool?

cryptsetup?

zielony360 commented 9 years ago

@kernelOfTruth It doesn't seem to be related to #3871, because the workload is distributed quite evenly across the physical disks (see the iostat output for the dm- devices above). Regarding #3871: all of them are set as rotational.

uptime:

# uptime
 14:13:16 up 53 days, 20 min,  1 user,  load average: 5,31, 6,79, 6,90

So the problem isn't related to metadata having to be read back into the ARC after a reboot.

Pool configuration: built on SAS disks in 2 JBODs, with round-robin multipath to the physical disks. lz4 compression is enabled. Changed ZFS parameters:

options zfs zfs_vdev_async_write_active_min_dirty_percent=15
options zfs zfs_vdev_async_write_active_max_dirty_percent=40
options zfs zfs_delay_min_dirty_percent=40
options zfs zfs_metaslab_fragmentation_threshold=80
options zfs zfs_mg_fragmentation_threshold=90
options zfs zfs_delay_scale=4000000
options zfs zio_delay_max=60000
options zfs zfs_top_maxinflight=128
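
(These options normally live in a modprobe configuration file such as /etc/modprobe.d/zfs.conf; most of them can also be inspected and changed on a running system through /sys/module/zfs/parameters, for example:)

# read the current value of one tunable
cat /sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent
# change it at runtime (takes effect immediately, is lost on reboot)
echo 15 > /sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent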

I was observing a very high pm% in arcstat, so I have now disabled prefetching, but it didn't help.

All zvols have a 128k volblocksize, because the operations generally use large blocks (as in this case). The compression ratio is 1.5x to 2.0x. All zvols are sparse.

I also set zfs_vdev_async_read_max_active to 10 - no change.

cryptsetup - no.

There is no problem with writes, which go up to 1 GB/s.

Best regards, Tomasz Charoński

kernelOfTruth commented 9 years ago

which generally was much greater, but now during copying file from one LUN to the another there is high zvol utilization causing low read throughput:

in reference to what state ?

what did change ?

I am using zfs 0.6.4.2 on kernel 4.0.4

was that supposed to mean

I am using zfs 0.6.5.2 on kernel 4.0.4

or are you really running 0.6.4.2 ?

0.6.5.2 has major improvements related to zvols: https://github.com/zfsonlinux/zfs/releases

Improved zvol performance on all kernels (>50% higher throughput, >20% lower latency)

which could help with your problem

zielony360 commented 9 years ago

@kernelOfTruth

in reference to what state ?

When I was testing before the production period, reading with dd with a big block size, it was hundreds of GBs.

what did change ?

If you mean the configuration, nothing has changed.

or are you really running 0.6.4.2 ?

Yes, I am using 0.6.4.2.

0.6.5.2 has major improvements related to zvols: https://github.com/zfsonlinux/zfs/releases

I know, but I have seen many kernel panics reported here in Issues, even on 0.6.5.2, so I am waiting for bugfixes. :] This is a production backup system, so I cannot afford crashes.

kernelOfTruth commented 9 years ago

@zielony360 alright :)

Could you please post output of

cat /proc/meminfo

and

cat /proc/spl/kstat/zfs/arcstats

That might give some more diagnostic information.

If the others don't have any quick ideas on how to improve this,

the IRC channel might be your best bet - perhaps someone there has encountered similar behavior.

Thanks

zielony360 commented 9 years ago

# cat /proc/meminfo 
MemTotal:       132005272 kB
MemFree:        53571392 kB
MemAvailable:   54253168 kB
Buffers:          314672 kB
Cached:          2605332 kB
SwapCached:            0 kB
Active:          2041444 kB
Inactive:        1037044 kB
Active(anon):    1494084 kB
Inactive(anon):   933092 kB
Active(file):     547360 kB
Inactive(file):   103952 kB
Unevictable:       17244 kB
Mlocked:           17244 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               104 kB
Writeback:             0 kB
AnonPages:        175736 kB
Mapped:            99132 kB
Shmem:           2265020 kB
Slab:           32763416 kB
SReclaimable:     368336 kB
SUnreclaim:     32395080 kB
KernelStack:       20256 kB
PageTables:         6088 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    66002636 kB
Committed_AS:    3810368 kB
VmallocTotal:   34359738367 kB
VmallocUsed:    42517616 kB
VmallocChunk:   34221413700 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      347452 kB
DirectMap2M:     8007680 kB
DirectMap1G:    127926272 kB
# cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 86 4128 14764790262 4583395777289310
name                            type data
hits                            4    2061837747
misses                          4    931645936
demand_data_hits                4    1639776797
demand_data_misses              4    658114871
demand_metadata_hits            4    354773563
demand_metadata_misses          4    21667444
prefetch_data_hits              4    58288956
prefetch_data_misses            4    248236916
prefetch_metadata_hits          4    8998431
prefetch_metadata_misses        4    3626705
mru_hits                        4    963264829
mru_ghost_hits                  4    3971979
mfu_hits                        4    1033863118
mfu_ghost_hits                  4    59056799
deleted                         4    1612576134
recycle_miss                    4    6110122
mutex_miss                      4    329739
evict_skip                      4    72342052
evict_l2_cached                 4    0
evict_l2_eligible               4    193158135966208
evict_l2_ineligible             4    24420786919424
hash_elements                   4    4982568
hash_elements_max               4    4984279
hash_collisions                 4    470209837
hash_chains                     4    608747
hash_chain_max                  4    7
p                               4    63335796736
c                               4    67586699264
c_min                           4    4194304
c_max                           4    67586699264
size                            4    67584032792
hdr_size                        4    1811491320
data_size                       4    35206856704
meta_size                       4    29807262208
other_size                      4    758422560
anon_size                       4    114216960
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    63219375104
mru_evict_data                  4    35090857984
mru_evict_metadata              4    27720315904
mru_ghost_size                  4    4382785536
mru_ghost_evict_data            4    4382785536
mru_ghost_evict_metadata        4    0
mfu_size                        4    1680526848
mfu_evict_data                  4    0
mfu_evict_metadata              4    1478018560
mfu_ghost_size                  4    63235949568
mfu_ghost_evict_data            4    33753268224
mfu_ghost_evict_metadata        4    29482681344
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    1572864
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    32377176088
arc_meta_limit                  4    50690024448
arc_meta_max                    4    36014735824

@kernelOfTruth Thank you for getting involved. ;-)

kernelOfTruth commented 9 years ago

@zielony360 sure :)

The next step I would test would be setting primarycache=metadata - whether that would improve performance is another question ;)
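
(A minimal sketch of that test, assuming the pool is named backup as in the zpool list output above; the property can also be set per zvol and reverted at any time:)

zfs set primarycache=metadata backup   # cache only metadata in the ARC, pool-wide
zfs get -r primarycache backup         # check what each zvol inherited
zfs set primarycache=all backup        # revert to the default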

If there's lots of copying (e.g. rsync) involved, disabling data caching isn't recommended for most cases according to several sources I've read (only in special cases could it improve performance or reduce the amount of data that needs to be read).

But your box is clearly better outfitted (132 GB ECC RAM) than mine (32 GB) and your I/O load is also much higher - so other approaches may be needed ...

If lots of data is transferred in a "one-shot" way, there are two potential "helpers":

https://github.com/Feh/nocache

and

POSIX_FADV_DONTNEED: http://insights.oetiker.ch/linux/fadvise/ , https://tobi.oetiker.ch/patches/ (rsync with the --drop-cache option)

after that I'm out of ideas :P (well at least for now)

zielony360 commented 9 years ago

@kernelOfTruth

The next step I would test would be setting primarycache=metadata - whether that would improve performance is another question ;)

Are you sure caching only metadata would be good, looking at the ARC statistics? We can see there are a lot of data hits, more than metadata hits. We would lose data caching (we don't have an L2ARC).

Unfortunately, we can't use those solutions, because the zvols are used by Windows machines. Normally Veeam runs there, but right now we are migrating some backups from one LUN to another at the NTFS level (a basic copy/paste). Veeam also has bad read performance while restoring VMs, so it may be related.

kernelOfTruth commented 9 years ago

@zielony360

Are you sure caching only metadata would be good, looking at the ARC statistics? We can see there are a lot of data hits, more than metadata hits. We would lose data caching (we don't have an L2ARC).

whether that would improve performance is another question ;)

No - that's just what I would do to gain experience with how ZFS behaves (obviously I don't have that much experience with its operation and internals yet; also, mine is a home desktop / power-user workstation, so the after-effects would be negligible - servers of course have much higher requirements for reliability and consistency of operation),

hence the pointer to IRC.

My chain of thought was that the issue could be related to zvols (-> 0.6.5.2? pre-0.6.5.y?), metadata (caching? rsync? the ARC fighting with the page cache?) and potentially other things (now that you mention it again: NTFS?).

If you're using NTFS with the ntfs-3g (FUSE) driver, make sure to look into the big_writes, max_readahead and other options.

Sorry if I'm potentially pointing you in the wrong direction.

ryao commented 9 years ago

You probably are seeing read-modify-write overhead on zvols: NTFS uses a 4KB default cluster size while ZFS zvols default to volblocksize=8K. Either use zvols created with volblocksize=4K or format NTFS to use a cluster size of 8KB and performance should improve. The volblocksize=8K default is an artifact from SPARC, where the memory page is 8KB.
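
(A sketch of the first option, with a hypothetical zvol name and size; volblocksize can only be set at creation time, so an existing zvol would have to be recreated, and the alternative is simply to reformat the NTFS volume with an 8 KB cluster size on the Windows side:)

zfs create -s -V 10T -o volblocksize=4K backup/lun-new   # new sparse zvol with 4K blocks (name and size are examples)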

zielony360 commented 9 years ago

@ryao

You probably are seeing read-modify-write overhead on zvols: NTFS uses a 4KB default cluster size while ZFS zvols default to volblocksize=8K. Use zvols created with volblocksize=4K and performance should improve.

No. Look at the iostat output in my first post. One LUN is only reading and the other only writing - a simple copy/paste. It uses a 512 kB I/O size (avgrq-sz / 2). So the block size is not the issue here.

@kernelOfTruth

and potentially other things (now that you mention it again: NTFS?) If you're using NTFS with the ntfs-3g (FUSE) driver, make sure to look into the big_writes, max_readahead and other options.

Yes, it is NTFS on Windows Server 2012 accessed through the FC target (I mentioned that already); there is no NTFS on Linux.

There is no rsync anywhere. Simple Windows copy/paste in this case.

@kernelOfTruth, @ryao The problem is, IMO, in the ZFS stack, because zvol utilization is almost 100% while disk utilization is about 50%. The CPU is not the bottleneck either - 2 x 10-core E5-2690 v2 @ 3.00GHz, so 20 cores / 40 threads. The disks are split across 2 JBODs (each with 2 SAS expanders inside), 2 HBAs, and 4 x 6Gb/s SAS cables.

ryao commented 9 years ago

@zielony360 I suggest running perf top on this system (the right way to do what I have in mind is to do perf record -F 997 -ag and generate a flame graph, but perf top should be sufficient). You should see the majority of the time being spent in spinlocks - specifically ->zv_lock, though perf top won't show that much detail. The time spent in spinlocks is due to zvols being implemented in older versions of ZoL with a spinlock-protected request queue, like physical hardware uses. This has the effect of hurting latency and throughput, and the more concurrent a system is, the worse this effect becomes. It should be far worse on a 40-thread system than it was on my 8-thread development system.
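
(For reference, a flame graph from such a recording can be generated with Brendan Gregg's FlameGraph scripts; the script paths below are assumptions:)

perf record -F 997 -ag -- sleep 30              # sample all CPUs at 997 Hz with call graphs for 30 seconds
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded   # scripts from https://github.com/brendangregg/FlameGraph
./flamegraph.pl out.folded > zvol-read.svg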

You could try backporting the appropriate commits to 0.6.4.2 to deal with this:

https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140
https://github.com/zfsonlinux/zfs/commit/fa56567630cfad95b77dee507595d77f24e99cb9
https://github.com/zfsonlinux/zfs/commit/8198d18ca7e05ec30139e10d658a8b06c0ec4c55
https://github.com/zfsonlinux/zfs/commit/f52ebcb3eb1630694648237ad6d439c226c3dd99 (very important)

The first does not apply cleanly to 0.6.4, but the conflicts are trivial to handle. I took the liberty of doing that here:

https://github.com/ryao/zfs/tree/zfs-0.6.4.2-zvol
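
(A rough sketch of building that branch, assuming the usual ZoL autotools flow and that a matching SPL 0.6.4.2 is already installed:)

git clone https://github.com/ryao/zfs.git
cd zfs
git checkout zfs-0.6.4.2-zvol
./autogen.sh && ./configure && make -j$(nproc)
sudo make install   # or build native packages first (e.g. make deb / make rpm) and install those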

ryao commented 9 years ago

@zielony360 Did you try the backports?

zielony360 commented 8 years ago

@ryao @kernelOfTruth Thank you for your effort. A few days ago I upgraded to 0.6.5.7. The latency on the zvols is similar to before, but overall performance is better. Read latency is still about 50-100 ms during operations on each zvol. I believe it may be related to the high metadata miss rate:

hits                            4    135093787
misses                          4    68967196
demand_data_hits                4    130136993
demand_data_misses              4    60022622
demand_metadata_hits            4    4956794
demand_metadata_misses          4    8944561

So only around a 50% metadata hit rate at best. This is not a problem of insufficient memory:

arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1239474008
arc_meta_limit                  4    50690024448
arc_meta_max                    4    2451519088
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    2112081920

I think metadata is being evicted too aggressively (zfs_arc_meta_strategy is 1). Is there a way to tune this so that I end up with a better metadata hit rate? primarycache=metadata is a last resort for me because, as you can see above, I get a 68% data hit rate from the ARC. I have neither an L2ARC nor a SLOG.
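
(One knob that exists in 0.6.5.x for this is zfs_arc_meta_min, which sets a floor below which metadata is not evicted; a cautious sketch, where the 8 GiB figure is only an example:)

# reserve roughly 8 GiB of the ARC for metadata so it cannot all be evicted in favour of data
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_meta_min
# to make it persistent, add to /etc/modprobe.d/zfs.conf:
# options zfs zfs_arc_meta_min=8589934592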

Currently tuned ZFS parameters:

options zfs zfs_vdev_async_write_active_min_dirty_percent=15
options zfs zfs_vdev_async_write_active_max_dirty_percent=40
options zfs zfs_delay_min_dirty_percent=40
options zfs zfs_metaslab_fragmentation_threshold=80
options zfs zfs_mg_fragmentation_threshold=90
options zfs zfs_delay_scale=4000000
options zfs zio_delay_max=60000
options zfs zfs_prefetch_disable=1

# zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
backup   136T  69,6T  66,6T         -    35%    51%  1.00x  ONLINE  -

lz4 compression with a 1.5x ratio on the pool. We are using only zvols, exported through the FC target.

dweeezil commented 8 years ago

@zielony360 Looking at your initial iostat output, I'm curious as to how you managed to get an avgrq-sz of 1024 (512KiB). Normally it will be capped at the block device's maximum readahead size, which defaults to 128KiB. Did you do a blockdev --setra on your zvols? What do you see in, for example, cat /sys/block/zd16/queue/read_ahead_kb? I'm asking because I wonder if the large request size could be contributing to the latency.
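
(For reference, a quick way to check and adjust it, using zd16 as the example device:)

blockdev --getra /dev/zd16                # current readahead in 512-byte sectors
cat /sys/block/zd16/queue/read_ahead_kb   # same value expressed in KiB
blockdev --setra 256 /dev/zd16            # e.g. set it to 128 KiB (256 sectors)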

zielony360 commented 8 years ago

@dweeezil I haven't changed the readahead. But is read_ahead_kb the parameter that decides the maximum I/O size? I didn't think that a size that is too big, and gets split up later, would hurt performance.

:/sys/block/zd32/queue# for i in *; do echo -n "$i: "; cat $i; done
add_random: 0
discard_granularity: 131072
discard_max_bytes: 2147483648
discard_zeroes_data: 0
hw_sector_size: 512
iostats: 0
logical_block_size: 512
max_hw_sectors_kb: 16384
max_integrity_segments: 0
max_sectors_kb: 16384
max_segments: 65535
max_segment_size: 4294967295
minimum_io_size: 131072
nomerges: 0
nr_requests: 128
optimal_io_size: 131072
physical_block_size: 131072
read_ahead_kb: 128
rotational: 0
rq_affinity: 0
scheduler: none
write_same_max_bytes: 0

targetcli attributes of all zvols:

            block_size 512 
            emulate_3pc yes 
            emulate_caw yes 
            emulate_dpo no 
            emulate_fua_read no 
            emulate_fua_write yes 
            emulate_model_alias no 
            emulate_rest_reord no 
            emulate_tas yes 
            emulate_tpu yes 
            emulate_tpws no 
            emulate_ua_intlck_ctrl no 
            emulate_write_cache no 
            enforce_pr_isids yes 
            fabric_max_sectors 8192 
            force_pr_aptpl 0 
            is_nonrot yes 
            max_unmap_block_desc_count 1 
            max_unmap_lba_count 65535 
            max_write_same_len 65535 
            optimal_sectors 20480 
            pi_prot_format 0 
            pi_prot_type 0 
            queue_depth 16 
            unmap_granularity 256 
            unmap_granularity_alignment 0 
        }

dweeezil commented 8 years ago

@zielony360 What's your storage stack like? In particular, what's actually using the zvols? You mentioned NTFS. Is it ntfs-3g? Or something else?

dweeezil commented 8 years ago

@zielony360 Oh, I think I see. You're exporting the zvols as iSCSI targets. Is the NTFS access from regular Windows clients?

zielony360 commented 8 years ago

@dweeezil No, we are exporting the zvols through Fibre Channel to Windows, which has its NTFS on them. They are used for Veeam backups. So generally large (about 512 kB) I/Os, 70% async writes, 30% reads.

dweeezil commented 8 years ago

@zielony360 Got it. Do you see the high read latency for read-only workloads? Or does it require a mix of reads and writes?

zielony360 commented 8 years ago

@dweeezil Currently we are copying a file on NTFS from one LUN to another. Afterwards we will use this setup for a read-only latency test, but look at it now:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,18    0,00    1,39    3,31    0,00   95,12

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
zd16              0,00     0,00  112,28    0,00    56,03     0,00  1022,02     4,94   43,95   43,95    0,00   8,36  93,89
zd32              0,00     0,00    0,00  111,30     0,00    55,24  1016,42     0,26    2,30    0,00    2,30   0,95  10,55

Those are the only I/Os on the pool at the moment. High latency on reads, low latency on writes.

arcstat:

    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
13:51:27  1.0K   648     63   648   63     0    0    70   33    63G   62G
14:01:27  1.1K   667     60   667   60     0    0    75   35    62G   62G
14:11:27  1.1K   775     72   775   72     0    0    42   34    62G   62G
14:21:28   862   660     76   660   76     0    0    12   39    62G   62G
14:31:28   600   475     79   475   79     0    0     9   38    62G   62G
14:41:28   611   485     79   485   79     0    0     9   37    63G   62G
14:51:28   646   455     70   455   70     0    0    22   50    62G   62G
14:57:20   346   241     69   241   69     0    0    14   55    62G   62G
14:57:20     0     0    100     0  100     0    0     0    0    62G   62G
15:02:42   295   235     79   235   79     0    0     9   52    63G   62G
15:02:42     0     0    100     0  100     0    0     0    0    63G   62G 

You can see the metadata miss rate is generally significant.

arcstats fragment:

arc_prune                       4    0
arc_meta_used                   4    848598744
arc_meta_limit                  4    50690024448
arc_meta_max                    4    2752493288
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    2112081920

dweeezil commented 8 years ago

@zielony360 Ugh, the perils of sticking my nose into a bunch of similar issues at once: I had totally overlooked the fact that your iostats were on the zvol devices and not on the underlying vdev disks. Your earlier arcstats showed a whole lot more metadata being cached:

data_size                       4    35206856704
meta_size                       4    29807262208
other_size                      4    758422560

I presume the last bit of output you posted (the one with arc_meta_used 848598744) is from shortly after a reboot? Do you know if the balance between data_size and meta_size was different under 0.6.4.2? Possibly caching a lot more metadata?

zielony360 commented 8 years ago

@dweeezil Yeah, the iostats with zd* devices are the zvols. The thing is that the zvols are at 100% utilization during reads, but the physical disks are not. That's why I think metadata misses may be the reason.

"meta_size 4 29807262208" is from 10 months ago, when we had 0.6.4.2. Last arcstats, from this month, are after upgrading to 0.6.5.7 and yes, there was a reboot during this, but it was over a week ago. From this time there was a plenty of daily backup cycles, so metadata should already resist in ARC, but it isn't (hit rate below 50%). In general I can say that 0.6.4.2 had better metadata feeding/not evicting policy than 0.6.5.7. As you can see in earlier posts, that was about 90% metadata hit rate comparing to today below 50%.

Overall performance is better on 0.6.5.7, but I believe that if we had a metadata hit rate similar to 0.6.4.2, it would be much better still.

zielony360 commented 8 years ago

One more thing: please remember that I disabled prefetch shortly before the upgrade. The metadata hit rate coming from prefetch was good, but the data hit rate was not. On 0.6.4.2, disabling it improved performance a bit. Should I try to enable it again now on 0.6.5.7?

I also have experience with ZFS on Nexenta, and it tells me it would be great to split the prefetch_disable parameter into data_prefetch_disable and metadata_prefetch_disable. I would leave metadata prefetch enabled and disable data prefetch.
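
(If you want to try re-enabling it, the tunable can be flipped at runtime; a small sketch:)

echo 0 > /sys/module/zfs/parameters/zfs_prefetch_disable   # re-enable prefetch
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable   # disable it again if latency gets worse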

zielony360 commented 8 years ago

@dweeezil @kernelOfTruth @ryao I ran some tests to see whether metadata misses can cause the high read latency. I added one 200 GB SSD to the pool as a cache device and set primarycache=metadata. As expected, the metadata miss rate dropped from about 60% to <10%, but that setting made things worse: higher latencies. So I set primarycache back to "all" and performance came back to previous levels. The L2ARC hit rate was and is very low: <10%.
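
(A sketch of that test, with a hypothetical device path for the SSD:)

zpool add backup cache /dev/disk/by-id/ata-EXAMPLE-SSD   # add the 200 GB SSD as L2ARC
zfs set primarycache=metadata backup
# ...after observing the higher latencies, revert (and optionally drop the cache device):
zfs set primarycache=all backup
zpool remove backup /dev/disk/by-id/ata-EXAMPLE-SSD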

What is strange is that the 3 TB 7200 RPM disks are still not heavily utilized, while the zvols are 100% busy (iostat). Example output for the HDDs:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0              3,30     0,00   31,10   72,60     0,16     1,59    34,41     0,79    7,60   12,77    5,39   2,18  22,56
dm-1              3,70     0,00   25,50   56,00     0,12     1,52    41,33     0,74    9,08   12,36    7,59   2,59  21,08
dm-10             1,50     0,00   30,10   74,40     0,15     1,58    33,94     0,85    8,10   13,55    5,90   2,45  25,56
dm-11             3,80     0,00   29,80  217,30     0,15     1,47    13,44     3,53   13,12   46,81    8,50   2,48  61,28
dm-13             3,50     0,00   24,80   58,90     0,11     1,52    40,08     0,79    9,41   13,06    7,88   2,68  22,44
dm-14             3,60     0,00   24,70   53,10     0,12     1,52    43,16     0,65    8,38   11,68    6,85   2,50  19,48
dm-15             1,10     0,00   24,60   61,00     0,10     1,45    37,04     1,18   13,82   16,50   12,73   3,60  30,84
dm-16             1,90     0,00   30,60   76,70     0,15     1,58    33,17     0,79    7,38   13,52    4,93   2,11  22,64
dm-17             1,80     0,00   28,30   60,00     0,13     1,48    37,31     0,79    8,96    8,35    9,25   2,51  22,20
dm-18             1,70     0,00   27,50   60,20     0,13     1,48    37,73     0,86    9,82    8,39   10,48   2,62  23,00
dm-19             1,30     0,00   26,60   58,60     0,12     1,48    38,52     0,82    9,69    8,86   10,06   2,73  23,24

As a reminder, I have 5 x raidz2 (10 disks each) in the pool. I think read latencies should be lower as long as the HDDs are not fully utilized. Also look at avgrq-sz above, which, with a 128k volblocksize on each zvol, could arguably be larger (?).

# cat /proc/spl/kstat/zfs/arcstats 
6 1 0x01 91 4368 14750376509 1720663407183591
name                            type data
hits                            4    1295222328
misses                          4    700709334
demand_data_hits                4    1223555127
demand_data_misses              4    614111080
demand_metadata_hits            4    71667201
demand_metadata_misses          4    86598241
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    0
prefetch_metadata_misses        4    13
mru_hits                        4    638690873
mru_ghost_hits                  4    44559459
mfu_hits                        4    656531455
mfu_ghost_hits                  4    12586217
deleted                         4    1190911893
mutex_miss                      4    5322
evict_skip                      4    527902
evict_not_enough                4    5276
evict_l2_cached                 4    3576559772160
evict_l2_eligible               4    149751922885120
evict_l2_ineligible             4    8192
evict_l2_skip                   4    0
hash_elements                   4    4283461
hash_elements_max               4    4557941
hash_collisions                 4    193735520
hash_chains                     4    462626
hash_chain_max                  4    7
p                               4    63362530560
c                               4    67586699264
c_min                           4    33554432
c_max                           4    67586699264
size                            4    67610404512
hdr_size                        4    621472624
data_size                       4    65950842880
metadata_size                   4    614019072
other_size                      4    159147720
anon_size                       4    47857664
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    61678938112
mru_evictable_data              4    61144170496
mru_evictable_metadata          4    942080
mru_ghost_size                  4    5918584832
mru_ghost_evictable_data        4    106561536
mru_ghost_evictable_metadata    4    5812023296
mfu_size                        4    4838066176
mfu_evictable_data              4    4758568960
mfu_evictable_metadata          4    32768
mfu_ghost_size                  4    61667744256
mfu_ghost_evictable_data        4    55804297216
mfu_ghost_evictable_metadata    4    5863447040
l2_hits                         4    2626898
l2_misses                       4    113475855
l2_feeds                        4    367717
l2_rw_clash                     4    0
l2_read_bytes                   4    88977219584
l2_write_bytes                  4    2252755720192
l2_writes_sent                  4    340705
l2_writes_done                  4    340705
l2_writes_error                 4    0
l2_writes_lock_retry            4    91
l2_evict_lock_retry             4    39
l2_evict_reading                4    0
l2_evict_l1cached               4    1206440
l2_free_on_write                4    9611
l2_cdata_free_on_write          4    5277
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    386570286080
l2_asize                        4    175942882816
l2_hdr_size                     4    264922216
l2_compress_successes           4    29507528
l2_compress_zeros               4    0
l2_compress_failures            4    9114103
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    835584
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1659561632
arc_meta_limit                  4    50690024448
arc_meta_max                    4    18603087336
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    2112081920

@dweeezil We also did the test you suggested: 100% reads. Latencies were high even with that I/O profile. No more than 40 MB/s for random reads and no more than 150 MB/s for sequential reads.

I have been on 0.6.5.7 the whole time now.

zielony360 commented 8 years ago

I have found out why reads are needed while making a backup. Veeam uses a reverse incremental backup method, so data is first written to a separate file (at the NTFS level) and then copied into another, bigger file. Unfortunately, ZFS zvols do not implement XCOPY (#4042), so the data has to be copied physically.

An ideal workaround for this, until XCOPY is implemented, would be caching recent writes in the ARC/L2ARC. Do you have an idea whether this can be tuned somehow for that purpose?
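
(There is no direct "cache recent writes" switch, but the L2ARC feed rate can be raised so that freshly written blocks stand a better chance of landing on the SSD before they are evicted; a cautious sketch with example values, assuming the cache device from the earlier test is still in the pool:)

echo $((64 * 1024 * 1024))  > /sys/module/zfs/parameters/l2arc_write_max    # feed more per interval than the default
echo $((128 * 1024 * 1024)) > /sys/module/zfs/parameters/l2arc_write_boost  # higher rate while the ARC is still warming up
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch                        # allow prefetched/streaming buffers into the L2ARC
zfs set secondarycache=all backup                                           # cache both data and metadata on the SSD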