openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

arcstat_prefetch_data_hits is not incremented for MRU hits #12523

Open cmurphycode opened 3 years ago

cmurphycode commented 3 years ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Centos |
| Distribution Version | 3.10 |
| Kernel Version | 3.10 |
| Architecture | x86_64 |
| OpenZFS Version | 0.7.12 |

Describe the problem you're observing

arcstats shows a suspiciously large number of prefetch misses for a one-time sequential read workload.

Describe how to reproduce the problem

NOTE: I logged this against 0.7.12, but I don't see any code difference in this area in newer releases. I observed this accounting issue during a particular IO pattern I was running, but it can be simplified to the following:

# create a pool and a zvol to test
zpool create prefetch sdd
zfs create -s -V 50G -b 64K prefetch/test1
# put in a fair amount of random data, I prefer the following:
openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero | dd of=/dev/zvol/prefetch/test1 iflag=fullblock bs=65536 count=512000
# get the arc cleared
zpool export prefetch
zpool import prefetch
# run a sequential read once
dd if=/dev/zvol/prefetch/test1 iflag=direct bs=65536 count=65536 of=/dev/null
65536+0 records in
65536+0 records out
4294967296 bytes (4.3 GB) copied, 23.1273 s, 186 MB/s
# and then again
dd if=/dev/zvol/prefetch/test1 iflag=direct bs=65536 count=65536 of=/dev/null
65536+0 records in
65536+0 records out
4294967296 bytes (4.3 GB) copied, 3.3964 s, 1.3 GB/s

Running arcstat while this test happens shows what I mean. I left a one-sample gap between the two runs:

 arcstat.py -f time,c,arcsz,read,hits,miss,dmis,mmis,mru,mfu,dread,mread,mfu,pread,phit,pmis,mrug,dhit,mfug,mhit,eskip 10
    time     c  arcsz  read  hits  miss  dmis  mmis   mru   mfu  dread  mread   mfu  pread  phit  pmis  mrug  dhit  mfug  mhit  eskip  
20:28:15   21G    38M     1     1     0     0     0     0     1      1     1     1      0     0     0     0     1     0     1      0  
20:28:25   21G   936M  4.9K  3.6K  1.3K     9     9  1.2K  2.3K   3.6K  2.6K  2.3K   1.3K     0  1.3K     0  3.6K     0  2.6K      0  
20:28:35   21G   2.6G   10K  7.9K  2.8K     2     2  2.7K  5.2K   7.9K  5.7K  5.2K   2.8K     0  2.8K     0  7.9K     0  5.7K      0  
20:28:45   21G   4.1G  9.3K  6.9K  2.5K     2     2  2.4K  4.5K   6.9K  4.9K  4.5K   2.5K     0  2.5K     0  6.9K     0  4.9K      0  
20:28:55   21G   4.1G     1     1     0     0     0     0     1      1     1     1      0     0     0     0     1     0     1      0  
20:29:05   21G   4.1G   26K   26K     8     8     8  6.6K   13K    19K   13K   13K   6.6K  6.6K     0     0   19K     0   13K      0  
20:29:15   21G   4.1G     1     1     0     0     0     0     1      1     1     1      0     0     0     0     1     0     1      0  

Notes: The first time we read the data, we have to do the IO to the vdevs, so it's slower. That's fine, but we should be getting a nice prefetch benefit too. We can see in iostat that ZFS is indeed trying quite hard to prefetch: note the avgqu-sz on zd32 compared to sdd (the vdev).

08/30/21 21:15:42
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.05    0.00    1.68    9.35    0.04   88.88

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.10    0.00    2.80     0.00    14.65    10.46     0.00    1.21    0.00    1.21   0.36   0.10
sdd               0.00     0.00 1325.30    0.00 162446.00     0.00   245.15    27.49   20.71   20.71    0.00   0.67  88.41
zd32              0.00     0.00 2537.40    0.00 162393.60     0.00   128.00     0.77    0.30    0.30    0.00   0.30  77.16

08/30/21 21:15:52
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    1.61   10.64    0.04   87.68

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.20    1.80     0.80    10.00    10.80     0.01    3.00   22.50    0.83   2.35   0.47
sdd               0.00     0.00 1457.40    0.00 180635.20     0.00   247.89    31.04   21.30   21.30    0.00   0.68  99.57
zd32              0.00     0.00 2820.60    0.00 180518.40     0.00   128.00     0.87    0.31    0.31    0.00   0.31  86.80

(For the second run-through, there is no activity on the backing vdev because the ARC is large enough to hold the entire dataset.)

So prefetching itself is working fine, but the accounting seems wrong. In the first test, we should be getting at least some prefetch hits, but we get essentially zero; instead we see a 100% demand hit rate, which I'd expect to be near zero. In the second test, we get a 100% prefetch hit rate, plus some demand hits. I think the second result makes sense, since it just depends on whether a prefetch got fired off before the demand read could be satisfied from the ARC.
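
For context, here is a minimal standalone model (mine, not the actual arc.c code) of how I understand an ARC hit gets attributed: exactly one of four hit counters is bumped, chosen by a per-buffer prefetch flag and by whether the buffer holds data or metadata. As far as I can tell, arcstat's phit/dhit columns are derived from these prefetch_*/demand_* hit kstats.

```c
/*
 * Minimal standalone model (not the actual OpenZFS code) of how I
 * understand an ARC hit is attributed: exactly one of four counters is
 * bumped, chosen by a per-buffer "prefetch" flag and by whether the
 * buffer is data or metadata, at the moment the hit is accounted.
 */
#include <stdbool.h>
#include <stdio.h>

struct arcstats_model {
	unsigned long demand_data_hits;
	unsigned long demand_metadata_hits;
	unsigned long prefetch_data_hits;
	unsigned long prefetch_metadata_hits;
};

/* Bump whichever counter matches the buffer's state at accounting time. */
static void
account_hit(struct arcstats_model *s, bool prefetch_flag, bool is_metadata)
{
	if (prefetch_flag)
		is_metadata ? s->prefetch_metadata_hits++ : s->prefetch_data_hits++;
	else
		is_metadata ? s->demand_metadata_hits++ : s->demand_data_hits++;
}

int
main(void)
{
	struct arcstats_model s = { 0 };

	/*
	 * A data buffer whose prefetch flag was cleared before accounting
	 * is credited as a demand hit, even if a prefetch brought it in.
	 */
	account_hit(&s, false, false);
	printf("demand_data_hits=%lu prefetch_data_hits=%lu\n",
	    s.demand_data_hits, s.prefetch_data_hits);
	return (0);
}
```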

The first result, where no credit is given for prefetches, may be due to the following code: https://github.com/openzfs/zfs/blob/zfs-2.1-release/module/zfs/arc.c#L5577 https://github.com/openzfs/zfs/blob/zfs-2.1-release/module/zfs/arc.c#L6070

Note that any time we are about to increment the prefetch/demand hit/miss data/metadata counters, we have called arc_access immediately beforehand. By my simplistic reading of https://github.com/openzfs/zfs/blob/zfs-2.1-release/module/zfs/arc.c#L5413 , arc_access clears the prefetch flag from the ARC header in the MRU case. This seems to match what I'm seeing: the first time through, a prefetched buffer will be found in the MRU and accounted for as a regular MRU demand hit. The second time through, I guess the buffer will be in the MFU, where the prefetch flag is not cleared, so arc_read (if the read is a prefetch) can correctly increment the prefetch hit counter.
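
To make the suspected ordering concrete, here is a toy sketch (paraphrased, not the real arc.c) of the sequence I think produces the first-pass numbers: arc_access runs before the stat bump, and its MRU branch strips the prefetch flag, so the accounting that follows sees an ordinary demand buffer.

```c
/*
 * Toy sketch of the suspected ordering (paraphrased, not the real
 * arc.c): on a hit, arc_access() runs *before* the stat bump, and its
 * MRU branch clears the prefetch flag.
 */
typedef struct toy_hdr {
	int	on_mru;		/* 1 = buffer is on the MRU list, 0 = MFU */
	int	prefetch;	/* set when a prefetch brought the buffer in */
} toy_hdr_t;

static void
toy_arc_access(toy_hdr_t *hdr)
{
	if (hdr->on_mru) {
		/* Analogue of the MRU case around arc.c#L5413: */
		hdr->prefetch = 0;	/* prefetch flag cleared here */
		hdr->on_mru = 0;	/* buffer graduates toward MFU */
	}
	/* MFU case: the prefetch flag is left alone. */
}

static void
toy_arc_read_hit(toy_hdr_t *hdr)
{
	toy_arc_access(hdr);	/* (1) may clear hdr->prefetch */
	/*
	 * (2) accounting keys off hdr->prefetch:
	 *     0 -> demand_*_hits is bumped (what I see on the first pass)
	 *     1 -> prefetch_*_hits is bumped (what I see on the second
	 *          pass, once the buffer lives on MFU and keeps its flag)
	 */
}
```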

Thoughts? I realize that attributing "credit" to prefetch vs. demand is not always a clear-cut distinction, but the way it works here definitely did not match my expectation.

Thank you!

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

amotin commented 2 years ago

The prefetch and probably other related code was actually modified a number of times after the mentioned 0.7.12. For example, 891568c9907b3631f99a1079344bddf62ca70f56 from Mar 20 2021 started blocking prefetch reads on the second read when all data are already in the ARC, so the second read should be accounted only as demand hits.
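
In rough terms (a sketch of that effect, not the commit's actual code), the change means the prefetcher no longer issues speculative reads when the demand reads are already being satisfied from the ARC, so a fully cached second pass shows up purely as demand hits:

```c
/*
 * Rough paraphrase of the behavior described above (not the actual
 * dmu_zfetch code): when nothing the application reads misses the ARC,
 * no speculative (prefetch-flagged) reads are issued, so no prefetch
 * hits or misses are recorded for that pass.
 */
typedef struct toy_zfetch {
	int	demand_missed;	/* did the triggering demand read miss the ARC? */
} toy_zfetch_t;

static void
toy_zfetch_run(toy_zfetch_t *zs)
{
	if (!zs->demand_missed)
		return;		/* everything already cached: issue no prefetch I/O */
	/* otherwise issue the speculative (prefetch-flagged) reads here */
}
```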

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.