openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

MFU: data should be accessed more than twice to be in MFU (ARC and thus L2ARC) #16499

Open tkittich opened 2 months ago

tkittich commented 2 months ago

Describe the feature you would like to see added to OpenZFS

ZFS should be more adaptive when moving data from MRU into MFU. To be moved into MFU, data should be accessed maybe 3 times, or a user-configurable number of times, or an auto-self-tuned number of times.

How will this feature improve OpenZFS?

Since the MRU also acts as a write cache, data that is written and then read once can easily get counted as accessed twice and be moved into the MFU, evicting data that could be more valuable there. Two accesses may simply be too common a pattern. For example, many large files are downloaded and then read only once during a system upgrade; a file server in a cluster downloads files and then distributes them to other nodes; a user searches through large files twice; etc. Maybe the first write shouldn't count as an access. Or maybe this could be prevented simply by requiring 3 accesses instead: data that is accessed fewer than 3 times stays in the MRU.

The number of accesses could be user-configurable, or, better yet, adaptive and self-tuned. Maybe the access count of each block in the MRU could be tracked and used for self-tuning, e.g. MAX(3, the average or maximum access count over the last 1% of MRU accesses)? Using only the last 1% would let the threshold adapt quickly to changing workloads. Say there are 100,000 blocks of data in the MRU; only the last 1,000 accesses would be used (a sketch of this follows).
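
A minimal sketch of how such a self-tuned threshold might be computed, assuming a hypothetical ring buffer that records the access count observed on each recent MRU hit (all names are illustrative; nothing below is existing OpenZFS code):

#include <stddef.h>
#include <stdio.h>

#define WINDOW   1000    /* the last 1% of accesses on a 100,000-block MRU */
#define MIN_HITS 3       /* the fixed floor proposed above */

/* hypothetical ring buffer of per-block access counts seen on recent MRU hits */
static unsigned int recent_hits[WINDOW];
static size_t recent_pos;

static void
record_mru_access(unsigned int hits_for_block)
{
    recent_hits[recent_pos] = hits_for_block;
    recent_pos = (recent_pos + 1) % WINDOW;
}

/* threshold = MAX(MIN_HITS, average access count over the window) */
static unsigned int
promote_threshold(void)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < WINDOW; i++)
        sum += recent_hits[i];
    unsigned int avg = (unsigned int)(sum / WINDOW);
    return (avg > MIN_HITS ? avg : MIN_HITS);
}

int
main(void)
{
    /* a scan that touches every block twice keeps recorded counts at 2 */
    for (int i = 0; i < WINDOW; i++)
        record_mru_access(2);
    printf("threshold: %u\n", promote_threshold());    /* prints 3 */
    return (0);
}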

How to trigger the problem?

fio can be used to write 20 GiB of test data and then read it back, producing two accesses and polluting both the MFU and MRU (with --rw=read, fio first lays out the test file if it does not already exist - that is the write - and then reads it). The MFU data size increased suddenly after fio, presumably filled with test data that will never be needed again, and the MRU data size dropped just as suddenly.

                  | ARC size | MFU data size | MRU data size
before            | 54.1 GiB |    26.9 GiB   |   25.3 GiB
after fio         | 52.7 GiB |    46.6 GiB   |    4.3 GiB
after rm          | 31.5 GiB |    25.3 GiB   |    4.5 GiB

fio --rw=read --bs=1M --ioengine=libaio --iodepth=1 --size=20G --loops=1 --group_reporting --filename=./testxx --name=job1 --offset=0G
rm ./testxx

tkittich commented 2 months ago

If the first write doesn't get counted, maybe using MAX(2, (median access count of the last 1% of MRU accesses) + 1) as the criterion for moving from the MRU to the MFU could be scan-resistant for several passes: a scan that touches everything twice raises the median, and with it the bar for promotion. The current implementation is only resistant to a single scan pass.

shodanshok commented 1 month ago

I think that doing what is described here is not going to be very useful.

When writing, ZFS buffers up to zfs_dirty_data_max bytes in anonymous memory, taken (mostly) from the MRU. Then, when the writeout happens, these buffers become part of the MRU again. From the man page:

> Once this limit is exceeded, new writes are halted until space frees up ... Defaults to physical_ram/10, capped at zfs_dirty_data_max_max

That means that even a write-centric workload cannot really "purge" the MFU, as both throttling (see zfs_delay_min_dirty_percent) and capping (see above) are applied by ZFS. Moreover, to pollute the MFU with this kind of write-then-read workload, the written data would have to be entirely contained in the MRU first - otherwise, if only a part is cached, it would only trash the MRU itself.
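
For scale: with the documented default of physical_ram/10, a machine with 64 GiB of RAM buffers at most 6.4 GiB of dirty data before the cap kicks in. Below is a simplified sketch of the throttle curve described in the zfs(4) man page; the constants are illustrative defaults, not authoritative values, and the function is not the actual ZFS code.

#include <stdint.h>
#include <stdio.h>

/* illustrative values; the real ones are tunable module parameters */
static uint64_t zfs_dirty_data_max = 4ULL << 30;    /* e.g. capped at 4 GiB */
static uint64_t zfs_delay_min_dirty_percent = 60;   /* default 60% */
static uint64_t zfs_delay_scale = 500000;           /* nanoseconds */

/*
 * Once dirty data crosses zfs_delay_min_dirty_percent of zfs_dirty_data_max,
 * each write is delayed by roughly
 *     zfs_delay_scale * (dirty - min) / (zfs_dirty_data_max - dirty)
 * nanoseconds, growing without bound as dirty approaches the max, at which
 * point new writes halt entirely.
 */
static uint64_t
write_delay_ns(uint64_t dirty)
{
    uint64_t min = zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100;

    if (dirty <= min)
        return (0);                 /* no throttling yet */
    if (dirty >= zfs_dirty_data_max)
        return (UINT64_MAX);        /* writes are halted */
    return (zfs_delay_scale * (dirty - min) / (zfs_dirty_data_max - dirty));
}

int
main(void)
{
    /* example: 3 GiB dirty out of a 4 GiB max -> 300,000 ns per write */
    printf("%llu ns\n", (unsigned long long)write_delay_ns(3ULL << 30));
    return (0);
}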

Finally, it is my understanding that ZFS does not really count per-block cache hits. Rather, it has two lists: MRU and MFU. If anything already stored in the MRU is demand-read, then it is moved into the MFU. Adding such counters would be a significant change, complicating and slowing down a very time-critical path (the ARC read), which is already much slower than the Linux pagecache.

tkittich commented 1 month ago

> That means that even a write-centric workload cannot really "purge" the MFU, as both throttling (see zfs_delay_min_dirty_percent) and capping (see above) are applied by ZFS. Moreover, to pollute the MFU with this kind of write-then-read workload, the written data would have to be entirely contained in the MRU first - otherwise, if only a part is cached, it would only trash the MRU itself.

I am not that familiar with ZFS, so please correct me if I'm missing something. But it seems new data buffers are always added to the MRU except when the dataset is set not to cache. With this kind of write-then-read workload, all the data of the files will be in the MRU after the first write. When the files are later read from the MRU, they will all get promoted to the MFU and will evict more valuable data there.

> Finally, it is my understanding that ZFS does not really count per-block cache hits. Rather, it has two lists: MRU and MFU. If anything already stored in the MRU is demand-read, then it is moved into the MFU. Adding such counters would be a significant change, complicating and slowing down a very time-critical path (the ARC read), which is already much slower than the Linux pagecache.

There seem to be mfu_hits and mru_hits counters too. But if counting is too much, maybe a simple dontpromote flag could be added when a data buffer is first added to the MRU by a write. The first read later would just clear this flag, and only the next read would promote the buffer to the MFU (sketched below).
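
A minimal sketch of that flag idea, assuming a hypothetical dontpromote bit per buffer (the types and names below are illustrative, not the actual arc.c code):

#include <stdbool.h>
#include <stdio.h>

typedef enum { STATE_MRU, STATE_MFU } arc_state_t;

typedef struct {
    arc_state_t state;
    bool dontpromote;    /* set when the buffer enters the MRU via a write */
} buf_t;

/* called on a demand-read hit */
static void
on_read_hit(buf_t *b)
{
    if (b->state != STATE_MRU)
        return;
    if (b->dontpromote)
        b->dontpromote = false;    /* first read after the write: stay in MRU */
    else
        b->state = STATE_MFU;      /* second read: promote as usual */
}

int
main(void)
{
    buf_t b = { STATE_MRU, true };    /* buffer cached by a write */

    on_read_hit(&b);    /* 1st read: flag cleared, stays in MRU */
    on_read_hit(&b);    /* 2nd read: promoted */
    printf("%s\n", b.state == STATE_MFU ? "MFU" : "MRU");    /* prints MFU */
    return (0);
}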

amotin commented 1 month ago

This feels subjective to me, but not impossible. I guess in some write-only scenarios it could even be beneficial to consider just-written data as a separate state (uncached?) to evict immediately, or as a new state with its own size and ghost list for auto-adaptation. Needs thinking.

Meanwhile, I'd like to mention that aside from the second access itself there is currently a second factor for promotion -- the time since the last access, which must be more than 62ms (see ARC_MINTIME). I suppose this filters out multiple accesses that are really parts of one workload.
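
A simplified illustration of that gate, using a millisecond clock instead of the kernel ticks the real ARC_MINTIME check works in (names here are illustrative, not the actual arc.c code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ARC_MINTIME_MS 62    /* roughly the (hz>>4) tick window mentioned above */

typedef struct {
    bool in_mru;
    uint64_t last_access_ms;    /* time of the previous hit */
} buf_t;

/*
 * On an MRU hit, promote to the MFU only if the previous access was more
 * than ARC_MINTIME ago; back-to-back hits that are really parts of one
 * logical access are filtered out.
 */
static bool
should_promote(const buf_t *b, uint64_t now_ms)
{
    return (b->in_mru && now_ms - b->last_access_ms > ARC_MINTIME_MS);
}

int
main(void)
{
    buf_t b = { true, 1000 };    /* in MRU, last hit at t=1000ms */

    printf("%d\n", should_promote(&b, 1030));    /* 0: within the 62ms window */
    printf("%d\n", should_promote(&b, 1100));    /* 1: eligible for promotion */
    return (0);
}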

shodanshok commented 1 month ago

Disclaimer: this is my own understanding of how the ARC works, which may very well be wrong or incomplete.

> I am not that familiar with ZFS, so please correct me if I'm missing something. But it seems new data buffers are always added to the MRU except when the dataset is set not to cache.

This is correct.

> With this kind of write-then-read workload, all the data of the files will be in the MRU after the first write. When the files are later read from the MRU, they will all get promoted to the MFU and will evict more valuable data there.

Not "all data", only the tail of what you wrote which fits into MRU: MFU is not reduced on write unless you buffer writes for more than current MRU size (and write buffer is capped by zfs_dirty_data_max). If you re-read that part, sure, it will be promoted to MFU. But if you re-read all your data from the start, nothing will be promoted to MFU.

Sure, the "tail" can be so big to actually include all your data - in case of large MRU this is a possibility, and this seem what happened on your test (25G MRU vs 20G fio write size). However, please note that even after writing 20G, your MFU shrunk (after file remove) by 1.5G only.

An example on a test machine:

# import and fill MFU, ARC is capped at 2G
[root@localhost ~]# zpool import tank
[root@localhost ~]# dd if=/tank/test.img of=/dev/null bs=1M count=1900
1900+0 records in
1900+0 records out
1992294400 bytes (2.0 GB, 1.9 GiB) copied, 0.551855 s, 3.6 GB/s
[root@localhost ~]# dd if=/tank/test.img of=/dev/null bs=1M count=1900
1900+0 records in
1900+0 records out
1992294400 bytes (2.0 GB, 1.9 GiB) copied, 0.271027 s, 7.4 GB/s
[root@localhost ~]# arc_summary -p 1 | grep "ARC size\|Max size\|MRU data size\|MFU data size"
ARC size (current):                                    95.9 %    1.9 GiB
        Max size (high water):                           17:1    2.0 GiB
        MFU data size:                                 99.9 %    1.9 GiB
        MRU data size:                                < 0.1 %  512 Bytes

# write 1.9G and check ARC, MRU is at 212M
[root@localhost ~]# dd if=/dev/urandom of=/tank/test2.img bs=1M count=1900 status=progress
1643118592 bytes (1.6 GB, 1.5 GiB) copied, 3 s, 547 MB/s
1900+0 records in
1900+0 records out
1992294400 bytes (2.0 GB, 1.9 GiB) copied, 3.6506 s, 546 MB/s
[root@localhost ~]# arc_summary -p 1 | grep "ARC size\|Max size\|MRU data size\|MFU data size"
ARC size (current):                                    99.5 %    2.0 GiB
        Max size (high water):                           17:1    2.0 GiB
        MFU data size:                                 89.4 %    1.8 GiB
        MRU data size:                                 10.4 %  212.2 MiB

# re-reading the just written file does not change anything, MRU is at 219M
[root@localhost ~]# dd if=/tank/test2.img of=/dev/null bs=1M count=1900
1900+0 records in
1900+0 records out
1992294400 bytes (2.0 GB, 1.9 GiB) copied, 0.631887 s, 3.2 GB/s
[root@localhost ~]# dd if=/tank/test2.img of=/dev/null bs=1M count=1900
1900+0 records in
1900+0 records out
1992294400 bytes (2.0 GB, 1.9 GiB) copied, 0.595442 s, 3.3 GB/s
[root@localhost ~]# arc_summary -p 1 | grep "ARC size\|Max size\|MRU data size\|MFU data size"
ARC size (current):                                    99.8 %    2.0 GiB
        Max size (high water):                           17:1    2.0 GiB
        MFU data size:                                 89.1 %    1.8 GiB
        MRU data size:                                 10.8 %  219.2 MiB

> There seem to be mfu_hits and mru_hits counters too.

Interesting. I completely missed that, thank you for reporting.

> But if counting is too much, maybe a simple dontpromote flag could be added when a data buffer is first added to the MRU by a write. The first read later would just clear this flag, and only the next read would promote the buffer to the MFU.

Maybe the ARC_PREFETCH flag could be reused for that (it already prevents MFU promotion for prefetched buffers in a very similar manner).