openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/

Does cache work? #114

Closed alanruttenberg closed 10 years ago

alanruttenberg commented 10 years ago

My testing setup: 3x1.5TB Western Digital Green drives in a port-multiplier box, configured as raidz1. A Kingston HyperX 3K 120 GB SATA III 2.5" SSD connected over SATA III, with two partitions: an 8G slog and the rest allocated to cache. Disks are 60% full. Fast machine.

A find command executed on the root, looking for a non-existent file name. I would expect that after running this a few times it would become much faster. It doesn't. zpool iostat shows variable read rates of maybe 400K-1.5M per disk per second on the WDs and <100K/s on the SSD.
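For concreteness, the test amounts to something like the following sketch (the file name is a placeholder, and zztop is the pool name that appears later in this thread):

    # cold pass, then a warm pass that should be faster if caching helps
    time find / -name 'no-such-file-anywhere' 2>/dev/null
    time find / -name 'no-such-file-anywhere' 2>/dev/null

    # in another terminal, watch per-vdev activity while the find runs
    zpool iostat -v zztop 5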

Is this expected?

ilovezfs commented 10 years ago

Increase the size of the ARC and maybe you will see a better result. You cannot control this via sysctl yet.

https://github.com/zfs-osx/zfs/blob/master/module/zfs/arc.c#L4193

Lower the 3 to a 2 or 1, or comment out the line entirely for 0.
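To see what the module is actually using after rebuilding, the zfs.arc_* and zfs.l2arc_* sysctls quoted further down in this thread can be dumped directly (which names are exported may vary by build):

    # dump all ARC / L2ARC related sysctls exposed by the kext
    sysctl -a | egrep 'zfs\.(arc|l2arc)'

    # or query individual values
    sysctl zfs.arc_meta_used zfs.l2arc_write_max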

alanruttenberg commented 10 years ago

Even with 0, and with zfs.l2arc_write_max: 33554432 and zfs.l2arc_write_boost: 33554432,

I'm not seeing much, if any, gain. Another example: two consecutive Disk Utility verifications of a sparsebundle Time Machine image on the pool seem to hit the disks just as much. About 3G of cache was written the first time, and the cache is being read occasionally according to iStat Menus, but it doesn't seem to reduce reads from the disks much.

I've 16G of RAM, and zdb says the ARC limit (arc_c_max) is set to 8589934592.

I would expect that the second time the disk image was checked I would see little to no I/O to the disks.

zfs.arc_max: 0
zfs.arc_min: 0
zfs.arc_meta_used: 18977288
zfs.arc_meta_limit: 268435456
zfs.l2arc_write_max: 33554432
zfs.l2arc_write_boost: 33554432
zfs.l2arc_headroom: 2
zfs.l2arc_feed_secs: 1
zfs.l2arc_feed_min_ms: 200
zfs.l2arc_noprefetch: 1
zfs.l2arc_feed_again: 1
zfs.l2arc_norw: 0
zfs.anon_size: 182272
zfs.anon_metadata_lsize: 0
zfs.anon_data_lsize: 0
zfs.mru_size: 56010752
zfs.mru_metadata_lsize: 925696
zfs.mru_data_lsize: 54263808
zfs.mru_ghost_size: 1017466880
zfs.mru_ghost_metadata_lsize: 10309632
zfs.mru_ghost_data_lsize: 1007157248
zfs.mfu_size: 1007128576
zfs.mfu_metadata_lsize: 5164544
zfs.mfu_data_lsize: 1000342016
zfs.mfu_ghost_size: 56298496
zfs.mfu_ghost_metadata_lsize: 2558976
zfs.mfu_ghost_data_lsize: 53739520
zfs.l2c_only_size: 2315508224
zfs.vnops_osx_debug: 0

alanruttenberg commented 10 years ago

Here are the last few results of zpool iostat -v zztop 60:

                capacity     operations    bandwidth
    pool       alloc   free   read  write   read  write
    ---------  -----  -----  -----  -----  -----  -----
    zztop      2.36T  1.72T    121     15  15.1M  62.3K
      raidz1   2.36T  1.72T    121     15  15.1M  62.3K
        disk6      -      -     51      4  4.81M  57.7K
        disk5      -      -     53      4  5.17M  57.5K
        disk7      -      -     53      4  5.16M  57.1K
    logs           -      -      -      -      -      -
      disk1s3   128K  6.56G      0      0      0      0
    cache          -      -      -      -      -      -
      disk1s2  3.33G   101G     45      2  5.52M   189K
    ---------  -----  -----  -----  -----  -----  -----

                capacity     operations    bandwidth
    pool       alloc   free   read  write   read  write
    ---------  -----  -----  -----  -----  -----  -----
    zztop      2.36T  1.72T    139      6  17.4M  11.9K
      raidz1   2.36T  1.72T    139      6  17.4M  11.9K
        disk6      -      -     57      1  5.44M  18.9K
        disk5      -      -     60      1  5.89M  18.6K
        disk7      -      -     62      1  6.08M  19.0K
    logs           -      -      -      -      -      -
      disk1s3   128K  6.56G      0      0      0      0
    cache          -      -      -      -      -      -
      disk1s2  3.35G   101G     13      3  1.55M   448K
    ---------  -----  -----  -----  -----  -----  -----

                capacity     operations    bandwidth
    pool       alloc   free   read  write   read  write
    ---------  -----  -----  -----  -----  -----  -----
    zztop      2.36T  1.72T    115     10  14.4M  20.5K
      raidz1   2.36T  1.72T    115     10  14.4M  20.5K
        disk6      -      -     55      2  5.09M  32.9K
        disk5      -      -     49      2  4.54M  31.9K
        disk7      -      -     52      2  4.74M  32.9K
    logs           -      -      -      -      -      -
      disk1s3   128K  6.56G      0      0      0      0
    cache          -      -      -      -      -      -
      disk1s2  3.35G   101G     44      0  5.49M    648
    ---------  -----  -----  -----  -----  -----  -----

                capacity     operations    bandwidth
    pool       alloc   free   read  write   read  write
    ---------  -----  -----  -----  -----  -----  -----
    zztop      2.36T  1.72T    106      9  13.2M  31.7K
      raidz1   2.36T  1.72T    106      9  13.2M  31.7K
        disk6      -      -     50      2  4.46M  31.9K
        disk5      -      -     47      2  4.35M  32.7K
        disk7      -      -     49      2  4.36M  31.0K
    logs           -      -      -      -      -      -
      disk1s3   128K  6.56G      0      0      0      0
    cache          -      -      -      -      -      -
      disk1s2  3.35G   101G     48      0  5.93M  5.92K
    ---------  -----  -----  -----  -----  -----  -----


alanruttenberg commented 10 years ago

BTW, is it clear that I meant L2ARC? I did another test yesterday in which I had a script copy a 30G file 10 times in succession. Over the 90 minutes the test took, the L2ARC had 25G written to it, but only 1G read.
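A rough sketch of that kind of copy test, assuming a 30G source file and a scratch directory on the pool (both paths are placeholders):

    # copy the same large file ten times in a row; its blocks should be
    # hot in the ARC/L2ARC after the first pass if caching is effective
    for i in 1 2 3 4 5 6 7 8 9 10; do
        cp /Volumes/zztop/bigfile.bin /Volumes/zztop/scratch/copy-$i.bin
    done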

rottegift commented 10 years ago

On 17 Jan, 2014, at 19:02, Alan Ruttenberg notifications@github.com wrote:

> copy a 30G file 10 times in succession.

zfs does not normally make prefetched data eligible for l2arc, with good reason: prefetched, sequential reads cost rotating media very few IOPS, and l2arc exists almost entirely to reduce the number of IOPS going to rotating media.

> Over the 90 minutes the test took, the L2ARC had 25G written to it, but only 1G read.

Tunings that aggressively populate the l2arc just fill up the l2arc (and the arc itself with pointers to l2arc data) with blocks that have almost no hope of cache hits.

Generally, copying prefetch and transient bursts of blocks to l2arc only reduces the effective size of the arc and imposes wear on the l2arc media. With arc_max small by default, and fragile above a couple of GiB (especially with the master rather than the xnuzones spl branch), trying to fill up l2arc space just because it's there is likely to be counterproductive.

If you want to exercise your l2arc just to see it work, a workload involving unpacking and repacking huge archives of many small files should cause the zpool iostat -v counters to rise. Keep an eye on the operations columns rather than the bandwidth columns; the higher each l2arc vdev's numbers are there, the more responsive your pool should feel compared to when there is no l2arc.

The MacPorts daily ports tarball is a good choice: http://distfiles.macports.org/ports.tar.gz
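A sketch of that exercise, assuming the pool is mounted at /Volumes/zztop (the path and the name of the unpacked top-level directory are assumptions):

    cd /Volumes/zztop/test
    curl -O http://distfiles.macports.org/ports.tar.gz
    tar xzf ports.tar.gz                 # unpack many small files onto the pool
    tar czf ports-copy.tar.gz ports      # repack them; top-level dir name assumed
    zpool iostat -v zztop 5              # in another terminal: watch the cache vdev's operations columns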

alanruttenberg commented 10 years ago

I'll try that and report back. This story still doesn't make sense to me, though. I do understand your general considerations, but there are things that I really don't understand (certainly a pointer to something to read about what to expect and how to tune the l2 cache would be welcome).

The first is the behavior of this specific test, during which the only activity to the disk (well, mostly) is the continued re-reading of the same 30G file.

As you will see from the Excel file I collated from the zpool iostat output [1], the majority of what was written to the L2ARC (25G) must come from the file I am copying, since there is no other traffic. Why am I not at least seeing that amount read back in subsequent reads? After all, I have guaranteed by the construction of the test that the cache will be hit: every block, 9 times (after the first read). What is being written to the L2ARC, and why?

The second is what to expect (or how to tune for) in the circumstances I am building the raid for. In my case it is a personal server. While the disks will have a large capacity (6TB), in general use I would expect a relatively small "working set". locate is run each night, as is my offsite backup scanner, so I would expect some of the working set to be the file system structures. Some would be applications, of which I typically use a repeated set over any interval of a week or two: iTunes, Office, my IDE, a statistical program, Parallels with a single VM. The rest would be the files I typically use during that time span. My expectation is that this "working set" is a fraction of the size of the SSD I have given to L2ARC; generously, 50 of the 100G that is available.

I would expect, relatively quickly, that access to these files would be approximately as fast as storing them directly on an SSD = Fast! But in all my observation of the behavior of zpool iostat, I rarely see much activity reading from the L2ARC. Particulars of my setup are in [2], hopefully with enough info to see what's happening.

Thanks for taking the time to respond

[1] https://dl.dropboxusercontent.com/u/4905123/ZFS%20Cache%20Test-2014-01-17.xlsx [2] https://dl.dropboxusercontent.com/u/4905123/zfs%20debug%20info-2014-01-17.txt

rottegift commented 10 years ago

There is an excellent and useful comment block in the source itself.

https://github.com/zfs-osx/zfs/blob/master/module/zfs/arc.c#L4418

Note very very carefully the third sentence, starting on line 4423.

The tunables, which you will not really improve on (if you think you actually do better with a given real workload, instrument it carefully and report back), are found starting at line 4539. You probably should not touch them at all, although the worst that will happen is that you will waste resources, especially scarce ARC memory, for nearly zero improvement.
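For inspecting (rather than changing) those tunables on a running system, the zfs.l2arc_* sysctls quoted earlier in this thread can be read directly, along the lines of:

    # current L2ARC feed tunables, using the names from the sysctl dump above
    sysctl zfs.l2arc_write_max zfs.l2arc_write_boost zfs.l2arc_headroom \
           zfs.l2arc_feed_secs zfs.l2arc_feed_min_ms zfs.l2arc_noprefetch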

"SSD = Fast"

Depends on the SSD. Mostly they're interesting because they give you a large decrease in seek latency and consequently a large increase in small random read performance. Rotating disks are still fairly competitive with SSDs for large sequential reads. L2ARC's design explicitly takes that into account, but that does not help people with really large SSDs and read workloads that are sporadically random; they should consider using them as ordinary vdevs instead (as in zpool create -o ashift=13 foo mirror ssd1 ssd2).
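For contrast, a sketch of the two approaches (the pool, device, and partition names are simply the ones used elsewhere in this thread and are placeholders for your own layout):

    # SSDs as a separate pool of ordinary vdevs, as suggested above
    zpool create -o ashift=13 foo mirror ssd1 ssd2

    # versus attaching SSD partitions to an existing pool as cache and log devices
    zpool add zztop cache disk1s2
    zpool add zztop log disk1s3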

alanruttenberg commented 10 years ago

Thanks for responding. I will have a close read.

A first quick scan finds, in the description of read priorities: "Read requests are satisfied from the following sources, in order: 1) ARC, 2) vdev cache of L2ARC devices, 3) L2ARC devices, 4) vdev cache of disks, 5) disks."

The test was chosen deliberately so as to make the ARC inadequate to satisfy all requests. It seems to me that, given the priorities above and the specifics I document, there would be much more reading from the L2ARC.

The subjective experience is that the pool is very slow (much slower than is observed in this test) for many operations, reading from the disks much more often than I would expect; slower, in fact, than the FireWire Drobo I was trying to replace. However, I wasn't sure how to construct a test to document that. I chose the particulars of this test because I considered it the easiest way to get a handle on what was happening with the L2ARC. In any case, I will think further. Certainly any tooling that lets me see what's happening with the various caches would be appreciated.

It may be that my expectations for ZFS were mistaken, in that the behavior I'm looking for is something more like tiered storage, where often-accessed data ends up persisting in the L2ARC. It hadn't occurred to me that the cache in ZFS wouldn't amount to similar behavior.

Regarding "SSD = Fast", I documented the specific SSD and setup in my notes. The theoretical speeds of the SSD as configured are twice (say 500M/s versus 240M/s) the absoluted maximum the rest of the pool can supply. In practice, the degradation from theoretical speed of this SSD is much less than that of the pool, which frequently gives 20M/s or (and very often) less speeds in more typical use.

The problem with using the SSDs as vdevs on their own is that I would then have to manually manage which files are stored there, which is onerous, and I would lose any benefit of the cache for operations on the bulk storage.

All told, I'm still rather suspicious that something isn't working correctly, based simply on the imbalance between cumulative reads and writes to the L2ARC in the test I've documented.

ilovezfs commented 10 years ago

If you are "suspicious" go test the identical configuration with the same tests on other platforms. Otherwise, this is a lot of theory going nowhere.

alanruttenberg commented 10 years ago

Sheepishly, I have to report that the zfs I was compiling and installing was not kicking in at all. The issue was the presence of zfs.readonly.kext in /System/Library/Extensions and the zfs.kext from the lundman installer in /Library/Extensions. This points out a weakness in "make install", which I was using: it ought to remove those.
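In case it helps anyone else who hits this, a few commands that would have exposed the problem (the paths are the standard OS X kext locations; the kext bundle names are the ones mentioned above):

    # which zfs kext, if any, is actually loaded
    kextstat | grep -i zfs

    # stale copies that can shadow a freshly built zfs.kext
    ls /System/Library/Extensions | grep -i zfs
    ls /Library/Extensions | grep -i zfs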

So now I can see the cache working. There are still things I don't understand about its behavior but I'm working my way through arc.c to try to understand it better.

What was the motivation for reducing the ARC down from the defaults? Is it that memory pressure from OS X is not hooked up to shrink the cache when needed? I do eventually get instability and crashes, which I think are due to memory starvation. sudo sysctl -w zfs.arc_max does work, however, so at least it can be controlled without recompiling now. Note, though, that the initial setting for zfs.arc_max (or arc_min) is not visible in the sysctl output: both are 0. After using sysctl to change it, the correct value is reported.
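Concretely, something like the following, using the 8 GiB figure reported by zdb earlier in the thread (the value is just an example):

    # raise the ARC ceiling to 8 GiB at runtime, then confirm it took effect
    sudo sysctl -w zfs.arc_max=8589934592
    sysctl zfs.arc_max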

I will note that the behavior I was seeing was not consistent with my settings (because the kext I was building was not being loaded, so those settings were never in effect), yet the comments I received from @ilovezfs on this issue suggested the behavior was correct. So perhaps don't dismiss reports like this so easily. While I'm not so easy to fluster, this kind of dismissal will certainly push away others who are more tentative in their engagement with zfs-osx.

This issue can be closed now, but if you want I will add a new issue about make install and the initial values of arc_max as seen in sysctl.

Oh yeah: if someone can give me a template for how to convert the statistics now being saved in the unused kstat structure so that they are instead visible through sysctl, I will apply the template to the rest and submit back the patch.