trapexit / mergerfs

a featureful union filesystem
http://spawn.link

[Request] Simple Block based Cache #373

Closed jetbalsa closed 6 years ago

jetbalsa commented 7 years ago

Mostly for use with slow branches on top of fast ones, and a little larger than what the VFS can do. Even just a basic LRU-based block cache would be nice to have. And a tunable look-ahead where, once X% of a file has been read, it just starts pulling the whole thing down before the accessing program has requested it.

trapexit commented 7 years ago

Block cache? mergerfs is a simple unioning of drives. It has no copy-on-write behaviors and doesn't work at the block level.

Wouldn't it be easier to use a proper block-caching solution like dm-cache or bcache on top of your hard drives?

trapexit commented 7 years ago

Some people have or have talked about doing the following:

  1. Add an SSD to the mergerfs pool
  2. Configure the create policy to favor the SSD
  3. Have an out-of-band process which moves files to the hard drives after some timeout

I've considered putting something more "official" together but haven't gotten around to it.
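
For illustration only, a minimal sketch of that kind of out-of-band mover might look like the following; the branch paths and the seven-day cutoff are assumptions, not anything mergerfs provides:

```python
#!/usr/bin/env python3
# Hypothetical "sweeper": demote files from the SSD branch to a slow branch
# once they haven't been modified for N days. Paths and policy are examples.
import shutil
import time
from pathlib import Path

CACHE_BRANCH = Path("/mnt/ssd")     # branch favored by the create policy
BACKING_BRANCH = Path("/mnt/hdd1")  # one of the slow branches
MAX_AGE = 7 * 24 * 3600             # demote after 7 days

def demote_old_files() -> None:
    now = time.time()
    for src in CACHE_BRANCH.rglob("*"):
        if not src.is_file() or now - src.stat().st_mtime < MAX_AGE:
            continue
        dst = BACKING_BRANCH / src.relative_to(CACHE_BRANCH)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Because both branches sit under the same mergerfs mountpoint, the
        # file stays visible through the pool for the whole move.
        shutil.move(str(src), str(dst))

if __name__ == "__main__":
    demote_old_files()
```

Something like this would run from cron or a systemd timer; a real version would also want to preserve ownership and skip files that are still open.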

trapexit commented 7 years ago

https://github.com/opinsys/dmcache-utils

Split your SSD into N partitions. Use dmc-mklvm to set them up to be caches. Add entries into /etc/dmctab for each.

Maybe I'll write something up for https://github.com/trapexit/backup-and-recovery-howtos

jetbalsa commented 7 years ago

The main issue is that using a network-based FUSE mount under mergerfs can be pretty slow, even for reads (like an S3 FUSE mount), and I've never been able to find a caching solution for really slow mounts.


rubylaser commented 7 years ago

@trapexit I think there are many people who would like directions for setting up some sort of caching mechanism. If you were willing to take the time to write something up, I'd be thrilled.

trapexit commented 7 years ago

It wasn't mentioned that it's a network filesystem being merged. That's a tougher one, for reads especially. If the file is only read once then a cache won't help. If it's read multiple times then the OS will cache it in RAM to the degree it can. And with media it's not common (I think) to watch or listen to the same file within a short time anyway, so even if you were to cache the file locally it wouldn't help.

What exactly is the access pattern? Something that used some heuristic to move files around could do what you want with mergerfs setup with the right policies.

As for a write-up... I was fooling with dm-cache last night and came up with a way to make it work for a typical mergerfs user who's merging hard drives, without needing to change much (and no need to move data around). I'll need to write some tooling to help but I'll look into it.

rubylaser commented 7 years ago

I look forward to seeing that and testing how it works. Thanks for your continued development!

jetbalsa commented 7 years ago

I would do an opt-in for recording usage on mergerfs for anything cache related.


trapexit commented 7 years ago

I don't follow. Instrument mergerfs to keep track of all file access?

trapexit commented 7 years ago

@jrwr What is it that you're looking for in terms of the behavior of a cache? A traditional cache behavior like what dm-cache offers won't help much with a remote filesystem unless you access the data regularly. If you're using mergerfs like most people are (to merge media) then you aren't likely to access the same data often. Data that is read only once in a while won't benefit.

What you'd want is something that knows what you'll be watching and transfers it to the cache before you want to watch it... but I'm not really sure what could figure that out. mtimes? atimes? Number of files opened in a directory relative to others? Even if that worked you'd want to do it before you started watching... because you've already said it can't read fast enough for one video to play (otherwise you'd not need the cache), let alone trying to copy all the files in a directory.

Seems to me a human needs to be involved. It could be somewhat automated but I don't see how you can predict what you'll want to be cached.

> a tunable look ahead where X% of a file has been read to just start pulling the whole thing down before the accessing program has requested it.

That already happens. It's called readahead. FUSE has some tweakable values but it's already pretty decent, and the data is cached automatically by the OS (though in your description that's not really necessary). And things like posix_fadvise can instruct the kernel to be more or less aggressive about it. The most I can see is calling posix_fadvise(fd,0,0,POSIX_FADV_SEQUENTIAL) and posix_fadvise(fd,0,0,POSIX_FADV_WILLNEED) on open of underlying files. I have dropcacheonclose to do the opposite in the latest code. I did experiment with doing that but didn't notice any real changes in performance, though my tests were not extensive. I think a better strategy would be to investigate the remote filesystem's read and readahead behavior.
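
Purely as a sketch of what those advice calls look like from user space (here via Python's os.posix_fadvise wrapper rather than the C calls mergerfs would make; the file path is hypothetical):

```python
import os

fd = os.open("/mnt/pool/media/some-large-file.mkv", os.O_RDONLY)
try:
    # Hint that access will be sequential, so the kernel can read ahead
    # more aggressively...
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    # ...and that the data will be wanted soon, nudging it to start fetching.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    data = os.read(fd, 1 << 20)  # subsequent reads benefit from the readahead
finally:
    os.close(fd)
```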

trapexit commented 7 years ago

@rubylaser Unfortunately it's somewhat complicated. Neither using dmsetup straight up nor LVM is well documented. Worse, it looks like it's not possible to use LVM cache without everything being LVM (origin and cache). With dmsetup the origin drive doesn't need to be under LVM control, but the way it works is clumsy. Practically speaking you have to change your init scripts to launch dmsetup before mounting your drives and to shut them down properly when the machine goes down. With LVM this isn't necessary. Perhaps it's possible to work with LVM only but with a raw origin device, though I've yet to figure out how.

dmcache-utils can manage this using dmsetup only, but it could be enhanced (and I think I found a size-calculation bug).

The thing that could be better managed is the creation of cache partitions. There need to be two partitions per cached drive. You could split a single SSD among your hard drives, but you'd obviously get less cache space per drive. It probably wouldn't help with reads since the common use case is mostly random access. It might help with writes, but it's not clear from what I've read that it will help with burst writes, which is probably the most common situation people are trying to optimize.

I'm starting to think that setting up mergerfs with an SSD as the first drive (or otherwise likely to be the drive selected for creating files), and then having an out-of-band app which watches the "cache drive" and copies to the "storage drives", may be the best solution.

rubylaser commented 7 years ago

@trapexit This makes perfect sense. Honestly, I like mergerfs + SSD drive first + a "sweeper" script better anyway. It's not reliant on another application (LVM) and should be easy for people to set up or remove if they want to. Thanks again for investigating :)

einarsi commented 7 years ago

This sounds very interesting, indeed. I like the idea of having an SSD which receives the files, then shifts them over to spinners at a later time, adhering to the specified policies as per usual.

I have a couple of suggestions for how a very simple read-cache can be implemented:

  1. In many cases the last received files will also soon be read. So: leave the most recently received files on the SSD. One might specify a number of criteria for when files should be removed from the SSD, e.g. "leave at least 20GB/10% free space on SSD", "keep files on SSD for no more than 48 hours", "remove first received files first", "remove least recently accessed files first", etc. (see the sketch after this comment).

  2. Allow for user-land tools to populate the SSD. That would add a ridiculous amount of flexibility. For instance, one could write scripts to ensure that the next couple of episodes of all the shows one has watched on Plex during the last month, or the photos one is working on in Lightroom, are made available on the SSD.

The user should not need to worry about whether a file is located on the SSD or not. So most likely the SSD needs to keep copies of the files after moving them to spinners. mergerfs must of course make sure that the SSD copy is always served, as well as ensure that the copies are kept up to date with respect to permissions, renaming, and other changes that may be performed on the "original" files located on spinners.
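
As a rough sketch of the first criterion above ("leave at least 20GB free on the SSD"), here simply moving the least recently accessed files off the SSD branch rather than keeping duplicate copies; the paths and the threshold are assumptions:

```python
import shutil
from pathlib import Path

SSD = Path("/mnt/ssd")        # cache branch
SPINNER = Path("/mnt/hdd1")   # one of the slow branches
MIN_FREE = 20 * 1024**3       # keep at least 20 GB free on the SSD

def evict_until_free() -> None:
    # Least recently accessed files are the first candidates to go.
    files = sorted((p for p in SSD.rglob("*") if p.is_file()),
                   key=lambda p: p.stat().st_atime)
    for f in files:
        if shutil.disk_usage(SSD).free >= MIN_FREE:
            break
        dst = SPINNER / f.relative_to(SSD)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(f), str(dst))

if __name__ == "__main__":
    evict_until_free()
```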

trapexit commented 7 years ago

re 1. What I've been putting together so far would order files on the cache by the most recent of their atime, mtime, and ctime. atime is sometimes turned off or altered for speed reasons (https://lwn.net/Articles/244829/), but that will be a disclaimer. There may need to be more to the heuristic because simply accessing a file doesn't mean much; you could just as well have run something to chown the files. Perhaps ctime should be ignored for that reason. It might also be useful to prioritize keeping files in a directory which has had its files recently accessed, since that may imply the user is consuming them (a TV show).
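
A tiny sketch of that ordering (the cache path is hypothetical):

```python
from pathlib import Path

def last_activity(path: Path) -> float:
    # Most recent of atime, mtime, and ctime, per the heuristic above.
    st = path.stat()
    return max(st.st_atime, st.st_mtime, st.st_ctime)

cache = Path("/mnt/ssd")
files = [p for p in cache.rglob("*") if p.is_file()]
# Files with the oldest activity sort first, i.e. the best demotion candidates.
for p in sorted(files, key=last_activity):
    print(p, last_activity(p))
```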

re 2. People can technically do that today. A trivial way to do it would be to mount a second mergerfs with the cache removed from the pool and rsync the files over, keeping the SSD at the front of the list in the original pool with the 'ff' policy. If that information is available (from Plex) then I'd be happy to look into it.

mergerfs in its default settings will perform behaviors across all files found where it can. Details are in the docs. rename and link are troublesome depending on the policy, but permissions, unlink, etc. are straightforward. mergerfs doesn't actively monitor for file discrepancies though. It's not that opinionated (someone may, for some reason, want different files on different mounts), but there are the mergerfs-tools like mergerfs.fsck which help find issues out of band.

trapexit commented 7 years ago

I wrote a utility to help with setting up dm caches.

It still needs work, but it can split a device up to be used across multiple slower drives and enable/disable the caches. Besides tweaks to the app, the last thing needed is some examples of setting it up at boot: a systemd service, etc.

My suggestion is use a VM with virtual devices to play with it before trying it on your main rig. I'm not personally using this so no guarantees :)

cosmicc commented 7 years ago

Hello all. My Debian Jessie media storage server is running mergerfs and I have been looking into an SSD caching solution for it as well. My primary criterion is that the cache solution I choose works flawlessly with mergerfs. I was researching bcache, EnhanceIO, and dm-cache, and then I came across this thread.

I never even thought about using mergerfs's create policies as the actual cache mechanism, with ff for the SSD plus a separate pool and an out-of-band app to manage the cache data. This gives a silly amount of configurability and control over what's in the cache (since the rules are created by the user in a script), and it is now on my radar as a solution. The only setback is the time to create and tweak the out-of-band cache-management script (especially for the read-cache part).

Aside from that, does solution #2, using dm-cache with this tool, work out of the box with mergerfs? I have never set up an SSD caching solution before, and it seems pretty straightforward, but with mergerfs on top of everything I was unsure how the cache mechanism would act (or whether it would work at all). I have seen other posts about caching working on the individual drives, but not about caching when you hit the mergerfs union mountpoint.

To sum it up, my questions:

  1. Has anyone tried to implement solution #1 yet, using just mergerfs pools and a cache-control script?

  2. Is dm-cache, with the dm-cache tool you created, an easy(er) way to get a cache up and running under a mergerfs union mountpoint?

Sorry if I'm off base anywhere here. Just trying to get a grasp on this and get my ducks in a row before I make an attempt. Thanks!

trapexit commented 7 years ago
  1. I think so, yes. https://github.com/cowai/mergerfs-rebalancer I want to make something more "official" but haven't gotten around to it.
  2. It's a very different approach. This actually just splits up a fast drive and helps automate the calling of commands to set it all up at boot. You'd then mount the cached drives in mergerfs, i.e. mergerfs knows nothing of the solution; the tool just makes it easier to set up. You could just as well use LVM cache in a similar fashion. This script is for those not using LVM.

The dmcache script "works" but has not been used "in production." I've not even benchmarked it. I also still need to provide some examples of having it run at startup. I've only been fooling with it in a VM so as to not fat-finger the destruction of one of my drives.

What's your usecase exactly?

cosmicc commented 7 years ago

I'm moving my media storage file server from RAID 5 on FreeNAS to an OMV + SnapRAID + mergerfs setup. The flexibility and bitrot protection from SnapRAID are much more important to me (in this case) than instant RAID failover protection. mergerfs seemed like a no-brainer choice to bring it all together, especially with the OMV plugin. It all just works great for me so far.

So I thought, to put the cherry on top of this, there has to be a way I can get an SSD (or SSDs) to speed this thing up in certain ways. I'm a Dell Compellent SAN installer (among other things), and one of their claims to fame is their use of tiered storage. Unused data trickles down to slower drives, more-used data trickles up to the faster drives (spread over 3 tiers in their case), and the top tier always takes in first writes for speed (the write-back cache equivalent).

Keep in mind, in my case it's a headless file server, serving bulky media files to many NFS (mostly) and Samba clients. Not much is done locally at all. I have a 10-gigabit network to play with also. I have thought about trying iSCSI as well, and even beefing up my network MTU with NFS and iSCSI to see what kind of speeds I can really push.

Now that I have time to think about option #1, I'm thinking this is the way to go for sure.
Essentially I'm looking to create two-tiered storage using the SSD as the first tier and my mergerfs mountpoint as the second. The SSD takes in all the writes for speed, and the read cache could be customized for media (and even Plex) to keep recently downloaded media on the SSD, the next episodes of watched shows (as explained above), etc., however granular one wanted to get about it. It would seem it would help with spindown times on the tier-2 drives as well (if you're into the spindown-drives thing).

Now the ultimate question is, how much would it really help and/or speed things up? I guess you don't really know until you try. But a very customizable SSD tiered-storage "cache"-type frontend to my media file server just seems tasty to me, and it feels like it could now be only a couple of scripts away. Again, that's assuming the end result ultimately makes a noticeable difference in speed and spindown times.

I need a new project to play with, and since I'm redoing my media file server (moving away from traditional RAID), I have taken a liking to this idea.

trapexit commented 7 years ago

I too like the scripted approach, but my fear is there isn't enough information to make (good) decisions. Are atime, mtime, and ctime enough? What I don't want to do is include access metrics in mergerfs just for this purpose. Even if I had better access metrics, would it be enough to know what's going on? That it's not just Plex or something scanning files?

cosmicc commented 7 years ago

I would think that for at least the write cache to work well, you would need creation time and access time. The write script would expire created items after a certain time if they haven't been accessed, and reset that timer if the file has been accessed within the expire window. Basically it keeps the file on the SSD until it hasn't been accessed for the set time, and then moves it down to the spinning drives.

As for pulling files up from the spinning drives into the cache (read cache), the only file attribute that would work again (in this media-server scenario) is atime. The script (or scripted daemon) would see the file getting accessed, move it up into the cache, and start the expire timer on it (basically watching the atime again). It's a crude start, but the only other logic I can think to throw at what gets pulled up from the spinning drives into the SSD would go beyond the regular file-access metrics and into whatever custom media application you use (like Plex). Even if it just started out with a simple bash or python script that can expire a file or folder from the cache or push it up into the cache. At this point, wouldn't it essentially just be a move command between the two mergerfs mounts (the ff one with the cache and the one without)? I haven't had time to play with the mechanics yet.
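
For reference, the mechanics of such a promotion might be as small as this sketch, though operating on the underlying branches rather than through the pooled mounts (since, as noted earlier, mergerfs by default acts across all branches where a file exists); the paths are assumptions:

```python
import shutil
from pathlib import Path

SSD = Path("/mnt/ssd")    # cache branch (first in the ff pool)
HDD = Path("/mnt/hdd1")   # slow branch holding the file today

def promote(relpath: str) -> None:
    # Move the file from the slow branch to the SSD branch at the same
    # relative path; the pooled path keeps resolving throughout.
    src = HDD / relpath
    dst = SSD / relpath
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))

promote("tv/SomeShow/S01E02.mkv")
```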

mtime and ctime, I think, don't even matter in this case, since this is basically catering to never-changing media files only. mtime and ctime could work for others, I suppose, by pulling changed files up into the SSD as well, assuming they would be accessed again soon.

ctime, mtime, and especially atime would go a long, long way though.

I know this whole thing is pushing outside the scope of mergerfs, since it caters specifically to media file storage, and mergerfs goes well beyond just that.

trapexit commented 7 years ago

I'm not familiar enough with Plex's access patterns, but if it opens files at all when scanning to read metadata then it would screw up the atimes. But yes, as a write cache it's pretty straightforward (though really only useful if you've got a 10+ Gbps network or are transferring between mergerfs pools, both rare I'd imagine).

I wrote mergerfs because I didn't like the existing solutions for managing my media. People may be using it for other things now but that's the origin :) I just have never had a need for caching so I've not focused on it till mergerfs had somewhat plateaued. Same with my other tools: scorch and bbf. Didn't like existing solutions so I wrote my own.

trapexit commented 7 years ago

I've added a section to the docs regarding file-level caching using time-based or cache-percentage expiring.

https://github.com/trapexit/mergerfs#caching