trapexit / mergerfs

a featureful union filesystem
http://spawn.link

minfreespace #110

Closed. rubylaser closed this issue 9 years ago.

rubylaser commented 9 years ago

It appears that the minfreespace option doesn't work with epmfs unless I'm not understanding the options correctly. Does this allow a disk to completely fill and writes to fail if a path already exists on the disk? For example, I have two disks pooled, one that has a directory on it, and the other that doesn't. The one that has the directory on it only has 12GB available. What happens if I try to write 20GB to that same path?

trapexit commented 9 years ago

As the docs mention, minfreespace is used with the lfs and fwfs policies. It can be argued that it should apply to any policy, but currently it does not.

Correct. If you are using epmfs it does exactly that. It picks the drive which has the path in question and the most free space. If you write more than is available on the drive it will return ENOSPC on the write. mergerfs does not move files and continue writing the way mhddfs does; that complicates the runtime and the code. It will be looked at as an optional behavior at some point.
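
A quick way to see that behavior for yourself is to write more than the selected branch has free and watch the write fail partway through. The pool mount point /storage, the media directory, and the file name below are placeholders based on the example above, not anything mergerfs requires:

```
# "media" only exists on the branch with ~12GB free, so epmfs puts the file there.
# The write fails with ENOSPC ("No space left on device") once that branch fills up.
dd if=/dev/zero of=/storage/media/bigfile bs=1M count=20480
```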

ariselseng commented 9 years ago

So that means the free space on the drive which has the folder is effectively how much free space I have for that operation. The pool will report having lots of free space, but the program writing the 20GB file will suddenly report disk full even if the pool says otherwise. Right?

If this worked as it does in mhddfs, my SSD as the first drive would not fail under fwfs when it is nearly full; a big write would simply move to the next drive and continue.

rubylaser commented 9 years ago

Thanks for the speedy reply! Just so I completely understand: if I use lfs or fwfs with the minfreespace option set to 10GB and follow the same example above (12GB available on the disk with the least free space), and I write 20GB to the pool, will that also fail to write, or will mergerfs realize the disk with the least free space can't accommodate the whole file and instead write to the next available disk?

I just switched from epmfs to lfs to see if this works differently.

/mnt/data/*  /storage  fuse.mergerfs  defaults,allow_other,category.create=lfs,minfreespace=10G  0       0

I have many disks that are completely full, and if mergerfs can't roll over to another disk, I will have a lot of failed writes. Thanks!
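
For reference, here's the command-line equivalent of that fstab entry with the branches listed explicitly; the disk paths are placeholders for whatever /mnt/data/* expands to on a given system ("defaults" is an fstab-ism and is dropped here):

```
mergerfs -o allow_other,category.create=lfs,minfreespace=10G \
    /mnt/data/disk1:/mnt/data/disk2 /storage
```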

trapexit commented 9 years ago

There is no good way to report free space. Either you report the free space of the drive with the most available or you report the combined total. Both lie. I chose to go with the latter, given that it must be understood you're using a group of disks, each with its own limits. Otherwise use RAID0 or JBOD to get cross-drive striping.

So yes. The drive picked for an operation is the one used throughout. mhddfs will try, on ENOSPC errors, to find a drive with space and move the file there. Given that mergerfs has configurable policies, that's harder to do. Does the policy still get enforced, or do you just try to find the first drive the file may fit on? And it's not like the new drive you're writing to can't also fill up: the rest of the system doesn't stop while you move the file. And you don't know how much is to be written, so in the worst case you copy the file N times only to find out you've run out of space on the last drive. Under contention it's not clear what to do.

If minfreespace is 10G then that just means that on creates the drives with less than 10G free will be skipped. If you have 12G free it will pick that drive... if you write more than 12G it will fail with ENOSPC and that's the end of it. The next create will skip that drive, but the original file will not be automatically moved to a drive that might have enough space so the writes can continue.

trapexit commented 9 years ago

Remember that we are working with low-level operations. It's not known how much data is to be written. All that is seen is 'open', 'write' x 100, 'close', or the like. When mhddfs tries to write and gets ENOSPC it will try to find a drive with at least the space written already and, if found, will move the file there and then continue writing. This behavior is convenient at times but also bad in that trying to write 1 byte really writes however much needs to be moved. And if the drive selected ends up returning ENOSPC it will happen again until all drives are used up. mergerfs makes it more complicated: if you ask for "existing path w/ most free space", is it OK to just pick some random drive which happens to have space? You've said to use existing paths. Or do we say that under contention all rules are dropped? It makes the behavior more difficult to reason about, so I've explicitly left it out until all other aspects are ironed out.
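
If you want to see exactly what the filesystem sees, tracing a copy makes the point; the paths below are placeholders:

```
# The filesystem observes open, a stream of writes, then close --
# nothing in that sequence says how large the file will end up being.
strace -e trace=openat,write,close dd if=/tmp/bigfile of=/storage/media/bigfile bs=1M
```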

rubylaser commented 9 years ago

Okay, that makes sense. Maybe I should just set minfreespace to a higher value like 40GB to remove the risk when I'm copying large files to the pool.

I understand the issue better now. I can just see this being an issue for users with large pools with many disks (like me) trying to copy files to the pool and having writes fail. I'll have to think a bit more about how I can work around these issues for my use case.

Thanks again for your speedy and thorough replies!

trapexit commented 9 years ago

The easiest would be to use mfs rather than epmfs. Then you're limited only by the largest disk. As I mentioned, I'm willing to add an option to do the mhddfs "move file to a drive with enough space" behavior, but it needs to be determined how that will work. Do I ignore the policy and in effect just do mfs? That breaks the ep part of epmfs, but unless I do that it really won't work... since I would have selected the drive with the most free space containing that path to begin with.

rubylaser commented 9 years ago

I think the only way would be to check each disk with the path and see if any of them have the minfreespace available to accommodate the file. If they don't have the space you would fall back to mfs. All this checking sounds like it could be slow and error-prone. My reason for wanting this is to prevent having my files sprinkled all over my pool, which avoids the need to spin up a bunch of disks just to read a file.

trapexit commented 9 years ago

I was just about to suggest that. That's easy enough to do.

epmfs would filter not only by path existence but also by minfreespace. If no compatible drives are found, it falls back to mfs.

It's slightly more work to do that check but it won't slow things down. I'll add it in a future release. I've not released a new version on account of reworking how rename works. I hope to be done today and can toss this epmfs change in.
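
A minimal shell sketch of that selection logic, purely as an illustration (the branch list, the 10G threshold, and the helper function are assumptions, not mergerfs internals):

```bash
#!/usr/bin/env bash
# Sketch of "epmfs filtered by minfreespace, falling back to mfs".
BRANCHES=(/mnt/data/disk1 /mnt/data/disk2 /mnt/data/disk3)  # assumed branch list
MINFREE_KB=$((10 * 1024 * 1024))   # 10G expressed in 1K blocks
RELPATH="$1"                       # path being created, e.g. media/movie.mkv

pick() {  # $1 = "ep" restricts to branches where the parent dir exists and minfreespace holds
  local mode="$1" best="" best_free=0 b free
  for b in "${BRANCHES[@]}"; do
    free=$(df -Pk "$b" | awk 'NR==2 {print $4}')
    if [ "$mode" = "ep" ]; then
      [ -d "$b/$(dirname "$RELPATH")" ] || continue
      [ "$free" -ge "$MINFREE_KB" ] || continue
    fi
    if [ "$free" -gt "$best_free" ]; then best="$b"; best_free="$free"; fi
  done
  printf '%s\n' "$best"
}

target=$(pick ep)                      # epmfs, skipping branches under minfreespace
[ -n "$target" ] || target=$(pick mfs) # nothing compatible: fall back to plain mfs
echo "would create $RELPATH on $target"
```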

rubylaser commented 9 years ago

Well, if you can make that happen, that is the best of both worlds :)

ariselseng commented 9 years ago

I was thinking of using a small partition (maybe 20GB) from my SSD as the first drive with fwfs (minfreespace at maybe 4GB). But in cases where I am writing a big file (30-40GB) it will fail because the SSD will be full at 10-20GB. That could mean data loss, or maybe a broken upload. For me, writing a file bigger than the free space on the selected drive should not result in a broken file or data loss if there is lots of space on the other drives. I understand though that I may have a special use case :)

trapexit commented 9 years ago

The only way to address that is to do the mv on enospc. There isn't a way to know (practically anyway) how much data will be used. I could possibly respond to a fallocate by double-checking the disk the file is on and recreating it elsewhere, but that will only work in a subset of situations. With move on enospc... the SSD would be filled as the file is written and then, when ENOSPC is received, we'd copy that 20GB to a drive with at least 20GB free and continue writing. That would negate the speed gained from writing to the SSD. It seems unlikely that you'd be writing so much that you'd saturate the throughput of a spinning disk or that all this data proxying would in the end be worth the effort. Even USB2 gives you up to 30MB/s.

As I mentioned I'll be looking into move on enospc as a feature but this caching you're talking about doesn't sound like it should be necessary.

ariselseng commented 9 years ago

The main reason I want an SSD write cache is to have my disks sleep as long as possible and only write to them when the SSD is almost full or at a certain interval. As a bonus I should also see better speeds. And also: because I can ;)

ariselseng commented 9 years ago

There wouldn't be many times the mv on enospc would be needed. But a lack of it means I would need to dedicate half of the SSD just for minfreespace. Big files should also go on the SSD, and for that I would need at least 40GB of minfreespace, which basically means I can't use 40GB of my SSD. With mv on enospc, minfreespace could be something like 4GB, and if it gets full then nothing bad happens to the program writing the files.

trapexit commented 9 years ago

Spin-up I get, but I don't see it helping with speed much. You are limited by the slowest component, probably your network speed. Two-tier systems are often used in high-throughput environments, which I suspect you don't have (mergerfs wouldn't be good for that, living in userland). "Because I can" isn't a good argument for added complexity.

I'm not arguing mv on enospc isn't useful but it's not perfect and will lead to very inconsistent timings. I will need to make it work with policies as well which needs to be thought out.

rubylaser commented 9 years ago

You could do what you are asking with a caching disk easily if you don't mind having them as separate mountpoints. I'd just set up a nightly cronjob to sweep the files off your /cache mountpoint (SSD) to /storage (mergerfs pool) via rsync or mv. You could even roll a fancier version that uses inotify to trigger the action if you wanted to get crazy.
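
A minimal sketch of that nightly sweep, assuming /cache and /storage are the mount points mentioned above (the schedule and flags are just one reasonable choice):

```
# /etc/cron.d/cache-sweep -- drain the cache into the pool at 03:00, then remove emptied dirs
0 3 * * * root rsync -a --remove-source-files /cache/ /storage/ && find /cache/ -mindepth 1 -type d -empty -delete
```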

I thought what you were asking for in the other thread was a cache combination akin to the ZIL and L2ARC in ZFS. What you are really asking for is more like UnRAID's cache disk. The main reason that exists is that UnRAID is so slow to perform writes. That just isn't the case here, so I don't see the benefit.

Also, if you have hdparm set up, you can have your disks spin back down after as little as 30 minutes of inactivity anyway.
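
For example (the device name is a placeholder; hdparm encodes 30-minute units as the values 241-251):

```
# Spin the drive down after 30 minutes of inactivity (-S 241 = 1 x 30 min)
hdparm -S 241 /dev/sdX
```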

trapexit commented 9 years ago

He would still have the large file issue. mv on enospc is the only way to handle that transparently.

Even with mv on enospc you'd need something to rebalance. rsync isn't ideal, but I plan on creating a tool which would be mergerfs-aware and could be run on occasion to clean up. Otherwise the cache would fill up with small files and always hit the minimum size... ending up with the disks being hit each time.

rubylaser commented 9 years ago

+1, this is exactly what I wanted in the other thread with the tool to re-balance files. Thanks for the speedy replies, and I look forward to epmfs mode with minfreespace support. That seems to be the perfect mode for my home use.

ariselseng commented 9 years ago

I have written a script that is aware of the mergerfs mountpoint. It takes two parameters: an input folder and an output folder. ./rebalancer.sh "/mnt/smalldisk" ["/mnt/bigdisk"]. If $2 is not set, it will find the drive in the pool with the most space available.

rubylaser commented 9 years ago

What does the code look like?

ariselseng commented 9 years ago

@rubylaser https://github.com/cowai/mergerfs-rebalancer

Make sure to have no trailing slashes for the inputs.

Right now it finds files in the source folder (-a $path) older than 60 minutes (-t $min) and moves them either to the disk with the most space or to a specified folder (-b $path).
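
For readers skimming the thread, this is roughly the shape of the approach being described; it is not the linked script, and the branch glob, option handling, and paths are all assumptions:

```bash
#!/usr/bin/env bash
# Illustration only: move files older than 60 minutes from SRC into DST,
# or into the pool member with the most free space when DST is not given.
SRC="$1"                       # e.g. /mnt/smalldisk
DST="$2"                       # optional, e.g. /mnt/bigdisk
if [ -z "$DST" ]; then
  # pick the branch with the most available space (assumes no spaces in branch paths)
  DST=$(for b in /mnt/data/*; do
          df -Pk "$b" | awk -v b="$b" 'NR==2 {print $4, b}'
        done | sort -rn | head -1 | cut -d' ' -f2)
fi
# %P gives the path relative to SRC so rsync recreates the same layout under DST
find "$SRC" -type f -mmin +60 -printf '%P\0' |
  rsync -a --remove-source-files --from0 --files-from=- "$SRC"/ "$DST"/
```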

rubylaser commented 9 years ago

Nice! A handy set of shell scripts. Thanks for sharing :) Are you running this via cron hourly or in some other fashion?

Also, you could check the arguments and strip trailing slashes in the script if the user includes them.

ariselseng commented 9 years ago

@rubylaser I have not started using it yet, as I also want to be able to specify how much of the disk it should drain and how full the target drive should become. I was thinking I could use this for an SSD which I put at the top of the pool along with the "ff" create mode. I want it to be able to calculate how much it needs to drain from the SSD. If daily writes were around 10GB, I could make it so that the SSD always has 15-20GB free. We want the SSD to be as full as possible without making it unusable for writes, so that reads are also fast.

rubylaser commented 9 years ago

@trapexit Awesome work! I'll try to verify this tonight :)

rubylaser commented 9 years ago

@trapexit It's working like a dream. Thanks!