kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

IMPORTANT Message from an experienced BTRFS user, this is a must-read regarding RAID 1 ONLINE FUNCTIONALITY. #466

Open · fixapc opened this issue 2 years ago

fixapc commented 2 years ago

• Huge risks for 2-disk RAID 1 users who are upgrading their drives because of storage concerns. READ ALL TEXT.

• PROVEN IN A PRODUCTION ENVIRONMENT. I have been using BTRFS for multiple years now in a production environment, have stress-tested the filesystem pretty thoroughly, and am aware of most of its capabilities.

• WHY BTRFS IS AWESOME - Why it became my FS of choice. I have been so comfortable with it that I often hot-swapped and rebuilt larger multi-drive RAID 10 arrays; sometimes I yank multiple drives out at a time just to see what it can handle before going past the RAID redundancy limit, and it has never been a problem. I was even able to recover from a multi-disk RAID 10 failure after a shorted drive overloaded a PERC controller on an R720XD and took down not one but most of the drives in an array by shorting the RAID controller, the motherboard and other things that are not nice to see. In other words, it became my filesystem of choice after it recovered from one of the worst failures possible, one that had nothing to do with BTRFS. It took swapping the superblock, finding the root filesystem, a few drive scans and eventually a recovery mount, but I was amazed it was even possible. BTRFS is now confirmed as my FS of choice.

• NOW, THE PROBLEM OF DICTATORSHIP. Now that you have the good news, here's the bad news. It is an easy fix, it needs to be changed now, and it will have a big impact on the number of users using BTRFS. I never had a problem with BTRFS until the other day, and it came with a 2-disk RAID 1 setup. I was attempting to replace a single disk out of a 2-disk array, which I found out is nearly impossible without multiple balances and/or just yanking the drive followed by a rebuild. So why is BTRFS the best file system for developers? ITS ONLINE ABILITIES. It just works; it does what your commands tell it to do... until you give it the simple command of removing a disk from a minimal RAID 1 setup (BYE BYE ONLINE ABILITIES). You are presented with a message that you are not allowed to go below 2 disks in a RAID 1 setup. So I thought about it for a second: does this make sense? NO IT DOES NOT. As we know, a RAID 1 is a simple mirror of its fellow member; there is no reason why, as soon as you issue a command to remove the disk, it should argue with you about something that is easily achievable. This leads me to believe the developers are attempting to add a fail-safe to protect users against themselves. A silent rebalance of data could be done before the remove process, giving the last remaining disk priority so that it ends up holding both metadata and data, followed by the device actually being removed. Because this is what btrfs device remove and replace do: IT JUST WORKS, and I have used it many, many times. Everything I do with BTRFS just works >>>--- unless there is a specific reason for it not to work ---<<<. But in this case there is no reason a user should be told they can't do something that is possible. What do most people do when replacing a drive in a small disk array, and why are they doing it? They want to replace the disks because they need to upgrade, or because a disk failed. What if BTRFS is running on a 2-bay-only NAS? The user is now looking at either a forced reboot or breaking the array, and if they do it before a balance they could be in big trouble.
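For concreteness, a minimal sketch of the sequence being described, assuming a 2-disk RAID 1 mounted at /mnt with illustrative device names /dev/sda and /dev/sdb:

```
# attempt to drop one of the two raid1 devices online
btrfs device remove /dev/sdb /mnt
# -> refused with an error along the lines of "unable to go below two devices on raid1"

# the usual workaround is to convert profiles first (a full balance), then remove
btrfs balance start -dconvert=single,soft -mconvert=dup,soft /mnt
btrfs device remove /dev/sdb /mnt
# on a nearly full filesystem, the balance step above is exactly where
# ENOSPC / read-only trouble can appear
```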

• AUTOMATIC PAUSE OF BALANCE AT 1% REMAINING DISK, AND THE 2-DISK DICTATORSHIP. Before telling users they can't remove a disk, for their own protection, when they most likely need to upgrade the space on the 2-disk array they have, warn them that converting to a single- or dup-based profile without extra care can lead to a READ-ONLY FILE SYSTEM WHILE THE DISK IS FULL. Making these 2 changes immediately would have a large impact on BTRFS growth and its outlook. A large portion of end users are on 2-disk RAID setups, as most people using RAID 1 are usually on a 2-disk system anyway. The entire advantage of BTRFS is that its online abilities are something ZFS cannot even match, which makes it the best file system for hot-swapping drives, changing profiles, rebalancing disks or transparent file compression.

• PUT ZFS IN THE GRAVE WITH LZ4-BASED RAM DISK CACHING, AND ADD LZ4 TO THE COMPRESSION ALGORITHMS FOR STANDARD FS USAGE. The only reason I would choose ZFS over BTRFS is its ARC caching, which from my understanding is like strapping a RAM disk to the entire array to use as a cache, but without actually adding a RAM disk. In specific workloads BTRFS lags BEHIND ZFS badly and I have been tempted to make the switch. The online capabilities and what I have seen BTRFS survive are what make me stick with it.

• AUTOMATIC PAUSE OF BALANCE AT 1% REMAINING DISK - This is more important than telling the user it's too risky to remove one of the last two disks in a RAID 1, and it is where most of the BTRFS negativity is coming from. In fact, I believe most users running BTRFS at an end-user level in small RAID setups will hit this issue at least once. They will be presented with nightmares that leave them hating BTRFS: restricting the user over a small risk while allowing a complete catastrophe to happen when they are forced to rebalance to a new RAID type BECAUSE they are TOLD they CAN'T remove one of the last RAID devices. IF THEY ARE DOING THIS TO UPGRADE SPACE THEY WILL HIT A READ-ONLY FILE SYSTEM WHILE FULL.

• THE END USER AND THE DANGER OF DUP: ROFS WHEN FULL. This is a nightmare scenario that I have witnessed first-hand and it needs to be resolved. End users run this software too, frequently on 2-disk-only RAID 1s. If they are upgrading a 2-disk RAID 1 because of space concerns they are already in a RED ALERT danger zone, ESPECIALLY IF THEY CHOOSE DUP. Chances are they will then attempt to do things the safe way by balancing to a new RAID type in order to then remove the disk. If they have not been through all of the man pages and/or BTRFS documentation they will then be presented with a locked system: storage full and read-only.

Google "BTRFS balance read only fs" - Id suspect most users are making these post because they are being redirected from a simple denial of a raid 1 disk removal to an attempted balance with DUP and maybe even single. Users are warned about small issues before command execution and preventatives to ensure data integrity. Being redirected from a small denial as per safety prevention to a possible ROFS while full because they attempted to rebalance their 2 disk raid............. that was most likely running out of space already.... attempting to be able to remove a disk..... See where this is going? This is a very likely scenario and its happening all the time according to search results. Their needs to be a 1% space left balance pause before telling the user its not safe to remove 1 of the 2 disks in a raid 1 that is pointing them in an even WORSE possible direction. This is EXTREMELY contradictory. BTRFS is an amazing file system but I think this needs to be resolved.

• GO BTRFS! I LOVE YOU GUYS. (RAID1C3 AND RAID1C4 ARE AWESOME.) Thank you for all your hard work on BTRFS; it is truly the filesystem of the future. I felt I had to strongly express my opinion about the denial of command execution for 2-disk RAID 1. Don't take it to heart, I love BTRFS and I couldn't do what you guys do. I have my personal server sitting on some bcache RAM disks until you finally add custom RAM-backed caching. This is just honest insight. Don't hate me... or come to my house while I am sleeping at night. My wife is dangerous.

adam900710 commented 2 years ago

Thanks for enjoying btrfs first.

Then for the problem you mentioned.

> You are presented with a message that you are not allowed to go below 2 disks in a raid 1 setup.

Yeah, that's the pain point. Btrfs RAID really cares about the guaranteed ability to lose disks. RAID1 without the ability to lose any disk is really SINGLE.

There are some ideas about making conversion from RAID1 to single much faster (since we can just mark the chunks SINGLE, get rid of the extra copies and call it a day).

Would an almost-instant convert from RAID1 to SINGLE meet your need in this case? If so, we may want to go down that path in the future (especially considering it's really not that complex to do).

> • PUT ZFS IN THE GRAVE WITH LZ4 BASED RAM DISK CACHING AND ADD LZ4 TO THE COMPRESSION ALGORITHM FOR STANDARD FS USAGE.

This sounds like ZCache, which should work for all page caches in Linux. Unfortunately it is not yet upstream, but when it gets into upstream, every fs will benefit from it.

kdave commented 2 years ago

Thanks for the write-up; lots of points, some of them familiar. I'll try to answer where I know the answer, or at least add a todo item.

fixapc commented 2 years ago

The balancing functionality, RAID profile changing and a sort of "hot device add" are what make BTRFS a no-brainer.

I just think a slightly different approach should be taken with btrfs device remove. A user should be allowed, or at least asked, whether they wish to degrade the array by removing the device, instead of being forced to balance to a different RAID profile, which in certain scenarios may lock them into an out-of-space, read-only file system. It is more environmentally friendly and saves a lot of unnecessary data movement. Removing a device seems like it should be a very simple and lightweight code path: it should be geared towards "just removing the device" and letting the FS be prepared for it. The wording of btrfs device delete suggests there is a lot going on, and it seems somewhat at odds with btrfs device remove in terms of how the operation behaves. These risks are statistically provable in low-disk-count environments, especially during the learning phase of BTRFS. Allowing a drop into a degraded mount would save a lot of time for a lot of users, so long as they know the risks beforehand. BTRFS is already geared towards users making file system changes in live environments. I think this would help in many ways :) It would also GREATLY help online efficiency in various scenarios.

>>> Think of the 2-bay low-profile NAS sitting in that guy's modern home with his PHAT 26TB WD 3.5" drives. He might come looking for you... <<<

I understand there are some risks that go along with this, especially if the data/metadata are not balanced correctly. But I don't see the point in not allowing the user to drop down to the redundancy level supported by their RAID level, so long as they are properly balanced.

If someone needs to upgrade a 2-bay storage unit, this would be a lifesaver.

• Fail-safe pause at 1% remaining disk space during balancing.

• Simplification of btrfs device remove would be AMAZING. Or add a BTRFS HOT REMOVE feature that drops the RAID into degraded mode and prevents unnecessary movement of data.
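For reference, a minimal sketch of the closest thing available today to such a "hot remove", assuming a 2-disk RAID 1 with /dev/sda kept and /dev/sdb being dropped (device names are illustrative); it needs an unmount, so it is not truly online:

```
umount /mnt
# physically disconnect /dev/sdb, then mount the remaining disk degraded
mount -o degraded /dev/sda /mnt

# rewrite chunks onto the remaining disk with reduced profiles
btrfs balance start -dconvert=single,soft -mconvert=dup,soft /mnt

# finally drop the absent device from the filesystem
btrfs device remove missing /mnt
```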

Best Of Luck and cheers!!!!

adam900710 commented 2 years ago

> A user should be allowed or be asked if they wish to degrade the device by removing it instead of a forced balanced to a different raid profile

Exactly what I'm proposing: an instant convert/degrade instead of reading and rewriting the chunks.

But there are some blockers.

First, that only works for RAID1-based/-like profiles (including DUP, RAID1, RAID1C3/4, RAID10). For other profiles, like RAID0 and RAID56, it won't work and still requires the old-fashioned behavior.

If we only focus on RAID1-based/-like profiles, then it's indeed much easier, and we only need to worry about the interface.

Thankfully, in the RM_DEV_V2 ioctl we can add extra flags to do a quick degrade-style removal where possible.

In that case, I can try a prototype that does degrade removal when possible.

fixapc commented 2 years ago

Nice! Well, I will be happily awaiting the update. The online functionality for lower drive counts, together with the balancing features, makes this a better option than ZFS for data redundancy at low drive counts; at least that would be my opinion. So I can't wait to see this implemented. I am running BTRFS on multiple servers at work, at home and on some other side projects. Anyway, the only problems I seem to have run into are related to being a bit abusive with balance and RAID profile switching, which can lead to a broken filesystem, but I can do another post on that later. I CAN'T WAIT to see this implemented. How long do you think it will take?

leszekdubiel commented 1 year ago

I've been using Btrfs for years now. On some servers I use RAID 1 made of two disks only.

So this is a perfect solution:

> If we only focus on RAID1 based/like profiles, then it's indeed much easier, and we only need to bother about the interface.

I would like to be able to tell Btrfs: "Hey! One of the disks has failed. You have only one disk left; forget about RAID 1 now and work in single mode from now on...".

PrplHaz4 commented 1 year ago

Having just run into this recently, and having had a terrible time finding reliable information about doing an in-place upgrade on a 2-disk raid1 array: just acknowledging this "gap" and providing explicit documentation for this use case would be a tremendous step forward.

Zygo commented 1 year ago

The important problem here is that btrfs currently does not provide any reliable, safe path to move from a 2-disk raid1 to a single-disk filesystem with dup metadata. The requirements are:

Starting from a 2-device raid1:

  1. We want device 2 removed and device 1 kept
  2. If device 2 fails at any time during the removal process, or if it has already failed before we decide to remove it, we want the filesystem and all data to stay intact (i.e. we maintain the initial raid1 guarantee). We have explicitly chosen to make a single-device filesystem on device 1, so if device 1 fails, it takes all the data with it, but that's what we asked for.
  3. We want the number of metadata copies to be 2 at all times, i.e. we want to go directly from raid1 profile on 2 devices to dup on device 1 only, and we never want dup metadata on device 2.
  4. We want continuous filesystem service to applications while this happens

Here are some methods you might find online that don't work:

  1. btrfs device remove 2 ... doesn't know how to change profiles, only the min_devs restrictions for each one
  2. btrfs balance start -dconvert=single,soft -mconvert=dup,soft ... is a necessary component of a working solution, but violates requirements 2 and 3 by itself
  3. echo 1 > /sys/block/$devid_2/device/delete; btrfs balance start ... violates requirement 3 and requires sysfs support from the block device
  4. umount, disconnect device, mount -o degraded ...; btrfs balance start ... violates requirements 3 and 4

It is currently possible to write a python-btrfs script that gets very close to meeting the requirements:

  1. resize device 2 to the point where it has no unallocated space (i.e. it's completely full; this is an oversimplification, as a real implementation needs to ensure there is some unallocated space to avoid ENOSPC, but the size and position of that unallocated space matter)
  2. verify there is sufficient space on device 1 for dup metadata, restore device sizes and abort if not
  3. convert the last chunk on device 2 to single or dup, depending on whether that chunk is a data or a system/metadata chunk, respectively. Since device 2 is completely full, the chunk allocator has no choice but to place the new data or metadata on device 1.
  4. resize device 2 so that it ends at the last block group on device 2, which removes the unallocated space that was created as a result of step 3
  5. repeat steps 2-4 until device 2 is empty
  6. remove device 2

but that's still not a complete solution. Data allocations cannot be allowed to use device 2, which means the filesystem is effectively full during the entire removal process, potentially violating requirement 4. If an application deletes enough data, it could create an unallocated hole on device 2 at a location other than the highest offset on the device, which might be filled by single data (violating requirement 2) or dup metadata (violating requirement 3). If you have to downsize a raid1 today, and you can ensure no applications are writing to the filesystem during the conversion, then this is the best way to do it without kernel patches.
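A rough shell sketch of that loop, just to make the shape concrete; this is not production code. It assumes the filesystem is mounted at /mnt, devid 2 is the device being drained, and two hypothetical helpers: alloc_end_on_dev2 (end offset of the highest dev extent on devid 2, which a real script would get from python-btrfs or dump-tree parsing) and dev2_is_empty.

```
MNT=/mnt

# dev2_is_empty and alloc_end_on_dev2 are hypothetical helpers (see above)
while ! dev2_is_empty "$MNT"; do
    # steps 1 and 4: shrink devid 2 so it ends right after its last block group,
    # leaving it without unallocated space the chunk allocator could use
    btrfs filesystem resize "2:$(alloc_end_on_dev2 "$MNT")" "$MNT"

    # step 2 (verifying devid 1 has room for the relocated chunk) is omitted here

    # step 3: relocate one chunk that has a stripe on devid 2; since devid 2 is
    # full, the converted copy can only land on devid 1. A real implementation
    # must target the *highest* chunk on devid 2 (vrange/drange filters);
    # limit=1 here is a simplification.
    btrfs balance start -ddevid=2,convert=single,soft,limit=1 \
                        -mdevid=2,convert=dup,soft,limit=1 "$MNT"
done

# step 6: devid 2 is now empty, remove it
btrfs device remove 2 "$MNT"
```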

Internally, btrfs can set a device's size to zero, which prevents new allocation on the device (this is part of what btrfs device delete already does). Indeed that would be a generally useful device delete extension: simply mark a device's size as zero, so that no further allocations happen there, but do nothing else, allowing the user to run further commands to move the data between devices (technically this would be a fi resize feature, since it may be useful to set a no-allocations-above size on multiple devices simultaneously).

With https://github.com/btrfs/btrfs-todo/issues/19 (particularly the "none-only" extension near the bottom) we can forbid new data or metadata from appearing on device 2. Then we can do a normal btrfs balance start -dconvert=single,soft -mconvert=dup,soft and the balance will do all the necessary conversion in a single step. This approach also works for more complex conversions, like e.g. converting a raid6/raid1c4 filesystem into a raid10/raid1c3 filesystem by deleting 5 out of 8 disks in a single pass.

Indeed both methods could be implemented independently of each other. The allocation preferences patch is still somewhat controversial, while the set-device-size extension has a much smaller scope and no requirement for on-disk persistence. Previous versions of device remove had exactly that behavior as a bug, so it's already tested and known to work. All we'd have to do is create an interface to trigger the old bug intentionally, and relabel the behavior as a feature.

Other notes:

leszekdubiel commented 1 year ago

> Starting from a 2-device raid1:

> 1. We want device 2 removed and device 1 kept

Yes. :)

> 2. If device 2 fails at any time during the removal process, or if it has already failed before we decide to remove it, we want the filesystem and all data to stay intact

Yes. That's our goal.

> i.e. we maintain the initial raid1 guarantee).

No. No more raid1 guarantee.

There were 2 devices, 1 of them failed, so there is no more raid1. We have only one physical device left, and we don't have any raid.

Btrfs should go for single mode and not make the problem bigger. Just use the one device left, and the second one should be immediately kicked out of the array.

> We have explicitly chosen to make a single-device filesystem on device 1, so if device 1 fails, it takes all the data with it, but that's what we asked for.

Exactly.

> 3. We want the number of metadata copies be 2 at all times, i.e. we want to go directly from `raid1` profile on 2 devices to `dup` on device 1 only, and we never want `dup` metadata on device 2.

I would like Btrfs not to make the problem bigger.

I don't need two copies of metadata. I don't care if Btrfs wants raid1 or more space.

Just go single and work as normal.

> 4. We want continuous filesystem service to applications while this happens

Yes. That's the goal.

  1. First we had two disks in raid1. We felt secure.

  2. One of the disks failed, so we are no longer secure. (*)

  3. If the last drive fails, then we lose all data.

(*) We are no longer secure, so we don't need metadata dup.

All we need from Btrfs is to allow reading/writing data using the one disk that is still operational.

This should be the same with RAID 1c4: four disks, four copies of data. If one of the disks fails, then Btrfs goes to 1c3. The admin knows that security has degraded and he has only 3 instead of 4 copies.

If another disk fails, then Btrfs goes to 1c2 immediately. The admin knows he has only 2 copies left, but users and applications can work as normal.
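Today the closest manual equivalent is a degraded mount followed by an explicit down-conversion; a minimal sketch, assuming a raid1c4 filesystem mounted at /mnt with /dev/sda still healthy (device names are illustrative):

```
# mount read-write with one device missing
mount -o degraded /dev/sda /mnt

# drop the redundancy level by one step (raid1c4 -> raid1c3) for data and metadata
btrfs balance start -dconvert=raid1c3 -mconvert=raid1c3 /mnt

# detach the failed (absent) device from the filesystem
btrfs device remove missing /mnt
```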

My case study.

There is a server: 3 disks, raid1 (normal raid, 2 copies of data). One of the disks is failing (16 uncorrectable sectors). What can I do?

I can't kick this disk out of the array, because I would have to run a balance. Balance takes a long time, and Btrfs would be stuck (it will not allow snapshotting, which breaks my backups).

So even though I have 3 disks, my only solution is to go for backup: snapshot, send/receive, and start over with another set of disks.
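For what it's worth, if a spare slot (or a temporary port) is available, the existing replace command sidesteps a full balance in this kind of case; a minimal sketch with hypothetical device names, assuming the failing disk is devid 3 and the new disk is /dev/sdd:

```
# copy the failing device's data onto the new disk, preferring the healthy
# mirror copies where they exist (-r), without rewriting the rest of the array
btrfs replace start -r 3 /dev/sdd /mnt
btrfs replace status /mnt
```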