Open · guestisp opened this issue 6 years ago
I have seen LizardFS mark a disk as failed on 2.6.0. Not sure if replication on a marked-for-removal disk works. I never use mark-for-removal; I just take the failed disk offline and then it replicates for sure. Not sure how modern versions behave; 2.6 did not actively resolve a fault, but sometimes you might not want it to... It has never really failed during any component fault.
I agree, but if this feature is not needed it could be disabled by setting the limit to 0, so that in case of multiple I/O errors disks don't get kicked out.
In some cases, proactive disk replacement is useful. If you have 4 TB disks without RAID, it is much safer (and faster) to replace a failing disk before it fails completely. The replication can then read from all replicas (even from the not-yet-failed disk) and not only from the other replicas.
For example: you have replica 3 with 4 TB disks. disk0 starts triggering I/O errors, so it is failing (and will fail shortly), but it is currently still working. You have to replicate 4 TB. With the current behaviour, you have to manually check whether the disk is triggering I/O errors, or wait for a total failure. In case of total failure, the only way to replicate the data is over the network from the other 2 working disks.
With my proposal, you can also replicate from the not-yet-failed disk, thus reading from 3 disks (and 3 servers) at once instead of only 2. This is much faster.
There is another huge advantage (even bigger for replica-2 clusters): replacing a not-yet-failed disk gives you more time to prevent issues before they arise and keeps the time spent with undergoal chunks to a minimum, because you are migrating the existing chunks out without losing any. You won't be in undergoal, but in overgoal. That's a huge difference.
Another big related feature would be native integration with SMART. If SMART triggers a failure prediction (the disk not passing the SMART health test), an automatic replacement could be started for the same reason: you would replace with overgoal chunks, not undergoal or endangered ones.
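Just to sketch what I mean (nothing of this exists in LizardFS today; smartctl comes from smartmontools, the parsing below only covers the ATA-style output, and the mark-for-removal hook mentioned at the end is hypothetical):

```cpp
// Minimal sketch: poll SMART health via smartctl and report whether a device
// predicts failure. A real integration would more likely talk to smartd or
// libatasmart instead of shelling out.
#include <cstdio>
#include <string>

// Returns true if `smartctl -H` reports anything other than PASSED.
// `device` is e.g. "/dev/sda"; requires smartmontools and root privileges.
bool smartPredictsFailure(const std::string& device) {
    std::string cmd = "smartctl -H " + device + " 2>/dev/null";
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) {
        return false;  // cannot tell; do not flag the disk on tooling errors
    }
    char line[256];
    bool failing = false;
    while (fgets(line, sizeof(line), pipe)) {
        std::string s(line);
        // ATA-style output: "SMART overall-health self-assessment test result: PASSED"
        if (s.find("self-assessment test result") != std::string::npos) {
            failing = (s.find("PASSED") == std::string::npos);
        }
    }
    pclose(pipe);
    return failing;
}

// If this returns true, the chunkserver could call a (hypothetical)
// mark-for-removal hook for the folder backed by that device, so chunks
// migrate away while the disk is still readable.
```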
This is the same as what most software RAID does (mdadm, ZFS) when you have to replace a disk. The redundancy is never lost; on the contrary, for the whole operation you have increased redundancy. Even some modern RAID controllers do the same (honestly, I don't remember the exact feature name at the moment).
I was not arguing against your proposal, just offering some related observations.
Anyway, this is a trivial patch to backport from MooseFS. I hope some dev will take care of it.
I've looked at the code and my biggest concern is how to store the microtime on every I/O error. Everything else is just a counter and a comparison. Very trivial.
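If it helps, storing the microtime doesn't need much: a small per-folder ring buffer of timestamps is enough to count errors inside a time window. Rough sketch (illustrative names only, not the real hddspacemgr structures):

```cpp
// One way to record the time of recent I/O errors per folder (hdd directory).
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLastErrSize = 30;  // how many recent errors to remember

struct FolderErrorHistory {
    std::array<uint32_t, kLastErrSize> errorTime{};  // unix timestamps, 0 = empty
    std::size_t nextIndex = 0;

    // Overwrite the oldest slot with the timestamp of the new error.
    void recordError(uint32_t now) {
        errorTime[nextIndex] = now;
        nextIndex = (nextIndex + 1) % kLastErrSize;
    }

    // Count errors that happened within the last `windowSeconds`.
    std::size_t errorsInWindow(uint32_t now, uint32_t windowSeconds) const {
        std::size_t count = 0;
        for (uint32_t t : errorTime) {
            if (t != 0 && now - t <= windowSeconds) {
                ++count;
            }
        }
        return count;
    }
};
```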
I've seen that this is already partially supported. The counter is already there: https://github.com/lizardfs/lizardfs/blob/master/src/chunkserver/hddspacemgr.cc#L980
The main difference is that Lizard sets the folder as damaged: https://github.com/lizardfs/lizardfs/blob/master/src/chunkserver/hddspacemgr.cc#L983
while MooseFS sets the folder as to-be-removed: https://github.com/moosefs/moosefs/blob/master/mfschunkserver/hddspacemgr.c#L1771
Does anyone know what "folder" is in Lizard, and the meaning of the todelete and toremove properties?
MooseFS has a nice HDD check: after X I/O errors in Y seconds, the disk is flagged as bad and marked for removal, triggering replication.
This is very, very cool, as you don't have to wait for a total failure before migrating data away. Given that Lizard works great without any RAID, preventing a catastrophic disk failure could save your day.
It shouldn't be that hard to backport from MooseFS (for a C++ programmer): https://github.com/moosefs/moosefs/blob/5c518e96571d285b3155665152b08b68d8101354/mfschunkserver/hddspacemgr.c#L1767
It's just a matter of a simple check on a counter.
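For anyone picking this up, the whole check boils down to something like the following (a rough sketch with made-up names, not the actual LizardFS or MooseFS code; note that an error limit of 0 would disable the behaviour, as suggested above):

```cpp
// Error-threshold check, roughly mirroring the MooseFS idea: if at least
// `errorLimit` I/O errors happen within `windowSeconds`, the folder (disk)
// gets flagged so its chunks are replicated away proactively.
#include <cstdint>
#include <deque>

struct FolderState {
    std::deque<uint32_t> recentErrorTimes;  // unix timestamps of recent I/O errors
    bool markedForRemoval = false;
};

void onIoError(FolderState& folder, uint32_t now,
               uint32_t errorLimit, uint32_t windowSeconds) {
    if (errorLimit == 0) {
        return;  // limit 0 disables the check, as proposed earlier in this thread
    }
    folder.recentErrorTimes.push_back(now);
    // Drop errors that fell out of the time window.
    while (!folder.recentErrorTimes.empty() &&
           now - folder.recentErrorTimes.front() > windowSeconds) {
        folder.recentErrorTimes.pop_front();
    }
    if (folder.recentErrorTimes.size() >= errorLimit && !folder.markedForRemoval) {
        folder.markedForRemoval = true;
        // Here the chunkserver would report the folder as "to be removed" to
        // the master, triggering replication from this still-readable disk.
    }
}
```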