Closed brettkettering closed 7 years ago
We have run the tests that fail a capacity unit and a pod. The system eventually recovers, but it ends up in a low-performance state because of the ZFS bug. We'll retest after implementing a workaround for the ZFS bug and see whether we can maintain decent performance. We also need to verify that the degraded-performance logs are being written correctly.
I have verified that the degraded write logs are functioning correctly. If a capacity unit is down (or goes down) during a write, all objects that hash to it are logged as degraded, and a subsequent run of the rebuilder will reconstruct the missing blocks.
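The behavior described above can be sketched in a few lines. This is an illustrative model only, not MarFS code: the names `cap_unit_for`, `write_object`, `rebuild`, and the MD5-based placement are assumptions standing in for MC's real hashing and rebuilder logic.

```python
# Hypothetical sketch of degraded-write logging: an object that hashes to a
# down capacity unit is recorded so a later rebuilder pass can reconstruct it.
# All names here are illustrative, not actual MarFS/MC APIs.
import hashlib

NUM_CAP_UNITS = 4  # assumed pod width for the sketch

def cap_unit_for(object_id: str) -> int:
    """Hash an object ID to a capacity unit (placeholder placement scheme)."""
    digest = hashlib.md5(object_id.encode()).hexdigest()
    return int(digest, 16) % NUM_CAP_UNITS

def write_object(object_id: str, down_units: set, degraded_log: list) -> None:
    """Log the object as degraded if its capacity unit is down."""
    if cap_unit_for(object_id) in down_units:
        degraded_log.append(object_id)

def rebuild(degraded_log: list) -> int:
    """A rebuilder pass reconstructs each logged object; returns the count."""
    rebuilt = len(degraded_log)
    degraded_log.clear()
    return rebuilt
```

A rebuilder run after the capacity unit returns would drain the log, reconstructing exactly the blocks that were written degraded.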
Will, Chris D., and George:
Our plan was to do the one catastrophic failure test that we haven't done: the power failure. Before all of these tests, we'll shut down Scality nicely and make sure we can bring back Multi-Component only. We'll hand-start Scality when we're ready to have it back. We'll start with an IPMI power-off of one of the storage servers. Later we'll try to power off everything in the testbed and bring it back.
What's our readiness to try this test?
Thanks, Brett
Brett -
I accidentally did part of this when debugging the storage nodes (IPMI power off of all storage nodes simultaneously). No issues bringing back Scality or MC from that. Unless this test includes the logger, supervisor, and master - in which case it might be messy to clean up the Scality side of things.
If MC/ZFS is the only thing being tested here, I’d say this has been done more than a few times (on other systems and this one).
I agree that this has been done. However, I would like to try again given the changes to libne that call fsync() every 50MB for each part file.
Kyle asked for this test as a way to mock up a lightning strike that takes out power to the SCC. Yes, what we really want to know is how MC comes back. So, we'd want to power off:
1) The master
2) The GPFS file system servers
3) The ZFS file system servers
4) The JBODs with all the drives in them
5) The FTAs
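One way to drive the all-at-once power-off for the servers listed above is to fan out `ipmitool chassis power off` to every BMC. The hostnames, username, and the choice to drive it from Python are assumptions for illustration (the JBOD enclosures may not expose IPMI at all); `ipmitool -I lanplus ... chassis power off` and the `-E` flag (password from the `IPMITOOL_PASSWORD` environment) are real ipmitool usage. A `dry_run` flag lets us inspect the commands without actually cutting power.

```python
# Hypothetical sketch of the simultaneous IPMI power-off.
# BMC hostnames below are placeholders, not the testbed's real names.
import shlex
import subprocess

# Ordered to match the list above; JBODs omitted since they are assumed
# to be powered through their enclosures rather than via IPMI.
BMC_HOSTS = ["master-bmc", "gpfs1-bmc", "zfs1-bmc", "zfs2-bmc", "fta1-bmc"]

def power_off_commands(hosts, user="admin"):
    """Build one ipmitool power-off command per BMC host."""
    return [
        ["ipmitool", "-I", "lanplus", "-H", host, "-U", user,
         "-E", "chassis", "power", "off"]
        for host in hosts
    ]

def run(hosts, dry_run=True):
    """Print (dry run) or execute the power-off commands; returns them."""
    cmds = power_off_commands(hosts)
    for cmd in cmds:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds
```

Running with `dry_run=True` first gives a reviewable list of exactly what will be hit, which seems worth having before an intentional lightning strike.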
That is a power hit to the whole system. It seems like you have done many, or even most, of the parts, but we haven't hit the whole thing at once.
Throughout the testing we've had many different types of server and disk failures, and we've been able to get the system back online each time.
MarFS Multi-Component Catastrophic Failure Tests.docx