Add documentation describing DR situations

schakrava commented 10 years ago

A useful comment from HN

My very first questions regarding a potential storage solution revolve around data loss:

1. Can we enumerate the data loss scenarios?
2. How is drive failure handled?
3. How may data be corrupted and such corruption detected?
4. For every data loss scenario, what is the recovery procedure?

Here is all I could find: http://rockstor.com/docs/faq.html#how-do-i-prevent-data-loss...

Of course, there is a wealth of information on such questions for standard RAID, but I would suggest for marketing purposes that rockstor synthesize available information (from the many relevant layers of data management) in a coherent fashion, specific to their product. It doesn't have to be deep, but it should be at least minimally comprehensive and broad, with pointers to more detailed, layer-specific information.

Also, it's fine if the recovery scenario is "restore from backup" for e.g. the scenario where data is deleted by an authorized user. If so, there should be at least a minimal "backup story".

rickhull commented 9 years ago

I wrote the original HN comment. Thanks for making this ticket for me. Just checking in, still interested in a general sense. Cheers

schakrava commented 9 years ago

@rickhull Thanks for checking in.

We continue to test DR scenarios and have made progress since this issue was filed. Our findings led to updates in the product documentation at http://rockstor.com/docs/data_loss.html, in part due to your feedback, so thank you.

The core problem is that the kernel currently shipped with Rockstor (the default CentOS kernel) is quite old as far as advances in BTRFS are concerned. We've been testing the latest kernels from elrepo and, as of 3.17, pool rebalance times are painfully long. We hope that as our understanding of DR scenarios and our experiments with DR testing improve, and with the next kernel, we can expect consistent and acceptable behavior. When that happens, we'll ship Rockstor with a later kernel from elrepo and provide more precise documentation.

For these reasons, I still consider this work far from over. But we are fully aware of the gap and would like to provide better documentation as soon as possible. For now, though, please do read http://rockstor.com/docs/data_loss.html and provide your feedback.

phillxnet commented 7 years ago

This issue was written prior to our now-standard use of the elrepo kernel, i.e. 4.10 at the time of writing, which is now fairly regularly updated.

Linking to a related issue regarding pending improvements to our "Data loss Prevention and Recovery in Rockstor" section: #167 "Improve loss prevention and recovery section", which specifies a particular shortcoming.

We have had additional improvements to this section in the meantime, i.e. PR #143 "Rewrite raid56 recovery section to correct errors and remove redundan…", PR #147 "update Data loss Prevention with stronger raid5/6 warning", and #162 "Fix typo and minimum number of drives for raid 5/6".

phillxnet commented 2 years ago

Closing, as this issue has received no further attention, and with reference to the improvements indicated by @schakrava and those indicated in my last comment here 4 years ago.

N.B. we also now have Web-UI header warnings and in-Web-UI 'wizards' for disk removal / degraded remount, covering scenarios where a disk has errors or is missing.

Closing as part of a repo clean-up.

rockstor / rockstor-doc

Add documentation describing DR situations #37