opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

Easy and fluent disaster recovery (3.0?) #11894

Open sandervandegeijn opened 7 months ago

sandervandegeijn commented 7 months ago

Is your feature request related to a problem? Please describe

So your cluster has blown up, or the hardware has failed, or... It's unsalvageable and you have to rebuild the whole thing. Luckily you have made snapshots, so you should be able to restore it. Well, it's not really easy; you have to account for:

Even if you're quite familiar with OpenSearch it can be complex, especially when your production environment has gone up in smoke and everybody is looking at you to restore everything ASAP.

Point-in-time recovery of certain indices is quite easy through the UI, but a complete restore is not that easy (talking from experience here ;) ).

Describe the solution you'd like

A clear, consistent, and user-friendly flow that restores the whole cluster (including security settings, jobs, anomaly detectors, etc.) to the state of the snapshot you want to restore, without a ton of caveats to take into account or having to read the docs with all their exceptions.

The restore is preferably done from the UI and can be executed by junior to mid-level sysadmins/devs.

This would also enable us to phase out our custom component that provisions all the settings from a git repository, which exists because we don't trust being able to restore everything after a loss of the cluster.

Trust is key here: I need to be able to trust the environment to be recoverable with my eyes closed.

Related component

Storage:Snapshots

Describe alternatives you've considered

Scripting everything (which I did), but this requires quite some knowledge of OpenSearch and its internals. This should be as easy as possible, with a polished experience.
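For illustration only (this is not the script mentioned above), a minimal sketch of the kind of call such scripting revolves around: building a request for the snapshot restore REST endpoint, `POST /_snapshot/<repository>/<snapshot>/_restore`. The helper name `build_restore_request` and the repository and snapshot names are hypothetical, and a real full-cluster restore involves more steps than this single call.

```python
import json
from urllib.parse import quote


def build_restore_request(repository, snapshot, indices="*", include_global_state=True):
    """Return (method, path, body) for a snapshot restore call.

    Hypothetical helper: it only assembles the request and does not send it.
    """
    # URL-encode the path segments so names with spaces or slashes stay valid.
    path = f"/_snapshot/{quote(repository, safe='')}/{quote(snapshot, safe='')}/_restore"
    body = {
        "indices": indices,                       # index pattern(s) to restore
        "ignore_unavailable": True,               # skip indices missing from the snapshot
        "include_global_state": include_global_state,  # also restore cluster-wide state
        "include_aliases": True,                  # restore aliases with their indices
    }
    return "POST", path, body


if __name__ == "__main__":
    # "my-repo" and "nightly-snapshot" are placeholder names.
    method, path, body = build_restore_request("my-repo", "nightly-snapshot")
    print(method, path)
    print(json.dumps(body, indent=2))
```

Even this one call hides decisions (which indices to include, whether to restore global state) that the person doing the recovery has to get right under pressure.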

Additional context

No response

peternied commented 7 months ago

[Triage - attendees 1 2 3 4] Thanks for calling out this area of improvement and for the details on how it could work better.

reta commented 7 months ago

I believe it will come naturally when Remote Store (the writeable paths) is released: in that case the storage and compute layers are separated, so recovering a node (or even the whole cluster) should be as easy as pointing to the Remote Store location.

@andrross please correct me here.

Bukhtawar commented 7 months ago

The major caveat in auto-recovery is cases around network partitioning: we need to ensure we don't have an isolated writer acknowledging write requests while we auto-recover the shard data on some other node. This will lead to divergent writes if safety checks aren't in place. You might be interested in this issue, where there are plans to support this feature: https://github.com/opensearch-project/OpenSearch/issues/11921

andrross commented 7 months ago

@reta Yes, the plan with remote store is to enable automatic recovery in the case of any hardware failure, though as @Bukhtawar called out we're not quite there for all cases. However, as this issue documents, there are some significant pain points with snapshot-based disaster recovery where we could definitely make improvements.

sandervandegeijn commented 7 months ago

Yes, please. If I can help to review the plans, no problem. In all cases the recovery flow should be almost thoughtless and simple. If you're the one recovering the cluster while everyone stresses out around you, it should be as KISS as can be :)

linuxpi commented 3 months ago

[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 11 12]

@sandervandegeijn Thanks for opening the issue. With the recent Remote Store release we have simplified some parts of the process, and adding some documentation around the new improvements will help.

We need to plan for and list out other improvements that can be taken up. One of those was already pointed out by @Bukhtawar: https://github.com/opensearch-project/OpenSearch/issues/11921

We can debate more on whether such controls should be exposed via the UI.

sandervandegeijn commented 3 months ago

Great that there is progress. If I can help to review anything, no problem.