POC: Handling failover with remote segment storage

sachinpkale commented 2 years ago

The way we store segments in remote segment storage (Feature Proposal) depends on how we handle failover.

In document based replication, segments are created separately on primary and replicas and the process is not in sync. This means, number of segments can be different. Also, if we inspect the segments for primary and replica, there can be a difference and it depends on when the segment was created on the given node and translog checkpoint at that time. This does not mean data will be inconsistent. With the help of translog, the end state still remains same.

As there isn’t a consistent state of segments across primary and replicas, when primary goes down and one of the replicas become primary, the segments in the remote store and segments in the primary will differ. Once new primary starts uploading new segments to the remote store, we need to make sure that a consistent state is maintained. This becomes tricky once segment merge happens at the new primary and older segments need to be deleted.

Goal of this POC is to list down potential approaches to handle the failover and recommend one based on pros and cons. This failover approach will dictate the overall design of how segments will be stored.

dblock commented 2 years ago

Tagging @mch2 @kartg @andrross for feedback!

andrross commented 2 years ago

Sort of thinking out loud here, but I see at least two possible broad approaches. At the time of failover, the new primary:

Reconciles its local segment state with the state that is in the remote store. As long as the new primary has all the documents newer than the remote store checkpoint in its translog, it should be possible to get to a consistent state with no data loss.
Reconciles the remote storage to match its local segment state. This might result in the remote store checkpoint going backward in logical time if the new primary's local segment state happens to be older than what was pushed to the remote store by the previous primary.

Are there any other options that you're thinking about here? I think both of these approaches will potentially make failover a more heavyweight operation than it currently is.

sachinpkale commented 2 years ago

@andrross Yes, these are two of the potential approaches. Let me list down all the potential approaches below.

sachinpkale commented 2 years ago

Potential Approaches

As mentioned in the description of the issue, these approaches are to handle failover case with document based replication. For Segment based replication, it would be comparatively easy as segment files do not change between primary and replicas (we still would want to have the same approach work for both the replication strategies).

Sync all data from remote store to new primary before failover is complete

Even though this approach guarantees to have consistent state, we can’t control time required in failover.
Given that failover today is almost immediate, going with this approach is not feasible.
Sync all data from new primary to remote store before failover is complete
This is similar to previous one but data flows from node to remote store.
Even though this approach guarantees to have consistent state, we can’t control time required in failover.
Given that failover today is almost immediate, going with this approach is not feasible.
Sync all data from new primary to remote store in the background, commit only after data sync completes
This is a variation of previous approach but will not impact failover time.
As translog will grow until upload of all the segments from new primary completes, we need to check the performance impact of larger translog (This may need separate smaller POC)
One other variant of this approach would be to sync with remote translog. Commits will be triggered in the same way as they are triggered today on the new primary. Local translog on new primary will be purged on these commits. But the remote translog will not be purged until the segment upload completes.
Every node keeps keeps own copy in remote store
Not a feasible approach as number of copies of data will grow with number of replicas and we will not have any control on it.
Keeping 3 copied of data from 3 different nodes
Variation of the above approach. We always keep 3 copies of data - 1 primary and 2 replicas (less if shard has < 2 replicas)
This will keep number of copies in remote store to a fixed number but would still be higher than ideal 1 copy.
Also, to manage quorum of nodes having copy in remote would itself be a complex problem. What happens if failover happens from node N1 -> N2 -> N3 -> N4 almost instantly and N4 did not have a copy in remote store?
Keep a completely different copy of segments on the remote store
A separate process (on a data node or a completely different node) consumes remote translog, creates segments and uploads to remote segment store.
This way, we don't have to sync segments from node to remote store and failover would be seamless.
As segment creation is a heavy process, we are adding extra compute load on the existing cluster.
Incremental upload only new segments of new primary to remote store
This is the most efficient (in terms of time, compute and storage) approach and ideally be the first option. But having different segment files makes it tricky (or impossible?) to identify the diff and upload only the diff to remote store.
In this POC, we will anyhway check what happens if we try to achieve this.

sachinpkale commented 2 years ago

Next steps:

Quick check to validate some of the potential approaches
Add a table with Pros and Cons of potential approaches
Recommend an approach.

andrross commented 2 years ago

For Segment based replication, it would be comparatively easy as segment files do not change between primary and replicas (we still would want to have the same approach work for both the replication strategies)

For the sake of argument, is it truly a requirement that remote storage needs to work with both replication strategies, particularly for the initial version? It seems like a whole bunch of complexity could be avoided if this feature was only supported with segment replication. Just from a usability perspective these options result in performance and cost tradeoffs to be considered when enabling remote storage with document replication that can be avoided if segment replication is used.

sachinpkale commented 2 years ago

This POC will help us in the design of remote store and IMO remote store design should not be tied to a replication strategy.

From release point of view, I agree with you. We can think of having an experimental first release of remote store limited to Segment Replication. But interfaces should be defined in such a way that supporting document based replication will be an extension.

kartg commented 2 years ago

Thank you for the thorough walkthrough of the various approaches above!

IMO, we shouldn't rule the two "sync all data" approaches (the first two options listed above) from the get-go, especially considering that they are the most straight-forward to implement. I'm sure the use of remote segment storage will introduce a number of other tradeoffs, so it would be worth it to consider the impact of failover latency alongside those.

Building on what @andrross said above, I don't think we can integrate remote segment storage alongside document/segment replication. If the purpose of the remote segment store is to be the authoritative source of data and guard against data loss when nodes go down, then it must work in tandem with primaries and replicas. We would need a new replication strategy from document and segment replication - one where the primary processes documents and ultimately writes segments only to the remote store. Replica shards would simply pull data from the remote store and only serve to minimize failover time should the primary become unavailable. Thoughts?

kartg commented 2 years ago

One final, pie-in-the-sky thought 😄 Down the line, it would be worth considering the benefits of allowing the remote store to independently optimize its stored data rather than simply mirroring the primary. That way, we don't need to expend network bandwidth just to mirror the primary shard's process of slowly merging down to larger and larger segments.

andrross commented 2 years ago

@kartg makes a good point about a possible architecture where replicas pull from the remote store and essentially use the remote store as the replication method. There are a few variables here: remote segment store, remote translog, and replication method. It seems to me that there are a few permutations that probably don't make sense. For example, would you ever want to use document replication along with a remote translog? I would think not because a benefit of a remote translog is that you don't need to replicate the translog to replicas at all. Maybe I'm wrong about that, but it might be helpful to detail all the use cases being targeted here to help answer some of these questions and refine the design.

anasalkouz commented 2 years ago

I agree with @andrross, why remote storage should work with document replication? What is the benefit of doing this? I think we need to trade between the benefit of enabling this vs the complexity we will add.

sachinpkale commented 2 years ago

We are combining two things here: Durability and Replication. Durability should be achieved irrespective of what replication strategy we choose. Durability feature will make sure that there will not be any data loss when outage happens. Let me know if you disagree with this.

We would need a new replication strategy from document and segment replication - one where the primary processes documents and ultimately writes segments only to the remote store. Replica shards would simply pull data from the remote store and only serve to minimize failover time should the primary become unavailable.

@kartg This is a feature that can be built using remote segment store but again not related to durability feature.

it would be worth considering the benefits of allowing the remote store to independently optimize its stored data rather than simply mirroring the primary

Is it similar to Keep a completely different copy of segments on the remote store approach listed above?

would you ever want to use document replication along with a remote translog? I would think not because a benefit of a remote translog is that you don't need to replicate the translog to replicas at all

@andrross Yes, to provide durability guarantees, we need to use remote translog along with document replication. I am not sure if we are referring to the same concept when we say remote translog, but remote translog referred in Durability Feature Proposal is only for durability purpose. Document replication is still required for actual replication of data.

why remote storage should work with document replication? What is the benefit of doing this?

@anasalkouz It is required to provide durability guarantees.

kartg commented 2 years ago

Durability should be achieved irrespective of what replication strategy we choose.... .... @kartg This is a feature that can be built using remote segment store but again not related to durability feature.

Durability and replication are separate considerations as long as we're only changing how we make things durable, or how we're replicating. With remote storage, we're changing where we're making things durable which affects where we replicate from.

I think we have an opportunity to build a really efficient implementation if we work together to ensure that remote segment storage and segment replication play well with each other, rather than trying to build them completely independent of one another.

wdyt?

sachinpkale commented 2 years ago

Completely agree. Not just replication, we can integrate remote store with other constructs/features of OpenSearch (like snapshot). While designing the remote store, we have to make sure the use case is extensible. We have started drafting design proposal here: https://github.com/opensearch-project/OpenSearch/issues/2700

sachinpkale commented 2 years ago

Approach: Incremental upload only new segments of new primary to remote store

Segments are uploaded to the remote storage in directory format: cluster_UUID/primary_term/index_hash/shard_number/. Segments file (segments_N) per commit will be used to keep track of max sequence number that is committed. Processed local checkpoint is added in segments_N file as a part of commit data. Currently, OpenSearch keeps segments_N file only for the last successful commit and deletes the older ones Code Reference. In this approach, we will keep all the segments_N files. Check for impact on performance or scale. Can this result in reaching max open file limit earlier than current implementation? We will also upload segments_latest file which will point to the latest segments_N file.

A special primary_term_latest will be added under cluster_UUID/ which will hold the value of latest primary term. If primary fails, check the max sequence number of the commit in remote store, let’s call it as max_seq_remote. Check the max sequence number for the last commit in the new primary, let’s call it as max_seq_new_primary.

if max_seq_remote == max_seq_new_primary , all good, nothing to do.
if max_seq_remote > max_seq_new_primary , it means new primary’s translog has changes that are already on the remote store (this assumes active replica becoming new primary. For the cases, where replica is lagging, remote translog will cover the part of recovering the translog but this statement will still hold true). Once changes are committed on the new primary, new segments will be uploaded to remote storage. There will be no data loss. There will be data duplication for number of operations with sequence number [max_seq_new_primary, max_seq_remote].
if max_seq_remote < max_seq_new_primary, it means new primary has already committed changes which are not on the remote store. If we start uploading newly created segments from new primary, it will result in data loss. Operations between sequence number [max_seq_remote, max_seq_new_primary] will be lost. To avoid the data loss, before the replica is promoted to primary, we will go through segments_N files on replica in decreasing order till we reach condition number 2 ( max_seq_remote > max_seq_new_primary ). For all the segments_N files where max_seq_remote < max_seq_new_primary, we get segments files associated with the commit and upload them to remote store under new primary term.

Test case for restore and checking duplicates

3 nodes cluster - no dedicated master nodes

Indexing Flow (value of X can be changed as per the run)

Index X docs
Commit
Index X docs
Commit
Index X docs - Kill primary in between
Commit
Index X docs
Commit
Index X docs - Kill primary in between
Commit
Index X docs
Commit
Stop Commit flow
Check number of docs by OS query
Apply some aggregations as well

curl -X GET "node1:9200/my-index-2/_count?pretty"
{
  "count" : 54830,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

curl -X GET "node1:9200/my-index-2/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "my-agg-name": {
      "histogram": {
        "field": "field13", "interval": 10
      }
    }
  }, "size": 0
}
'
{
  "took" : 29,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my-agg-name" : {
      "buckets" : [
        {
          "key" : 10.0,
          "doc_count" : 9800
        },
        {
          "key" : 20.0,
          "doc_count" : 9694
        },
        {
          "key" : 30.0,
          "doc_count" : 9737
        },
        {
          "key" : 40.0,
          "doc_count" : 9962
        },
        {
          "key" : 50.0,
          "doc_count" : 9765
        },
        {
          "key" : 60.0,
          "doc_count" : 5872
        }
      ]
    }
  }
}

Restore Flow

Download segment files from S3
Restore incomplete segment files.
Stop ES process. Replace segment files on all the three nodes
Start ES process.
Check number of docs by ES query
Apply some aggregations as well

curl -X GET "node1:9200/my-index-2/_count?pretty"
{
  "count" : 73100,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

curl -X GET "node1:9200/my-index-2/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "my-agg-name": {
      "histogram": {
        "field": "field13", "interval": 10
      }
    }
  }, "size": 0
}
'
{
  "took" : 81,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my-agg-name" : {
      "buckets" : [
        {
          "key" : 10.0,
          "doc_count" : 13035
        },
        {
          "key" : 20.0,
          "doc_count" : 13020
        },
        {
          "key" : 30.0,
          "doc_count" : 12942
        },
        {
          "key" : 40.0,
          "doc_count" : 13333
        },
        {
          "key" : 50.0,
          "doc_count" : 12999
        },
        {
          "key" : 60.0,
          "doc_count" : 7771
        }
      ]
    }
  }
}

Conclusion

The incremental approach creates duplicates and Lucene/OpenSearch has no mechanism to avoid these duplicates.
Going with this approach will not be feasible as the behavior significantly changes from the existing behavior.

sachinpkale commented 2 years ago

Recommended Approach: Sync all data from new primary to remote store in the background, commit only after data sync completes

We can use a variant of this approach. We take help of remote translog in this approach. Commits will be triggered in the same way as they are triggered today on the new primary. Local translog on new primary will be purged on these commits.

But the remote translog will not be purged until the segment upload completes.

sachinpkale commented 2 years ago

Adding more details in the design proposal: https://github.com/opensearch-project/OpenSearch/issues/2700

opensearch-project / OpenSearch