ashking94 opened this issue 1 year ago
This is a good proposal.
Cluster blocks are checked at the entry when a transport call is made. For remote store cases, we must ensure that there is a check at the end of the call as well, so that if auto restore has happened, it does not acknowledge any upload that started long ago.
Is checking for the cluster block sufficient? What if the block is removed by the time the response is returned?
The auto restore time should be higher than the leader checker time threshold.
How will you achieve this?
Your proposal assumes that the leader checker is also a strict leasing system, in addition to fault detection. By design, it seems so. But I am not sure if the implementation was done with this guarantee in mind. Is there any use case where we use the leader checker as a leasing system?
Cluster blocks are checked at the entry when a transport call is made. For remote store cases, we must ensure that there is a check at the end of the call as well, so that if auto restore has happened, it does not acknowledge any upload that started long ago.
Is checking for the cluster block sufficient? What if the block is removed by the time the response is returned?
This check is different: it uses the latest successful leader check time to determine whether to acknowledge the request or fail it, and it is not the same as the cluster block. The cluster block trigger is reactive in nature, and GC pauses can lead to the cluster block not even getting applied. Hence we need to rely on the latest successful leader check time.
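To make this concrete, here is a minimal sketch of what such an end-of-call check could look like. All names here (LeaseCheckGuard, ensureLeaseNotExpired, the threshold) are hypothetical placeholders and not the actual implementation in the linked PR; the key property is only that the write path reads a timestamp that a successful leader check alone can advance.

```java
// Illustrative sketch only: class and member names are hypothetical,
// not the actual OpenSearch implementation.
public class LeaseCheckGuard {

    // Updated by the leader checker flow on every successful check.
    private volatile long lastSuccessfulLeaderCheckMillis = System.currentTimeMillis();

    // Must stay below the auto restore trigger time (enforced elsewhere).
    private final long leaseThresholdMillis;

    public LeaseCheckGuard(long leaseThresholdMillis) {
        this.leaseThresholdMillis = leaseThresholdMillis;
    }

    public void onSuccessfulLeaderCheck() {
        lastSuccessfulLeaderCheckMillis = System.currentTimeMillis();
    }

    /**
     * Called just before acknowledging a write/upload, i.e. at the end of the
     * transport call rather than only at entry. If too much time has elapsed
     * since the last successful leader check (e.g. a long GC pause), the request
     * is failed instead of acknowledged, because auto restore may already have
     * promoted a new primary elsewhere.
     */
    public void ensureLeaseNotExpired() {
        long elapsed = System.currentTimeMillis() - lastSuccessfulLeaderCheckMillis;
        if (elapsed > leaseThresholdMillis) {
            throw new IllegalStateException(
                "Not acknowledging request: " + elapsed + "ms since last successful leader check");
        }
    }
}
```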
The auto restore time should be higher than the leader checker time threshold.
How will you achieve this?
We will add a setting validator to ensure that the auto restore time is greater than the custom leader checker time threshold. Please note that this leader checker time threshold is a new threshold that applies to remote translog backed indexes only.
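As a rough illustration of the intended constraint (the names and wiring below are assumptions; the real change would plug into the settings infrastructure rather than a standalone method):

```java
// Hypothetical validation sketch; setting names and wiring are illustrative only.
final class AutoRestoreSettingsValidator {

    /**
     * The auto restore interval must be strictly greater than the (remote
     * translog specific) leader check threshold, so that an isolated primary
     * stops acknowledging writes before a restored copy starts accepting them.
     */
    static void validate(long autoRestoreIntervalMillis, long remoteLeaderCheckThresholdMillis) {
        if (autoRestoreIntervalMillis <= remoteLeaderCheckThresholdMillis) {
            throw new IllegalArgumentException(
                "auto restore interval [" + autoRestoreIntervalMillis
                    + "ms] must be greater than the leader check threshold ["
                    + remoteLeaderCheckThresholdMillis + "ms]");
        }
    }
}
```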
Your proposal assumes that the leader checker is also a strict leasing system, in addition to fault detection. By design, it seems so. But I am not sure if the implementation was done with this guarantee in mind. Is there any use case where we use the leader checker as a leasing system?
Fault detection consists of the leader checker and the follower checker. Together they currently handle the addition and removal of nodes and the consequent shard reallocation. Currently, the global level cluster blocks are triggered when the leader checker fails or when the follower checker informs the follower nodes that the cluster manager is going down. With this change, the idea is that we rely on the leader check call's timestamp to determine whether we should start failing requests and subsequently fail the local shard, since a considerable time has passed since the node last talked to the cluster manager.
Code reference where the leader checker is used to change global level cluster blocks - https://github.com/opensearch-project/OpenSearch/blob/5f8193021215cd6979fde66474ec8d74d32ac91a/server/src/main/java/org/opensearch/cluster/coordination/LeaderChecker.java#L293-L324
I have raised the PR for the first phase change - https://github.com/opensearch-project/OpenSearch/pull/6680. Please review it.
I have shared specific feedback on the PR. A few issues that I see:
Thanks for these points; replies below -
- We should rely on clock speeds and not an exact clock time. It’s difficult to ensure that system clocks across a cluster are synchronized enough to depend on that synchronization for ordering or coordinating distributed operations.
Responded https://github.com/opensearch-project/OpenSearch/pull/6680#discussion_r1140591495
- Leader check interval acts as a lease interval. The lease interval should always be higher than the heartbeat time (the leader check interval). It is strange that the write block is applied without looking at the lease interval. Does it rely on an internal lease interval which is a multiple of heartbeats? If yes, why take this input from the user and not rely on an internal value?
This makes sense. I think we can make this a function of the following three settings: leader check interval, leader check timeout and leader check retry count. That way it is always implicit for the user, who does not need to configure anything on their end. I had kept this configurable to also have a knob around the auto restore time. However, it would be simpler if both could be computed automatically from the existing leader checker values.
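A sketch of how this derivation could look; the exact formula (retries times interval plus timeout) and the use of the stock cluster.fault_detection.leader_check.* defaults are my assumptions here, not a committed design:

```java
// Sketch: derive the lease window from existing leader checker settings instead
// of a new user-facing knob. The combination below (retries * (interval + timeout))
// is an assumed worst-case detection time, not committed behaviour.
final class LeaseWindow {

    static long leaseIntervalMillis(long checkIntervalMillis, long checkTimeoutMillis, int retryCount) {
        return (long) retryCount * (checkIntervalMillis + checkTimeoutMillis);
    }

    public static void main(String[] args) {
        // Stock cluster.fault_detection.leader_check.* defaults: 1s interval,
        // 10s timeout, 3 retries -> 33s worst-case lease window.
        System.out.println(leaseIntervalMillis(1_000, 10_000, 3)); // 33000
    }
}
```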
- Related to point 2, the error message should be consistent whether the request is failed at the entry of the call or at the end of it, else it will be confusing.
Makes sense, will make the change.
- Is "no lease expiry" enough for correctness? What if lease was expired, primary was assigned somewhere else, followed by a successful heartbeat and before cluster state could be applied to fail the shard, the response was returned?
Responded https://github.com/opensearch-project/OpenSearch/pull/6680#discussion_r1140579114
- The elapsed time is updated in a separate background flow from the write path. If the background thread dies, the elapsed lease time may not be updated, causing correctness guarantees to break.
This is a good point, and there is a chance that this can happen. However, this change would not cause any correctness issues: in cases like this we will fail the request (i.e. not acknowledge it), even though the data might be present. We are making sure that an acknowledged request does not go missing. As a follow-up, we can also enhance the follower checker to trigger the leader checker, so that the leader checker gets revived if the background thread dies silently. In addition to the points above, the plan in the follow-up phase is to fail the shard. This will stop any further acknowledgement of requests on the failing shard, and if the node can reach the cluster manager, it will lead to either another replica becoming primary or the timer for auto restore kicking in.
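For illustration, the fail-closed behaviour described above, reusing the hypothetical LeaseCheckGuard sketched earlier (again, these names are placeholders):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Fail-closed illustration: if the background refresh dies silently,
// ensureLeaseNotExpired() starts throwing once the threshold elapses, so
// requests are failed (possibly spuriously) rather than acknowledged and lost.
final class LeaseRefresher {

    static ScheduledExecutorService startRefresher(LeaseCheckGuard guard) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Refresh the timestamp on every successful leader check; simulated here
        // with a fixed 1s schedule mirroring the leader check interval.
        scheduler.scheduleAtFixedRate(guard::onSuccessfulLeaderCheck, 0, 1, TimeUnit.SECONDS);
        return scheduler;
    }
}
```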
- I am not sure if there are more cases like 4 and 5. The primary concern I have is that we are using an existing implementation for leases assuming that it was built with those guarantees in mind. However, I have not seen enough evidence of that in the use case you explained above. It is more of a fault detection system.
Given that we are defensively failing requests for more cases similar to the above, I think it should be safe even when considering network isolation. I have also considered asymmetric network partitions (cc @shwetathareja), and they are handled with this approach.
Thanks @ashking94 for the proposal. It looks interesting, and I am thinking more about whether it can give correctness guarantees for durability.
A couple of points to discuss (calling the data node hosting the shard n1):
So in a nutshell, I see a bunch of challenges with this approach for providing correctness guarantees for durability.
Leasing checks are pretty weak (leader checks only check the term; the version can be stale). While this may be true for node failures, the leader does not wait for the lease duration to expire in all cases, so there may be edge cases we miss, like a close followed by a restore where the close might have completed the publication round but is still executing, threads can be in a stuck state, etc.
For GC pauses and network failures the leader check can mostly serve as a leasing system, but it is not guaranteed, and bugs in the system would tend to break such guarantees. I would suggest we invest in hardening leases rather than focusing on what is guaranteed by the existing system (it makes no leasing guarantees today).
Thanks everyone for the comments. After thinking more about this, to solve the problems mentioned above we would need to build a strict leasing system that works by acquiring leases (or locks). Currently the cluster manager is responsible for communicating cluster state changes to everyone in the cluster, and from there primary term validation helps to identify stale primaries and fail them. This would likely need investment in making the cluster manager also act as a distributed lock/lease service, with every primary shard acquiring/renewing/releasing locks from the elected active cluster manager. We will circle back on this, as without it auto restore might be tough.
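Purely to make that follow-up idea concrete, here is a hypothetical shape such a shard-level leasing API could take; none of these types exist in OpenSearch today:

```java
// Hypothetical shard-level leasing API, sketched only to illustrate the idea
// of primaries acquiring/renewing/releasing leases from the cluster manager.
public interface PrimaryShardLease {

    /** Acquire a time-bound lease for the given shard from the elected cluster manager. */
    LeaseToken acquire(String shardId, long durationMillis);

    /** Extend an existing lease before it expires; fails if the lease was lost. */
    LeaseToken renew(LeaseToken token, long durationMillis);

    /** Release the lease, e.g. on relocation or shard close. */
    void release(LeaseToken token);

    /** Opaque handle identifying a granted lease (shard, holder node, expiry). */
    record LeaseToken(String shardId, String holderNodeId, long expiryMillis) {}
}
```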
[Remote Translog] Fail requests on isolated shard
Problem statement
Document type replication
For a document replication backed index, if a shard is configured with one or more replicas, an isolated primary can realise that it is no longer the current primary when it fans out the incoming request to all the replicas and the replicas perform primary term validation. If no replicas are configured, the isolated primary can continue to accept the incoming writes depending upon the `cluster.no_cluster_manager_block` setting. If `cluster.no_cluster_manager_block` is `write`, then indexing requests would be returned with an HTTP 5xx status code. If `cluster.no_cluster_manager_block` is `metadata_write`, then metadata writes would be rejected but indexing would continue to happen forever. Since there is no auto recovery of the failed primary (without replicas), this would require manual intervention to join the isolated primary back to the cluster.
Remote backed index
For an index backed by remote store, if a shard is configured with one or more replicas, an isolated primary can still realise that it is no longer the primary while doing the primary term validation. This ensures that indexing requests are not acknowledged once the cluster manager promotes an in-sync replica to primary. However, when a shard is configured with no replicas and `cluster.no_cluster_manager_block` is `metadata_write`, the isolated primary can continue to accept indexing requests forever. In the remote store vision, we also plan to have auto restore of a red index from the remote store, as remote-backed indexes have request level durability. However, a shard with no replica hinders the auto restore of a red index: the isolated primary can continue to accept and acknowledge writes while the new primary shard restores from the remote store and starts accepting writes. The isolated primary continues to accept the writes when it is acting as the coordinator; if the request falls on a node that is not isolated, that node would forward the request to the right primary shard.
If the cluster manager block setting `cluster.no_cluster_manager_block` is set to `write`, then the node would start failing the writes after the LeaderChecker on the concerned node has exhausted all retries and the global level cluster block has been applied. This ensures that no new requests are accepted. However, if requests were accepted before the global level cluster block was applied and there are then long GC pauses, it is possible that auto restore kicks in before the request is acknowledged. This can lead to data loss.
Note - Cluster blocks are checked at the entry when a transport call is made. For remote store cases, we must ensure that there is a check at the end of the call as well, so that if auto restore has happened, it does not acknowledge any upload that started long ago.
Approach
Set `cluster.no_cluster_manager_block` to `write` for remote translog backed indexes. The above approach has the most benefit when there are no more active replicas or the number of configured replicas is 0.
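For reference, a minimal sketch of the cluster block setting this approach relies on; whether this value becomes the default for remote translog backed indexes or has to be set explicitly is part of the proposal, not current behaviour:

```java
import org.opensearch.common.settings.Settings;

// Minimal configuration sketch for the approach above.
final class NoClusterManagerBlockConfig {

    static Settings writeBlockOnNoClusterManager() {
        return Settings.builder()
            // Reject writes once the node has no known cluster manager,
            // instead of only rejecting metadata writes.
            .put("cluster.no_cluster_manager_block", "write")
            .build();
    }
}
```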
Execution
Planning to implement this in 2 phases -
Usecases
Considerations
References