ashking94 opened this issue 1 year ago
This is a good proposal.
Cluster blocks are checked at the entry when a transport call is made. For remote store cases, we must ensure that there is a check at the end of the call as well, so that if auto restore has happened, it does not acknowledge any upload that started long ago.
Is checking for the cluster block sufficient? What if the block is removed by the time the response is returned?
The auto restore time should be higher than the leader checker time threshold.
How will you achieve this?
Your proposal assumes that the leader checker is also a strict leasing system, in addition to fault detection. By design, it seems so. But I am not sure if the implementation was done with this guarantee in mind. Is there any use case where we use the leader checker as a leasing system?
Cluster blocks are checked at the entry when a transport call is made. For remote store cases, we must ensure that there is a check at the end of the call as well, so that if auto restore has happened, it does not acknowledge any upload that started long ago.
Is checking for the cluster block sufficient? What if the block is removed by the time the response is returned?
This check is different: it uses the latest successful leader check time to determine whether to acknowledge the request or fail it, and it is not the same as the cluster block. The cluster block trigger is reactive in nature, and GC pauses can lead to the cluster block not even getting applied. Hence we need to rely on the latest successful leader check time.
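To make this concrete, here is a minimal sketch of what such an end-of-call check could look like. All names here (LeaseCheckGuard, ensureLeaseNotExpired, the threshold) are hypothetical placeholders and not the actual implementation in the linked PR; the key property is only that the write path reads a timestamp that a successful leader check alone can advance.

```java
// Illustrative sketch only: class and member names are hypothetical,
// not the actual OpenSearch implementation.
public class LeaseCheckGuard {

    // Updated by the leader checker flow on every successful check.
    private volatile long lastSuccessfulLeaderCheckMillis = System.currentTimeMillis();

    // Must stay below the auto restore trigger time (enforced elsewhere).
    private final long leaseThresholdMillis;

    public LeaseCheckGuard(long leaseThresholdMillis) {
        this.leaseThresholdMillis = leaseThresholdMillis;
    }

    public void onSuccessfulLeaderCheck() {
        lastSuccessfulLeaderCheckMillis = System.currentTimeMillis();
    }

    /**
     * Called just before acknowledging a write/upload, i.e. at the end of the
     * transport call rather than only at entry. If too much time has elapsed
     * since the last successful leader check (e.g. a long GC pause), the request
     * is failed instead of acknowledged, because auto restore may already have
     * promoted a new primary elsewhere.
     */
    public void ensureLeaseNotExpired() {
        long elapsed = System.currentTimeMillis() - lastSuccessfulLeaderCheckMillis;
        if (elapsed > leaseThresholdMillis) {
            throw new IllegalStateException(
                "Not acknowledging request: " + elapsed + "ms since last successful leader check");
        }
    }
}
```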
The auto restore time should be higher than the leader checker time threshold.
How will you achieve this?
We will add a setting validator to ensure that the auto restore time is greater than the custom leader checker time threshold. Please note that this leader checker time threshold is a new threshold that applies to remote translog backed indexes only.
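As a rough illustration of the intended constraint (the names and wiring below are assumptions; the real change would plug into the settings infrastructure rather than a standalone method):

```java
// Hypothetical validation sketch; setting names and wiring are illustrative only.
final class AutoRestoreSettingsValidator {

    /**
     * The auto restore interval must be strictly greater than the (remote
     * translog specific) leader check threshold, so that an isolated primary
     * stops acknowledging writes before a restored copy starts accepting them.
     */
    static void validate(long autoRestoreIntervalMillis, long remoteLeaderCheckThresholdMillis) {
        if (autoRestoreIntervalMillis <= remoteLeaderCheckThresholdMillis) {
            throw new IllegalArgumentException(
                "auto restore interval [" + autoRestoreIntervalMillis
                    + "ms] must be greater than the leader check threshold ["
                    + remoteLeaderCheckThresholdMillis + "ms]");
        }
    }
}
```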
Your proposal assumes that the leader checker is also a strict leasing system, in addition to fault detection. By design, it seems so. But I am not sure if the implementation was done with this guarantee in mind. Is there any use case where we use the leader checker as a leasing system?
Fault detection consists of the leader checker and the follower checker. Together they currently handle the addition and removal of nodes and the consequent shard reallocation. Currently, the global level cluster blocks are triggered when the leader checker fails or when the follower checker informs the follower nodes that the cluster manager is going down. With this change, the idea is that we rely on the leader check call's timestamp to determine whether we should start failing requests and subsequently fail the local shard, since a considerable time has passed since the node last talked to the cluster manager.
Code reference where the leader checker is used to change global level cluster blocks - https://github.com/opensearch-project/OpenSearch/blob/5f8193021215cd6979fde66474ec8d74d32ac91a/server/src/main/java/org/opensearch/cluster/coordination/LeaderChecker.java#L293-L324
I have raised the PR for the first phase change - https://github.com/opensearch-project/OpenSearch/pull/6680. Please review it.
I have shared specific feedback on the PR. A few issues that I see:
Thanks for these points; replies below -
- We should rely on clock speeds and not an exact clock time. It’s difficult to ensure that system clocks across a cluster are synchronized enough to depend on that synchronization for ordering or coordinating distributed operations.
Responded https://github.com/opensearch-project/OpenSearch/pull/6680#discussion_r1140591495
- Leader check interval acts as a lease interval. The lease interval should always be higher than the heartbeat time (the leader check interval). It is strange that the write block is applied without looking at the lease interval. Does it rely on an internal lease interval which is a multiple of heartbeats? If yes, why take this input from the user and not rely on an internal value?
This makes sense. I think we can make this a function of the following three settings: leader check interval, leader check timeout and leader check retry count. That way it is always implicit for the user, who does not need to configure anything on their end. I had kept this configurable to also have a knob around the auto restore time. However, it would be simpler if both could be computed automatically from the existing leader checker values.
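A sketch of how this derivation could look; the exact formula (retries times interval plus timeout) and the use of the stock cluster.fault_detection.leader_check.* defaults are my assumptions here, not a committed design:

```java
// Sketch: derive the lease window from existing leader checker settings instead
// of a new user-facing knob. The combination below (retries * (interval + timeout))
// is an assumed worst-case detection time, not committed behaviour.
final class LeaseWindow {

    static long leaseIntervalMillis(long checkIntervalMillis, long checkTimeoutMillis, int retryCount) {
        return (long) retryCount * (checkIntervalMillis + checkTimeoutMillis);
    }

    public static void main(String[] args) {
        // Stock cluster.fault_detection.leader_check.* defaults: 1s interval,
        // 10s timeout, 3 retries -> 33s worst-case lease window.
        System.out.println(leaseIntervalMillis(1_000, 10_000, 3)); // 33000
    }
}
```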
- Related to point 2, the error message should be consistent whether the request is failed at the entry of the call or at the end of it, else it will be confusing.
Makes sense, will make the change.
- Is "no lease expiry" enough for correctness? What if lease was expired, primary was assigned somewhere else, followed by a successful heartbeat and before cluster state could be applied to fail the shard, the response was returned?
Responded https://github.com/opensearch-project/OpenSearch/pull/6680#discussion_r1140579114
- The elapsed time is updated in a separate background flow from the write path. If the background thread dies, the elapsed lease time may not be updated, causing correctness guarantees to break.
This is a good point, and there is a chance that this can happen. However, this change would not cause any correctness issues: in cases like this we will fail the request (i.e. not acknowledge it), even though the data might be present. We are making sure that an acknowledged request does not go missing. As a follow-up, we can also enhance the follower checker to trigger the leader checker, so that the leader checker gets revived if the background thread dies silently. In addition to the points above, the plan in the follow-up phase is to fail the shard. This will stop any further acknowledgement of requests on the failing shard, and if the node can reach the cluster manager, it will lead to either another replica becoming primary or the timer for auto restore kicking in.
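For illustration, the fail-closed behaviour described above, reusing the hypothetical LeaseCheckGuard sketched earlier (again, these names are placeholders):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Fail-closed illustration: if the background refresh dies silently,
// ensureLeaseNotExpired() starts throwing once the threshold elapses, so
// requests are failed (possibly spuriously) rather than acknowledged and lost.
final class LeaseRefresher {

    static ScheduledExecutorService startRefresher(LeaseCheckGuard guard) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Refresh the timestamp on every successful leader check; simulated here
        // with a fixed 1s schedule mirroring the leader check interval.
        scheduler.scheduleAtFixedRate(guard::onSuccessfulLeaderCheck, 0, 1, TimeUnit.SECONDS);
        return scheduler;
    }
}
```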
- I am not sure if there are more cases like 4 and 5. The primary concern I have is that we are using an existing implementation for leases assuming that it was built with those guarantees in mind. However, I have not seen enough evidence of that in the use case you explained above. It is more of a fault detection system.
Given that we are defensively failing requests for more cases similar to the above, I think it should be safe even when considering network isolation. I have also considered asymmetric network partitions (cc @shwetathareja), and they are handled with this approach.
Thanks @ashking94 for the proposal. It looks interesting, and I am thinking more about whether it can give correctness guarantees for durability.
A couple of points to discuss (calling the data node hosting the shard n1):
So in a nutshell, I see a bunch of challenges with this approach for providing correctness guarantees for durability.
Leasing checks are pretty weak (leader checks only check the term; the version can be stale). While this may be true for node failures, the leader does not wait for the lease duration to expire in all cases, so there may be edge cases we miss, like a close followed by a restore where the close might have completed the publication round but is still executing, threads can be in a stuck state, etc.
For GC pauses and network failures the leader check can mostly serve as a leasing system, but it is not guaranteed, and bugs in the system would tend to break such guarantees. I would suggest we invest in hardening leases rather than focusing on what is guaranteed by the existing system (it makes no leasing guarantees today).
Thanks everyone for the comments. After thinking more about this, to solve the problems mentioned above we would need to build a strict leasing system that works by acquiring leases (or locks). Currently the cluster manager is responsible for communicating cluster state changes to everyone in the cluster, and from there primary term validation helps to identify stale primaries and fail them. This would likely need investment in making the cluster manager also act as a distributed lock/lease service, with every primary shard acquiring/renewing/releasing locks from the elected active cluster manager. We will circle back on this, as without it auto restore might be tough.
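Purely to make that follow-up idea concrete, here is a hypothetical shape such a shard-level leasing API could take; none of these types exist in OpenSearch today:

```java
// Hypothetical shard-level leasing API, sketched only to illustrate the idea
// of primaries acquiring/renewing/releasing leases from the cluster manager.
public interface PrimaryShardLease {

    /** Acquire a time-bound lease for the given shard from the elected cluster manager. */
    LeaseToken acquire(String shardId, long durationMillis);

    /** Extend an existing lease before it expires; fails if the lease was lost. */
    LeaseToken renew(LeaseToken token, long durationMillis);

    /** Release the lease, e.g. on relocation or shard close. */
    void release(LeaseToken token);

    /** Opaque handle identifying a granted lease (shard, holder node, expiry). */
    record LeaseToken(String shardId, String holderNodeId, long expiryMillis) {}
}
```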
[Remote Translog] Fail requests on isolated shard
Problem statement
Document type replication
For a document replication backed index, if a shard is configured with one or more replicas, an isolated primary can realise that it is no longer the current primary when it fans out the incoming request to all the replicas and the replicas perform primary term validation. If no replicas are configured, the isolated primary can continue to accept the incoming writes depending upon the `cluster.no_cluster_manager_block` setting. If `cluster.no_cluster_manager_block` is `write`, then indexing requests would be returned with an HTTP 5xx status code. If `cluster.no_cluster_manager_block` is `metadata_write`, then metadata writes would be rejected but indexing would continue to happen forever. Since there is no auto recovery of the failed primary (without replicas), this would require manual intervention to join the isolated primary back to the cluster.
Remote backed index
For an index backed by remote store, if a shard is configured with one or more replicas, an isolated primary can still realise that it is no longer the primary while doing the primary term validation. This ensures that indexing requests are not acknowledged once the cluster manager promotes an in-sync replica to primary. However, when a shard is configured with no replicas and `cluster.no_cluster_manager_block` is `metadata_write`, the isolated primary can continue to accept indexing requests forever. In the remote store vision, we also plan to have auto restore of a red index from the remote store, as remote-backed indexes have request level durability. However, a shard with no replica hinders the auto restore of a red index: the isolated primary can continue to accept and acknowledge writes while the new primary shard restores from the remote store and starts accepting writes. The isolated primary continues to accept the writes when it is acting as the coordinator; if the request falls on a node that is not isolated, that node would forward the request to the right primary shard.
If the cluster manager block setting `cluster.no_cluster_manager_block` is set to `write`, then the node would start failing the writes after the LeaderChecker on the concerned node has exhausted all retries and the global level cluster block has been applied. This ensures that no new requests are accepted. However, if requests were accepted before the global level cluster block was applied and there are then long GC pauses, it is possible that auto restore kicks in before the request is acknowledged. This can lead to data loss.
Note - Cluster blocks are checked at the entry when a transport call is made. For remote store cases, we must ensure that there is a check at the end of the call as well, so that if auto restore has happened, it does not acknowledge any upload that started long ago.
Approach
Set `cluster.no_cluster_manager_block` to `write` for remote translog backed indexes. The above approach has the most benefit when there are no more active replicas or the number of configured replicas is 0.
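For reference, a minimal sketch of the cluster block setting this approach relies on; whether this value becomes the default for remote translog backed indexes or has to be set explicitly is part of the proposal, not current behaviour:

```java
import org.opensearch.common.settings.Settings;

// Minimal configuration sketch for the approach above.
final class NoClusterManagerBlockConfig {

    static Settings writeBlockOnNoClusterManager() {
        return Settings.builder()
            // Reject writes once the node has no known cluster manager,
            // instead of only rejecting metadata writes.
            .put("cluster.no_cluster_manager_block", "write")
            .build();
    }
}
```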
Execution
Planning to implement this in 2 phases -
Usecases
Considerations
References