guardian / elasticsearch-node-rotation

Step Function for rotating nodes in an Elasticsearch cluster
MIT License
3 stars 2 forks source link

Modify the removeNode step so that the operations terminateInstanceInASG and excludeFromAllocation run in sequence rather than in parallel #54

Open simone-smith opened 4 years ago

simone-smith commented 4 years ago

We on the Ophan team had an incident on 11th August (incident retro doc) whereby an unassigned primary shard on an index in our Elasticsearch cluster sent the cluster into a red state, leaving Big Loader unable to write into it and causing lag on Dashboard.

The cause of the issue was the concurrent Promises at the end of this step:

https://github.com/guardian/elasticsearch-node-rotation/blob/b196ac43b0ef7e1357cf79f9fb7876ecaeabe935/src/removeNode.ts#L19-L22

We encountered a race condition, where the second operation (excludeFromAllocation) completed before the first (terminateInstanceInASG) - in the space of approximately 10 seconds, the node that was being rotated out of service was removed from the list of excluded IP addresses and a small shard from an index with 0 replicas was immediately allocated to it. This shard was so small that the allocation had completed by the time the node was terminated, which caused us to permanently lose the data on that node.

Modifying these operations so that they run one after another, or even moving the excludeFromAllocation operation to a separate, final, step in the step function, will prevent this race condition from recurring.

jacobwinch commented 4 years ago

even moving the excludeFromAllocation operation to a separate, final, step in the step function, will prevent this race condition from recurring.

👍 for this suggestion. I think this would also help to make it more obvious that the exclusions list needs to be cleared if a Step Function execution fails midway through.