guardian / elasticsearch-node-rotation

Step Function for rotating nodes in an Elasticsearch cluster
MIT License

Falling behind when there are too many nodes to refresh... #87

Open rtyley opened 2 years ago

rtyley commented 2 years ago

Our team currently has two big-ish Elasticsearch clusters - Elasticsearch 6 & Elasticsearch 7, and that ends up being a lot of ES nodes - about 50 nodes or so. Our current node rotation schedule is failing to keep up:

image

This is because:

We can widen that rotation period, but precisely scaling cron schedules is quite fiddly (e.g. currently, we have to be very careful that runs are never scheduled more frequently than the slowest possible migration takes). It would be nice to have a better way to scale this...
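One rough way to reason about whether a fixed schedule can keep up is sketched below. It assumes a single schedule that rotates one node per run, and the function names and the `slowestMigrationHours` figure are hypothetical, not taken from the project:

```typescript
// Rotations needed per day, on average, so that no node outlives the threshold.
function requiredRotationsPerDay(nodeCount: number, ageThresholdInDays: number): number {
  return Math.ceil(nodeCount / ageThresholdInDays);
}

// Most runs per day a cron schedule can safely fit if runs must never
// be more frequent than the slowest possible migration (assumed value).
function maxSafeRotationsPerDay(slowestMigrationHours: number): number {
  return Math.floor(24 / slowestMigrationHours);
}

// True when the safe schedule is frequent enough to stay under the threshold.
function scheduleKeepsUp(
  nodeCount: number,
  ageThresholdInDays: number,
  slowestMigrationHours: number
): boolean {
  return (
    maxSafeRotationsPerDay(slowestMigrationHours) >=
    requiredRotationsPerDay(nodeCount, ageThresholdInDays)
  );
}
```

With about 50 nodes and a 7-day threshold, `requiredRotationsPerDay(50, 7)` is 8, so under these assumptions a slowest migration of 4 hours (at most 6 safe runs a day) would leave the schedule behind.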

Let's respect ageThresholdInDays - don't stop until all nodes are younger

Thanks to https://github.com/guardian/elasticsearch-node-rotation/pull/68, we now have the ageThresholdInDays parameter (for Ophan, it is 7 days). At the moment it just means:

Don't rotate any node that is younger than ageThresholdInDays

How about if instead it meant:

The Step Function will not terminate until all nodes are younger than ageThresholdInDays

...then, rather than scheduling the Step Function to run multiple times a day, we could just cron it to run once per day.
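A minimal sketch of the two meanings of `ageThresholdInDays`, assuming a hypothetical `ClusterNode` shape (the project may model nodes differently). `nodesDueForRotation` reflects the current meaning, and `rotationFinished` is the proposed stopping condition:

```typescript
// Hypothetical node shape, for illustration only.
interface ClusterNode {
  instanceId: string;
  launchTime: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function ageInDays(node: ClusterNode, now: Date): number {
  return (now.getTime() - node.launchTime.getTime()) / MS_PER_DAY;
}

// Current meaning: only nodes at or above the threshold are candidates.
// Returned oldest first, so the first element is the next node to rotate.
function nodesDueForRotation(
  nodes: ClusterNode[],
  ageThresholdInDays: number,
  now: Date
): ClusterNode[] {
  return nodes
    .filter((n) => ageInDays(n, now) >= ageThresholdInDays)
    .sort((a, b) => a.launchTime.getTime() - b.launchTime.getTime());
}

// Proposed meaning: the run may only stop once no node is due.
function rotationFinished(
  nodes: ClusterNode[],
  ageThresholdInDays: number,
  now: Date
): boolean {
  return nodesDueForRotation(nodes, ageThresholdInDays, now).length === 0;
}
```

A daily run would keep rotating the first node from `nodesDueForRotation` until `rotationFinished` returns true.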

How can the ENR Step Function achieve that?

A few options:

twrichards commented 2 years ago

How about we add another step to the end of the process, which checks whether there are further nodes to rotate (that meet the criteria) and, if so, kicks off another execution of the Step Function? That way each rotation has its own Step Function execution, which makes debugging easier.

twrichards commented 2 years ago

This proposed behaviour could be configurable via the input event, perhaps with an optional maxRotations (defaulting to one if not present). The scheduled run could then set it to, say, 10 in your case (given 5 weekdays).
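The decision made by that final step could be sketched roughly as below. Only `maxRotations` comes from the suggestion above. The `rotationsSoFar` counter, the `RotationInput` shape and the function name are hypothetical, and actually starting the next execution is left out:

```typescript
// Hypothetical input event shape; only maxRotations is from the discussion.
interface RotationInput {
  maxRotations?: number; // defaults to 1 when absent
  rotationsSoFar?: number; // hypothetical counter carried between executions
}

type NextAction =
  | { kind: "stop" }
  | { kind: "startNext"; input: RotationInput };

// Called at the end of an execution that has just rotated one node.
function decideNextAction(
  input: RotationInput,
  eligibleNodesRemaining: number
): NextAction {
  const maxRotations = input.maxRotations ?? 1;
  const completed = (input.rotationsSoFar ?? 0) + 1;

  // Stop when nothing else meets the criteria or the limit has been reached.
  if (eligibleNodesRemaining <= 0 || completed >= maxRotations) {
    return { kind: "stop" };
  }
  // Otherwise pass the updated counter on to the next execution.
  return { kind: "startNext", input: { maxRotations, rotationsSoFar: completed } };
}
```

With no `maxRotations` in the input this keeps today's behaviour of one rotation per scheduled run. A result of `startNext` is where the step would start a fresh execution with the returned input.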

twrichards commented 2 years ago

also, this seems to be a dupe of https://github.com/guardian/elasticsearch-node-rotation/issues/34