oracle / coherence

Oracle Coherence Community Edition
https://coherence.community
Universal Permissive License v1.0

Allow auto scaling without making cluster vulnerable #68

Open javafanboy opened 2 years ago

javafanboy commented 2 years ago

Today there is (according to the answer to a previous question) no way for a storage-enabled cluster member to signal that it intends to leave the cluster and trigger re-balancing of its partitions in a safe and ordered way.

As a consequence it is a risky operation to "scale in" a Coherence cluster: with the standard backup count of one we are vulnerable to data loss if a storage-enabled node fails while another node has just terminated due to scale-in. Using a backup count of two is a possible work-around but results in a significant increase in memory use and reduced performance of updates...

A way to address this would be to provide a way to programmatically indicate that a storage-enabled node would like to leave, and that this would initiate an orderly re-balancing of the partitions WHILE THE NODE IS STILL AVAILABLE, to avoid data loss in the case of another node failing during the re-balancing.

Once this re-balancing is completed, the node would be able to terminate and the scale-in operation to complete.

Having this kind of functionality would be a big step taking Coherence into the "cloud era", allowing clusters to not only scale out but also scale in to track needed capacity.

thegridman commented 2 years ago

Even if we had a way for a member to signal that it was going to leave, you would still need to verify that rebalancing has completed before killing the next member. In the next release of Coherence (and in the current CE snapshot) we have a new health check API. This also exposes a health endpoint. With this you will be able to just kill a member, then check the health endpoint of one of the remaining members to verify when it is "safe" to kill the next member. The 22.06-SNAPSHOT docs for this are here: https://coherence.community/22.06-SNAPSHOT/docs/#/docs/core/11_health
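For illustration, a minimal sketch of that "kill one, wait until safe, kill the next" loop as seen from outside the cluster. The host, port, and "/safe" path here are assumptions to verify against the health-check docs linked above for the Coherence version you actually run:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WaitUntilSafe {
    public static void main(String[] args) throws Exception {
        // Health endpoint of one of the *remaining* storage members.
        // Host, port, and the "/safe" path are placeholders/assumptions.
        URI safe = URI.create("http://storage-1.example.com:6676/safe");
        HttpClient client = HttpClient.newHttpClient();

        while (true) {
            HttpRequest request = HttpRequest.newBuilder(safe).GET().build();
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            if (response.statusCode() == 200) {
                break; // the cluster reports it is safe; the next member may be stopped
            }
            Thread.sleep(5_000); // still rebalancing; poll again
        }
        System.out.println("Safe to terminate the next member");
    }
}
```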

thegridman commented 2 years ago

On the subject of "cloud", when running in K8s, we have a Coherence Operator that makes managing Coherence clusters in K8s much simpler. It safely scales clusters and works with the k8s autoscaler.

javafanboy commented 2 years ago

Limiting auto scaling so that it does not scale in too quickly (only one node at a time) is, at least in AWS, not a problem (and being slow at scaling in is a best practice anyway). The number of VMs to scale in at once can be set to one, and pacing can be handled by setting a long "grace period" or by using lifecycle hooks that signal when a node is ready to be terminated after scale-in.
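For illustration, a rough sketch of the lifecycle-hook approach using the AWS SDK for Java v2: the terminating instance (or an external controller) first waits until Coherence reports it is safe, e.g. via the health-endpoint poll sketched earlier, and only then completes the lifecycle action so the ASG proceeds with termination. The hook, ASG, and instance names are placeholders:

```java
import software.amazon.awssdk.services.autoscaling.AutoScalingClient;
import software.amazon.awssdk.services.autoscaling.model.CompleteLifecycleActionRequest;

public class ScaleInHook {

    public static void main(String[] args) {
        // Stand-in for the health-endpoint polling shown earlier in this thread.
        waitUntilCoherenceIsSafe();

        try (AutoScalingClient asg = AutoScalingClient.create()) {
            asg.completeLifecycleAction(CompleteLifecycleActionRequest.builder()
                    .lifecycleHookName("coherence-scale-in")       // placeholder
                    .autoScalingGroupName("coherence-storage-asg") // placeholder
                    .instanceId("i-0123456789abcdef0")             // placeholder
                    .lifecycleActionResult("CONTINUE")             // let termination proceed
                    .build());
        }
    }

    private static void waitUntilCoherenceIsSafe() {
        // e.g. poll a remaining member's health endpoint; omitted for brevity.
    }
}
```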

In our case we do not see any value in Kubernetes (we have no need for the portability or all the flexibility of components) and are not keen to take on all the complexity that comes with it (even in the cloud). In particular, the k8s project moves too quickly for us: for instance, if running k8s on EKS @ AWS you are forced to accept major updates to the cluster at least a few times a year (not sure how other cloud providers handle not letting excessively old versions of k8s keep running), with the need to perform major re-verification of all systems at that frequency to check that they still operate as expected (k8s sometimes introduces breaking changes or bugs, so this is not something you simply upgrade and keep your fingers crossed about for a critical system)...

In other words, not all users that would like to use Coherence in a cloud are on k8s.

Besides, I speculate the proposed improvement would make your own "Operator" project much simpler...

thegridman commented 2 years ago

I agree with the pain points of k8s - when you mentioned "cloud" in your original post, I wasn't sure whether you meant k8s or not. We are certainly seeing teams that naively never really thought about failure scenarios before, suddenly having to deal with them when a DevOps team takes away a k8s Node with zero notice, or does a rolling upgrade of their k8s cluster.

So outside of k8s, "cloud" as seen by Coherence is not really any different to on-premise physical boxes. Coherence is deployed on a set of VMs that happen to be in a cloud. As far as Coherence is concerned, it knows no difference between VMs and physical boxes. It knows there are a number of JVMs in a cluster, and which hosts they are running on, be that VMs or bare metal.

In this case, assuming you have properly balanced your Coherence deployment across these VMs, i.e. the number of storage-enabled members is pretty evenly distributed across VMs, then you can scale down by a whole VM at once and not lose data. Or in the case of multiple availability domains (typically data centres) you could lose a whole DC and not lose data, assuming you have capacity in the rest of the cluster to deal with that (which I guess is what you are talking about in the "Making Coherence CE more cloud friendly" discussion).
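For completeness, a minimal sketch of tagging each storage member with its location before it joins, so backups are kept off the same VM / AZ as their primaries. The coherence.site / coherence.rack / coherence.machine identity properties are the documented way to do this, but verify the exact property names and bootstrap API against the Coherence version you run; the values below are placeholders that would normally come from instance metadata:

```java
import com.tangosol.net.Coherence;

public class StorageMember {
    public static void main(String[] args) {
        // Member identity used by partition distribution (values are placeholders).
        System.setProperty("coherence.site",    "eu-west-1");            // e.g. region / data centre
        System.setProperty("coherence.rack",    "eu-west-1a");           // e.g. availability zone
        System.setProperty("coherence.machine", "i-0123456789abcdef0");  // e.g. VM id

        // Bootstrap API available in recent CE releases; DefaultCacheServer.main()
        // would work just as well on older versions.
        Coherence.clusterMember().start().join();
    }
}
```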

javafanboy commented 2 years ago

Exactly my point: Coherence today is built with an "on premises" mindset and a mostly static number of nodes. The problem with scaling down one node at a time by "killing it" is, as described in the discussion thread (as I understand it - correct me if I am wrong), that if a failure happens while the rebalancing from the scaling event is still going on we will lose data (unless the backup count is larger than 1)...

We consider the risk of two VM failures (in particular in different racks = in our case AZs) happening at the same time to be VERY small, but if we regularly perform scale-in (at least a few times a day, perhaps) the risk of data loss increases, as it just takes one failure during one of those daily rebalancing windows...

I suggest we place any further thoughts under the discussion thread where they may be visible to more users (even though I feel pretty alone discussing things here - this is a project I think deserves a lot more attention and users, as it is a very stable and nice "piece of tech", even though my complaints may sound like I did not think so)...

aseovic commented 2 years ago

@thegridman I think you may be missing the (valid) point @javafanboy is trying to make: scaling in one node at a time, the way both the Operator and the AWS auto scaler allow you to do, is fine and should work fine most of the time. It is effectively a controlled failure, and Coherence is more than capable of dealing with it and rebalancing the data as needed.

The issue is that it does open a window (however small, and depending on the size of the members it may not be all that small either) during which another, this time accidental, failure would likely result in data loss.

I like the idea of scaling in in a more orderly fashion, and allowing for accidental failure during scale-in.

The signaling is not an issue: we already have a shutdown hook for SIGTERM that at the moment simply calls Cluster.shutdown, which in turn calls Service.shutdown for each running service. In theory, that should be sufficient, but unfortunately PartitionedService.shutdown doesn't really do much -- it waits 5 seconds for the service to fully start (if not already started), another 5 seconds to drain the event dispatcher queue, and then calls Service.stop, effectively doing an internal kill -9 and terminating the process.

I think we can do better than that when it comes to "graceful shutdown" by:

  1. Relinquishing all primary and backup data to other members
  2. Fully draining the event queue
  3. Ensuring the service is not endangered
  4. Shutting down

We may need to get @mgamanho involved to hash out the details, but I do think it's doable and would significantly improve the "graceful shutdown" experience and enable safe scale-in. Most of the necessary infrastructure seems to be already in place.
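In the meantime, step 3 in the list above (ensuring the service is not endangered) can be approximated from the outside today by polling the StatusHA attribute on the Coherence service MBeans over JMX. A rough sketch, with the JMX URL as a placeholder and the object-name pattern and attribute name to be verified against the management docs for your release:

```java
import java.util.Set;
import javax.management.AttributeNotFoundException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class WaitForHaStatus {
    public static void main(String[] args) throws Exception {
        // Assumes JMX management is enabled on the member; URL is a placeholder.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://storage-1.example.com:9991/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName pattern = new ObjectName("Coherence:type=Service,*");

            boolean endangered = true;
            while (endangered) {
                endangered = false;
                Set<ObjectName> names = mbs.queryNames(pattern, null);
                for (ObjectName name : names) {
                    Object status;
                    try {
                        status = mbs.getAttribute(name, "StatusHA");
                    } catch (AttributeNotFoundException e) {
                        continue; // not a partitioned service, no HA status to check
                    }
                    if ("ENDANGERED".equals(status)) {
                        endangered = true; // at least one service still has no backup
                        break;
                    }
                }
                if (endangered) {
                    Thread.sleep(5_000); // rebalancing still in progress
                }
            }
        }
        System.out.println("All partitioned services report a safe HA status");
    }
}
```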

javafanboy commented 2 years ago

Exactly my point Aleks - you described it better than I could!

mgamanho commented 2 years ago

I think it would be a great addition.

Especially since, as you mention @aseovic, DCS.shutdown() already does part of the work. It does attempt to transfer primary partitions out, BTW, but eventually forces a shutdown. More importantly, there is no "shutdown-rollback" mode to return the member to the pool should the user want to, or should cluster integrity be at risk.

javafanboy commented 1 year ago

Any chance of this proposal actually being implemented?

aseovic commented 1 year ago

@javafanboy Yes. I created a JIRA internally for this work back in May, and we've had a few discussions about how we would actually do it, and have agreed on the high-level changes we'd need to make to support it.

However, as is often the case, the problem is finding and allocating the time to do it, as the few developers who have the expertise to work on it have plenty of other tasks in the backlog as well.

It is fairly high on the priority list, though, so I'm hoping we can get to it in the not-too-distant future.

javafanboy commented 1 year ago

Any news on this?

aseovic commented 8 months ago

@javafanboy Sorry, I completely missed your question above from 6 months ago, and just found it as we were going through all open issues in various places to close out the year.

Unfortunately, I don't have much in the way of an update -- the issue I created is still in the backlog, and either @mgamanho or I still need to find some time to work on it. As much as we'd both love to do that, we had our plates full this year with a number of things that were deemed higher priority, for better or worse :-(