pingidentity / pingidentity-devops-getting-started

Ping Identity Devops Program
https://devops.pingidentity.com

[Question]: How to handle Ping Directory autoscaling #484

Closed MihaiNuuday closed 1 year ago

MihaiNuuday commented 1 year ago

Hi team!

We've been diving into the devops documentation and having a few discussions with some of your colleagues from Ping Identity. But one topic is still a bit unclear to us: how is Ping Directory meant to scale horizontally when using the ping devops framework for kubernetes (helm)?

More specifically, I can break the question down into 2 parts:

What makes it a bit confusing is this part of the documentation: https://github.com/pingidentity/pingidentity-devops-getting-started/blob/master/30-helm/pingdirectory-scale-down/README.md, which raises the question: would it be reasonable to have the preStop hook enabled at all times? Because if so, I'd imagine the answer to the scaling question would be: yes, it can scale up and down without intervention. Is that right?

Is there any reason not to use the preStop hook at all times, other than the time it takes for a PD node to rejoin the cluster? Would you be able to give an anecdotal observation of this impact on startup times (i.e. how much longer it takes for a node to rejoin)? And would you advise against keeping the preStop hook enabled at all times?
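For reference, this is roughly what we understand "enabling the preStop hook" to mean, based on the linked README. A minimal values sketch, assuming the chart exposes the container lifecycle under the `pingdirectory` key and that the hook script sits at the usual hooks path (the linked README is the authoritative example):

```yaml
# Sketch only: the exact values layout and hook path are assumptions based on
# the linked pingdirectory-scale-down README, not copied verbatim from it.
pingdirectory:
  enabled: true
  container:
    lifecycle:
      preStop:
        exec:
          command:
            - /opt/staging/hooks/90-shutdown-sequence.sh
```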

One thing I would imagine, if there are no other complications, is that we could even enable an HPA for Ping Directory, which would help solve our scalability worries at peak utilisation times (we have a use case where bursts of high traffic are expected on random occasions). And if you have any pointers on how to use an HPA with PingDirectory via your devops framework, that would certainly be appreciated.
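For context, what we have in mind is a plain autoscaling/v2 HorizontalPodAutoscaler pointed at the PingDirectory StatefulSet, defined alongside the chart. A hypothetical sketch; the StatefulSet name, replica bounds, and CPU threshold below are just placeholders to illustrate the idea:

```yaml
# Hypothetical HPA: the target name, bounds, and threshold are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pingdirectory
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: pingdirectory
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```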

What do you think?

henryrecker-pingidentity commented 1 year ago

Scaling up will work just as it does when building out the original pods of a statefulset. The pods are able to join themselves to the existing PingDirectory topology without issue.

Scaling down is where the preStop hook comes into play, as the linked documentation describes. Having this hook enabled at all times carries a significant time cost with large data sets and large topologies. The main issue is that it leads to a lot of unnecessary leaving and rejoining of the topology whenever the PD pods stop for reasons other than scaling down. For example, on a simple restart of a pod (such as to apply an update from a changed server profile), the pod would leave the topology, restart, and then have to rejoin the topology and reinitialize its data. This can take a long time depending on the server profile and the dataset.

It's hard to make a general estimate of exactly how long, since it varies a lot depending on the size of the topology and the dataset, but I believe it could add a few minutes to start times for each pod.

If you do enable this hook, PingDirectory could scale up and down without intervention. But due to the above caveats, we don't recommend autoscaling for PingDirectory. Our helm charts don't support directly enabling an HPA with PingDirectory.

MihaiNuuday commented 1 year ago

Thanks for the answer. It's extremely helpful. As a follow-up question - what happens technically when a pod leaves the topology?

henryrecker-pingidentity commented 1 year ago

I'm not sure exactly what you are asking. The leaving pod has to tell the other pods to remove it from their configuration and stop attempting to replicate data to it. Once that config change has reached all the other servers in the topology, the pod is done and can shut down.

MihaiNuuday commented 1 year ago

I guess what I had in mind was to narrow the question down: what makes leaving the topology "slow"?

Is it perhaps that the leaving pod drops data upon leaving (or performs some sort of cleanup)? And/or does it subsequently need to perform a full data sync when rejoining (is that what happens behind the scenes, maybe)?

Or is it simply the process of communicating to, and getting acknowledgement from, the cluster that the pod is leaving (and rejoining) that is inefficient to perform?

In other words - what is it that makes the process of leaving and joining the PD topology heavy or inefficient? 🤔

henryrecker-pingidentity commented 1 year ago

In general, the larger a topology grows, the longer it takes to leave or join it. The act of leaving or joining can take time because the topology configuration has to be locked, the new server's connection info has to be communicated to every other server, and then all the servers have to come to an agreement on the master of the topology.

After rejoining the topology, the specific step that can take a long time with a lot of user data is re-initializing replication (the dsreplication initialize command) from an existing server to the server being added. Leaving the topology won't cause any data to be removed from either the server leaving or the remaining servers, but it still needs to be re-initialized when it rejoins to ensure everything is in sync.

If you have a large data set, it's the re-initializing replication step that is likely to take a long time. If you have a lot of servers, the act of leaving (remove-defunct-server command) or joining (dsreplication enable command) the topology is likely to take a long time.

MihaiNuuday commented 1 year ago

Thank you Henry! Really, really great insights. I can certainly see the behavior you're describing when recreating the scenario in a controlled environment. Theoretically speaking, this could be a tradeoff that one could accept, depending on the use case. For example, with a relatively small user base (say, in the low millions of records), one could sacrifice a few minutes in rollout times for the ability to autoscale horizontally on demand.

Now, if I can flip the question: let's imagine one were to scale the PD cluster up and down manually between 3 and 6 replicas, without triggering 90-shutdown-sequence.sh at any point. Theoretically, at peak cluster size, everything would be "normal". However, at the lower end of the cluster size, the topology would constantly try to reconnect to the non-existent PD nodes. Functionally, and out of pure curiosity: how would this state (of missing parts of a PD topology) affect the functioning of the PD cluster? Would there be performance degradation, do you think? Or would there eventually be a malfunction of the PD cluster? 🤔

This is pure curiosity, and a theoretical question; I fully recognize this is not how a PD cluster should be operated. But it's very interesting and potentially even essential to have an idea about in an emergency situation :)

henryrecker-pingidentity commented 1 year ago

That would definitely cause problems. Of course, there would be a constant stream of errors from the remaining servers trying to connect to the ones that were removed. And when there aren't enough reachable PD servers to elect a master (half of the total servers plus one) the PD topology goes into read-only mode. The data can still be read and replicated in this mode, but the topology itself can't be changed without manual intervention. When you get into this read-only/no-master state, you end up having to manually run commands to remove the unreachable servers or to force one of the remaining servers to act as master. It might be possible to scale up again as normal to fix things if the persistent volumes of the removed pods stick around, but I wouldn't expect it to work well without some manual intervention.

MihaiNuuday commented 1 year ago

Great. And I would assume that the PD cluster and helm chart are smart enough that this read-only scenario would not happen during major Kubernetes maintenance, such as cluster upgrades when Kubernetes nodes might be recycled and pods are moved to new nodes, right?

henryrecker-pingidentity commented 1 year ago

If you take too many pods down at the same time, you could still run into it, due to having too few servers still running. But you could configure the cluster upgrade to avoid taking down more than one at a time.
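One common way to express that in plain Kubernetes is a PodDisruptionBudget that allows at most one PD pod to be voluntarily evicted at a time. A rough sketch; the label selector below is an assumption and needs to match whatever labels your release actually puts on the PingDirectory pods:

```yaml
# Rough sketch: the selector labels are assumptions; match them to the
# labels your Helm release applies to the PingDirectory pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pingdirectory-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: pingdirectory
```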

PingDavidR commented 1 year ago

Closing this issue as resolved. @MihaiNuuday, please reach out by opening another ticket if you have further questions.