slateci / slate-catalog

SLATE application catalog based on Helm

frontier-squid is a deployment of size 1 and can not be scaled up #583

Open · rptaylor opened this issue 2 years ago

rptaylor commented 2 years ago

The deployment size is fixed at 1. In principle it could be made a variable and increased; however, this would clash with the use of ReadWriteOnce PVCs, since each squid pod would try to mount the same volume.

The scenario where you'd want to scale up the squid deployment (e.g. for performance and robustness in a production environment with a heavy load on the squids) is also one where you'd care about cold caches for performance reasons and would want persistent storage via a PVC instead of emptyDirs.

The only way to do that is to use a StatefulSet instead of a Deployment, since a StatefulSet provides volumeClaimTemplates so that each member can have its own PVC.
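
A rough sketch of the idea (the name, image, and storage size are placeholders, not the actual chart values):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frontier-squid               # placeholder name
spec:
  replicas: 3                        # now scalable
  serviceName: frontier-squid        # headless service for stable pod identities
  selector:
    matchLabels:
      app: frontier-squid
  template:
    metadata:
      labels:
        app: frontier-squid
    spec:
      containers:
        - name: squid
          image: example/frontier-squid:latest   # placeholder image
          volumeMounts:
            - name: cache
              mountPath: /var/cache/squid
  volumeClaimTemplates:
    - metadata:
        name: cache
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```

Each member then gets its own PVC (cache-frontier-squid-0, cache-frontier-squid-1, ...), so ReadWriteOnce is no longer a constraint on scaling.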

A StatefulSet can still use emptyDirs, but the converse is not possible (a Deployment with more than one replica cannot share a ReadWriteOnce PVC), so a StatefulSet seems to be the only way to make the helm chart scalable. Aside from scalability, this would also improve the resilience and availability of the squid service: on upgrade each member would be updated in a rolling manner instead of a brief outage of the one SPOF pod in the Deployment.

Supposing that the deployment is changed to a statefulset, I think the impact on existing users (who use PVCs and a replica size of 1) would only be a one-time cache emptying when upgrading via Helm; the old PVC would be deleted and a new one with a slightly different name would be made and used by the new statefulSet squid.

However having multiple squids behind the same monitoring service might show inconsistent results. In principle this could be handled with separate services for the monitoring port of each individual statefulSet member. That being said I think the same issue exists with multi-worker squid instances? So it may not be a big deal.

LincolnBryant commented 2 years ago

So one of the problems we had with scaling up squids is that the monitoring port is impossible(?) to change in the WLCG monitoring. As far as I understand, they have everything hard-coded to port 3401 for Squid SNMP monitoring. That makes a nodePort-based Squid a bit more challenging in a number of ways, unfortunately. If you don't care about whether the WLCG can see your SNMP ports then it's definitely doable to change things over to a Statefulset.

I have some more thoughts but I'll have to follow up later :)

rptaylor commented 2 years ago

True, if the number of members is > 1 (regardless of whether a Deployment or StatefulSet is used) they could not all use the same nodePort. What is the use case for a nodePort-based squid monitoring service? A nodePort is just a ClusterIP with an extra kube-proxy/IPVS forwarding rule on top; it allows service discovery by an arbitrary local port number instead of a service name. For access from outside the cluster it requires public IPs on the kubelet nodes, so IIUC you would either have a special (SPOF) node with the right public IP, or you would still need an LB on top across all of the public IPs of the nodes. That being said, it can be a bit easier in some environments than setting up a k8s-native LBaaS.

The "standard" mechanism for exposing external access to cluster services is ingress (and nearly all the ingress providers have TCP and UDP extensions). Anyway for our use case we would not encounter the nodePort issue; we just need ClusterIP services and we would create Traefik IngressRoutes to expose them externally. (The thought did occur to me that it might be useful to be able to configure the squid client and monitoring services differently instead of having them together.) Anyway figuring out a way to monitor each one of multiple squids individually is an issue, but to me it seems orthogonal to the network access method and deployment vs statefulset question.

In principle it would be possible, e.g. for a StatefulSet of size 3, to have 3 ClusterIP services and 3 ingresses so that the monitoring of each squid can be accessed independently. It might involve a for loop in Helm and some tricks, but it should be doable I think.
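
Something along these lines in the chart templates would do it (a rough sketch; the service name and the replicaCount value are placeholders, but `statefulset.kubernetes.io/pod-name` is the standard label the StatefulSet controller puts on each of its pods):

```yaml
{{- range $i := until (int .Values.replicaCount) }}
---
apiVersion: v1
kind: Service
metadata:
  name: frontier-squid-snmp-{{ $i }}     # one monitoring Service per member
spec:
  type: ClusterIP
  selector:
    # selects exactly one StatefulSet member by its pod name
    statefulset.kubernetes.io/pod-name: frontier-squid-{{ $i }}
  ports:
    - name: snmp
      protocol: UDP
      port: 3401
      targetPort: 3401
{{- end }}
```

Each of those services could then get its own ingress (or IngressRoute) for external monitoring access.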

rptaylor commented 2 years ago

Using ingress also provides a way to avoid the issue of WLCG monitoring's hardcoded port 3401: you can externally expose any arbitrary port and map it to any cluster service (the details depend on the ingress provider).
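
For example with Traefik (a rough sketch; the entryPoint and service names are placeholders, and the entryPoint itself has to be defined in Traefik's static configuration, e.g. `--entrypoints.squid-snmp.address=:3401/udp`):

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteUDP
metadata:
  name: frontier-squid-snmp            # placeholder name
spec:
  entryPoints:
    - squid-snmp                       # Traefik entryPoint listening externally on UDP 3401
  routes:
    - services:
        - name: frontier-squid-snmp-0  # placeholder ClusterIP monitoring service
          port: 3401
```

That way WLCG monitoring can keep talking to port 3401 on the external address while the in-cluster service and port stay whatever the chart wants them to be.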

LincolnBryant commented 2 years ago

Right, as currently configured the squid is essentially bound to a single host, which is indeed a SPOF. We're running a bunch of tiny clusters distributed around the US that are mostly 1 node anyhow at the moment! For our use cases we have clusters that are using Squid in K8S as a replacement for Squid in e.g. a VM or on bare metal. So a public IP is required, because the workers accessing the squid aren't running in K8S.

Ingress is certainly possible too, although some folks expressed concerns that the ingress wouldn't be able to handle the high number of packets per second that a heavily utilized Squid would require. I haven't tested it, but I'm willing to see how it goes. We're largely using NGINX as our Ingress controller for SLATE; I haven't had much experience with trying to route general TCP/UDP packets through ingresses, and my impression (from several years ago now) was that it didn't work all that well.

What I would prefer of course is to just have the WLCG monitoring be a bit more amenable to cloud native ways of deploying things :)

So anyhow, for your use case: do I understand correctly that using a StatefulSet with VolumeClaimTemplates would cover your needs? I have experience setting that up for other software and I'm happy to try it here. Most of our users are actually just using hostPath (I know, not desirable), so in that case we'd want to switch them over to something like the local persistent volume provider instead.

rptaylor commented 2 years ago

Certainly some ingress controllers are more performant than others under high load, but a lot of massive web-scale apps run on k8s behind ingress or service meshes. A single ingress pod should typically be able to handle ~10K HTTP RPS without much trouble, and you can scale up as many as needed; though the TCP performance may be different (in principle I would think it should be roughly comparable to anything else that involves routing to another node, like nodePort or NAT). We moved away from the NGINX community controller due to poor performance and security issues.

Anyway our squid clients (compute jobs) are all inside the cluster so it won't be an issue for us, nor for the other users you mention if they continue to use nodeport and number of squid pods = 1.

> do I understand correctly that using a StatefulSet with VolumeClaimTemplates would cover your needs?

Yep I think so!

For the record I'm also looking at https://github.com/sciencebox/charts/tree/master/frontier-squid

rptaylor commented 2 years ago

Actually it seems that frontier-squid still deletes the cache every time it is restarted (because apparently squid can corrupt the cache on restart), so persistent storage wouldn't be useful anyway. :/ So a Deployment with more than one replica and ephemeral storage would work. The sciencebox chart already does that so I'm going to give it a try.
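
For reference, a Deployment scaled via a values parameter with an emptyDir cache would look roughly like this (a sketch; the names and the replicaCount value are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontier-squid                    # placeholder name
spec:
  replicas: {{ .Values.replicaCount }}    # scalable: no shared PVC to conflict over
  selector:
    matchLabels:
      app: frontier-squid
  template:
    metadata:
      labels:
        app: frontier-squid
    spec:
      containers:
        - name: squid
          image: example/frontier-squid:latest   # placeholder image
          volumeMounts:
            - name: cache
              mountPath: /var/cache/squid
      volumes:
        - name: cache
          emptyDir: {}                    # ephemeral cache, wiped on restart anyway
```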