kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

[QUESTION] Best practices for scaling k8s spark-operator in large environments #2192

Open · devkits opened this issue 3 days ago

devkits commented 3 days ago

Please describe your question here

What are some guidelines for scaling the Spark operator in large environments (specifics below)? Thanks.

Provide a link to the example/module related to the question

We are using spark-operator version v1beta2-1.6.2-3.5.0. Updating the replicaCount is one way to scale out: https://github.com/kubeflow/spark-operator/blob/ccb3ceb54b6f09de8c67d96484fc911d122dcee3/charts/spark-operator-chart/values.yaml#L10, but is there a suggested ratio of operator replicas to the number of resources that the operator manages?
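For context, the kind of override we apply today (a sketch: `replicaCount` is from the linked values.yaml, while the `leaderElection` keys are assumed from the 1.x chart defaults, so verify them against your chart version):

```yaml
# values.yaml overrides (sketch)
replicaCount: 3                    # extra replicas sit as standbys behind leader election
leaderElection:
  lockName: spark-operator-lock    # assumed default lock name in the 1.x chart
  lockNamespace: spark-operator    # namespace that holds the election lock
```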

Perhaps there's a way to dynamically scale the operator so that it can keep up?

Additional context

The scale in question: one spark-operator deployment managing 40 Spark applications/namespaces, each application having 10 drivers × 1-10 executors, i.e. up to ~100 executors per application, so roughly 40 × (10 drivers + ~100 executors) ≈ 4,000 resources that need to be managed.

We found that redeploying spark operator pods or scaling up the number of replicas helps in this situation.

jacobsalway commented 2 days ago

It depends on exactly where you're seeing the issue. Adding more replicas won't necessarily improve controller performance (e.g. a long delay between a driver failing and being resubmitted), because leader election means only one replica is actively reconciling at a time. However, increasing the number of controller threads/concurrent reconciles and giving the operator more or better CPU/memory should reduce reconciliation delay.
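For concreteness, a sketch of those knobs as Helm values: `controllerThreads` and `resources` are keys in the 1.x chart (`controllerThreads` defaults to 10), but the numbers below are placeholders you'd tune against your own reconciliation latency:

```yaml
# values.yaml sketch: throughput knobs for the single active replica
controllerThreads: 30    # number of workers draining the reconcile queue concurrently
resources:               # more/faster CPU helps when the controller is CPU-bound
  requests:
    cpu: "2"
    memory: 2Gi
  limits:
    cpu: "4"
    memory: 4Gi
```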

I've personally run the pre-2.0.0 version of the operator with ~100 apps and ~1,200 Spark pods and had a mostly good experience, but I definitely had occasional issues with reconciliation delay when the controller was busy with lots of state updates or re-submissions. While measuring and improving controller performance is definitely on my list of things to tackle, I believe a good escape hatch is to run multiple deployments of the operator targeting different namespaces (see the sketch below).
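For example, a hypothetical per-team values file. The namespace-scoping key here is an assumption: recent 1.x charts expose `sparkJobNamespaces` as a list, while older ones use `sparkJobNamespace` as a single string, so check the values.yaml of your chart version:

```yaml
# values-team-a.yaml -- shard the operator: one Helm release per group of
# namespaces, installed e.g. with:
#   helm install spark-operator-team-a spark-operator/spark-operator -f values-team-a.yaml
sparkJobNamespaces:        # assumed key name; verify against your chart version
  - team-a                 # this release reconciles SparkApplications only here
replicaCount: 1            # one active controller per shard
```

Each release then watches only its own namespaces, so each controller's reconcile queue stays short even as the total number of applications grows.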