**Open** · devkits opened 1 month ago
It depends on exactly where you're seeing an issue. Adding more replicas won't necessarily improve controller performance (e.g. a long delay between driver failure and resubmission), because leader election means only one replica actively reconciles at a time. However, increasing the number of controller threads/concurrent reconciles and giving the operator more or faster CPU/memory should reduce reconciliation delay.
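As a rough sketch of that tuning via Helm values (the numbers below are illustrative assumptions, not recommendations, and the key names are from the pre-2.0.0 spark-operator chart, so verify them against your chart version):

```yaml
# Illustrative values for the spark-operator Helm chart (pre-2.0.0).
# controllerThreads controls how many reconciles run concurrently and is
# the knob that most directly affects reconciliation throughput.
controllerThreads: 30   # chart default is 10; more workers drain the work queue faster

# Give the single active (leader) pod enough headroom; these figures are assumptions.
resources:
  requests:
    cpu: "2"
    memory: 2Gi
  limits:
    cpu: "4"
    memory: 4Gi
```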
I've personally run the pre-2.0.0 version of the operator with ~100 apps and ~1,200 Spark pods and had a mostly good experience, but I definitely saw occasional reconciliation delays when the controller was busy with lots of state updates or resubmissions. While measuring and improving controller performance is definitely on my list of things to tackle, I believe a good escape hatch is to run multiple deployments of the operator targeting different namespaces.
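One way to realize that escape hatch is to install one operator release per group of namespaces, each with its own values file. A minimal sketch, assuming the pre-2.0.0 chart where the watched namespaces are set via `sparkJobNamespaces` (namespace and release names here are hypothetical; double-check the value name for your chart version):

```yaml
# values-team-a.yaml — one shard of the operator, watching only team A's namespaces.
# Install with e.g.:
#   helm install spark-operator-team-a spark-operator/spark-operator -f values-team-a.yaml
# Repeat with a second release and values file for team B's namespaces.
sparkJobNamespaces:
  - team-a-ns1
  - team-a-ns2
```

Each shard then carries a fraction of the ~4,000 resources, so a busy team's resubmission storm doesn't delay reconciliation for everyone else.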
**Please describe your question here**
What are some guidelines for scaling the Spark operator in large environments (specifics below)? Thanks.
**Provide a link to the example/module related to the question**
We are using spark-operator version `v1beta2-1.6.2-3.5.0`. Updating the `replicaCount` is one way to scale out: https://github.com/kubeflow/spark-operator/blob/ccb3ceb54b6f09de8c67d96484fc911d122dcee3/charts/spark-operator-chart/values.yaml#L10 but is there a suggested ratio of operator replicas to the number of resources that it manages? Perhaps there's a way to dynamically scale the operator so that it can keep up?
**Additional context**
The scale in question: one Spark operator deployment managing 40 Spark applications/namespaces, each application running 10 drivers with 1-10 executors each (up to ~100 executors per application), for a total of roughly 4,000 resources that need to be managed.
We found that redeploying spark operator pods or scaling up the number of replicas helps in this situation.