operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0

Automatically scale down operators replicas on SNO clusters #2453

Open tiraboschi opened 2 years ago

tiraboschi commented 2 years ago

Feature Request

Is your feature request related to a problem? Please describe. On SNO (single-node OpenShift) clusters, running more than one replica of an operator is probably just a waste of resources, since all the replicas end up on the same node. Each operator should take care of its operands according to the cluster topology, but the number of operator replicas is defined statically in the CSV at build time.

Indeed, in the CSV we have something like:

      deployments:
      - name: operator-deployment-name
        spec:
          replicas: N

So OLM will always create a deployment requiring N replicas, regardless of the actual cluster topology.

Having each operator try to scale itself down at runtime, overriding what is statically written in the CSV, looks like an antipattern.

Describe the solution you'd like OLM should check the OpenShift Cluster High-availability Mode API (https://github.com/openshift/enhancements/blob/master/enhancements/single-node/cluster-high-availability-mode-api.md) and scale all operator replicas down to 1 on SNO clusters. It will still be completely up to each operator to scale down its operands. Optionally, this could be done only for operators configured with: operators.openshift.io/infrastructure-features: '["sno"]'
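To make the proposal concrete, here is a minimal Go sketch of the decision described above. It is not OLM code: the function names `effectiveReplicas` and `supportsSNO` are hypothetical, and the topology values `HighlyAvailable` and `SingleReplica` are assumed from the linked enhancement. The sketch also assumes the opt-in variant, where only operators carrying the `sno` infrastructure feature are scaled down.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TopologyMode is a stand-in for the topology value reported by the cluster.
// The two values below are assumed from the linked enhancement proposal.
type TopologyMode string

const (
	HighlyAvailable TopologyMode = "HighlyAvailable"
	SingleReplica   TopologyMode = "SingleReplica"
)

// Annotation named in this issue; its value is a JSON array of strings.
const infraFeaturesAnnotation = "operators.openshift.io/infrastructure-features"

// supportsSNO (hypothetical helper) reports whether the CSV annotations
// list "sno" among the declared infrastructure features.
func supportsSNO(annotations map[string]string) bool {
	raw, ok := annotations[infraFeaturesAnnotation]
	if !ok {
		return false
	}
	var features []string
	if err := json.Unmarshal([]byte(raw), &features); err != nil {
		return false // malformed annotation: treat as not opted in
	}
	for _, f := range features {
		if f == "sno" {
			return true
		}
	}
	return false
}

// effectiveReplicas (hypothetical helper) returns the replica count to use
// for the operator deployment: 1 on a single-replica topology when the
// operator opted in, otherwise the value from the CSV unchanged.
func effectiveReplicas(csvReplicas int32, mode TopologyMode, annotations map[string]string) int32 {
	if mode == SingleReplica && csvReplicas > 1 && supportsSNO(annotations) {
		return 1
	}
	return csvReplicas
}

func main() {
	sno := map[string]string{infraFeaturesAnnotation: `["sno"]`}
	fmt.Println(effectiveReplicas(3, SingleReplica, sno))   // 1
	fmt.Println(effectiveReplicas(3, HighlyAvailable, sno)) // 3
	fmt.Println(effectiveReplicas(3, SingleReplica, nil))   // 3
}
```

The operator's own handling of its operands is deliberately left out; only the replica count of the operator deployment is affected.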

Additional info OLM already scales down the PackageServer replicas on SNO. Currently, in the community operators catalog, only:

operators are configured with more than 1 replica.

bparees commented 2 years ago

Having each operator try to scale itself down at runtime, overriding what is statically written in the CSV, looks like an antipattern.

this is a fair point, especially if/when OLM behaves more like the CVO by continuously reconciling the operator resources, instead of just applying them once and walking away (at which point it would be impossible for the operator to modify its own deployment).

that said, we would not put support for an openshift specific api into olm upstream, so we'd need to think about whether there is a generically useful capability that olm could provide, with specific behavior for that capability defined when running on openshift.