Exposing scale subresource to enable HPA's to modify the replicas of Infinispan CR

ZeidH commented 2 months ago

Enhanced the spec.subresources field for the Infinispan CRD to map scaling

      scale:
        specReplicasPath: .spec.replicas
        statusReplicasPath: .status.replicas

This enables APIs like the HorizontalPodAutoscaler and KEDA to modify the replica count of the Cache.

openshift-ci[bot] commented 2 months ago

Hi @ZeidH. Thanks for your PR.

I'm waiting for a infinispan member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

ZeidH commented 2 months ago

Thanks for the clear instructions @ryanemerson!, I'm rather new to this so appreciate the help. Encountered this issue with controller-gen on my version of go (1.22.5) so I also updated it to latest (1.15.0).

ryanemerson commented 2 months ago

Thanks for updating @ZeidH. It seems like we need to increase our baseline go version in order to use that version of controller-gen. I have created a PR to baseline on 1.21 (1.20 is now EOL) and update the controller-gen version https://github.com/infinispan/infinispan-operator/pull/2134. I'll let you know once that PR has been merged so that you can rebase this PR.

ZeidH commented 1 month ago

I've tested the autoscaling with this setup:

./scripts/ci/kind-with-olm.sh
make deploy-cert-manager
helm install keda kedacore/keda --namespace keda --create-namespace
skaffold dev

and a ScaledObject that targets the Infinispan CR

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: infinispan-scaler
  namespace: infinispan-operator-system
spec:
  minReplicaCount: 0
  scaleTargetRef:
    apiVersion: infinispan.org/v1
    kind: Infinispan
    name: test-cache
  triggers:
    - metadata:
        desiredReplicas: '2'
        start: 52 8 * * *
        end: 59 8 * * *
        timezone: UTC
      type: cron
---
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  annotations:
    infinispan.org/monitoring: 'true'
    infinispan.org/operatorPodTargetLabels: 'rht.comp,rht.comp_ver,rht.prod_name,rht.prod_ver,rht.subcomp_t'
  name: test-cache
  namespace: infinispan-operator-system
spec:
  service:
    container:
      livenessProbe:
        failureThreshold: 5
        initialDelaySeconds: 0
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      readinessProbe:
        failureThreshold: 5
        initialDelaySeconds: 0
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      startupProbe:
        failureThreshold: 600
        initialDelaySeconds: 3
        periodSeconds: 1
        successThreshold: 1
        timeoutSeconds: 1
    type: DataGrid
  jmx: {}
  configListener:
    enabled: true
    logging:
      level: info
  upgrades:
    type: Shutdown
  replicas: 1

The Statefulset was able to scale from zero to n and back but I can imagine that for some configurations scaling to zero is not possible

ryanemerson commented 4 weeks ago

Apologies for the delay @ZeidH, but main is now baselined on go 1.21. Can you please rebase on the latest code?

ryanemerson commented 4 weeks ago

but I can imagine that for some configurations scaling to zero is not possible

Can you explain which configurations you think would cause issues?

ZeidH commented 3 weeks ago

but I can imagine that for some configurations scaling to zero is not possible

Can you explain which configurations you think would cause issues?

I can't think of any specific to Infinispan, the amount of available configurations is so immense it would take a long time to figure out what config can be autoscaled, and what cannot. I think after learning a bit on what Infinispan does the past few months, the configs where autoscaling doesn't work well also most likely wouldn't make much sense to use autoscaling/scale to zero, like clustered caches where you (almost)always want to have a hot instance.

But from experience there can be issues with autoscaling when using specific types of storage configurations, or as @rigazilla mentioned in the GH Issue: about the operator sometimes needs to explicitly set the .spec.Replicas, i.e. in upgrade

What we attempted to do is to dynamically scale a replicated cache with the amount of nodes (1 cache per node), essentially making Infinispan behave like a DaemonSet. Although we're not done with it just yet, so far it's been working out well.

ryanemerson commented 1 day ago

Thanks @ZeidH and sorry for the delay. I'll create a follow up PR with some documentation. Let me know if you need this feature soon via OLM and I can expedite the next release.

infinispan / infinispan-operator

Exposing scale subresource to enable HPA's to modify the replicas of Infinispan CR #2133