splunk / splunk-operator

Splunk Operator for Kubernetes

Splunk Operator: Autoscaling Issue #1352

Closed nathan-bowman closed 1 week ago

nathan-bowman commented 2 months ago

Please select the type of request

Bug

Tell us more

Describe the Problem I'm following the details here for pod autoscaling. It seems that spec.replicas is a mandatory field, but the HorizontalPodAutoscaler docs recommend that you remove spec.replicas from the target manifest.

When an HPA is enabled, it is recommended that the value of spec.replicas of the Deployment and / or StatefulSet be removed from their manifest(s).

Error I receive when I remove spec.replicas:

the HPA controller was unable to get the target's current scale: Internal error occurred: the spec replicas field ".spec.replicas" does not exist

Expected behavior One should be able to remove spec.replicas from the Splunk CR indexerclusters.enterprise.splunk.com (and probably other CRs...) so that the HorizontalPodAutoscaler can manage spec.replicas.
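
For context on why this fails: the HPA talks to the CR through the CRD's scale subresource, and the subresource's specReplicasPath must resolve to a field that actually exists on the object. A sketch of the relevant CRD stanza (the general mechanism, not the operator's exact manifest):

# In the CustomResourceDefinition, the scale subresource maps the HPA's
# view of replicas onto paths in the CR. If .spec.replicas is absent from
# the CR, a GET on the scale subresource fails with the exact error above.
subresources:
  scale:
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.replicas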

Splunk setup on K8S AWS EKS v1.29, with Splunk Operator 2.5.2.

Last thing to note: I'm using the autoscaling/v2 apiVersion.

Reproduction/Testing steps idx-cluster.yaml:

---
apiVersion: enterprise.splunk.com/v4
kind: IndexerCluster
metadata:
  name: idx-cluster
  finalizers:
  - enterprise.splunk.com/delete-pvc
spec:
  imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: affinityNodeGroup
            operator: In
            values:
            - splunk-nodegroup-indexers
  tolerations:
  - key: splunk-indexers
    value: "true"
    effect: NoExecute
  serviceAccount: splunk-enterprise-serviceaccount
  resources:
    limits:
      cpu: 15
      memory: 60G
    requests:
      cpu: 13
      memory: 57G
  #replicas: 3
  clusterManagerRef:
    name: cm
  licenseManagerRef:
    name: lm
  monitoringConsoleRef:
    name: mc

  etcVolumeStorageConfig:
    ephemeralStorage: false
    storageCapacity: 10Gi
    storageClassName: gp3

  varVolumeStorageConfig:
    ephemeralStorage: false
    storageCapacity: 800Gi
    storageClassName: topolvm-local-ssd

HorizontalPodAutoscaler yaml:

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: idx-cluster-autoscaler
spec:
  scaleTargetRef:
    apiVersion: enterprise.splunk.com/v4
    kind: IndexerCluster
    name: idx-cluster
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

K8s environment k8s v1.29

nathan-bowman commented 1 month ago

I'm not entirely sure if this will mess things up, but I got autoscaling to work by pointing the HPA at the IndexerCluster's downstream StatefulSet:

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: idx-cluster-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: splunk-idx-cluster-indexer
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 25

Is this correct?
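
(For reference, I found the downstream StatefulSet's name by listing the StatefulSets in the namespace; the operator derives it from the CR name:)

# the indexer StatefulSet is named splunk-<cr-name>-indexer,
# here: splunk-idx-cluster-indexer
kubectl get statefulsets -n splunk-enterprise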

nathan-bowman commented 1 month ago

Going down the rabbit hole...

It looks like my HPA isn't gathering metrics for the target:

NAME                     REFERENCE                    TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
idx-cluster-autoscaler   IndexerCluster/idx-cluster   <unknown>/25%   3         15        3          18h

Additional digging into the metric-server shows lots of scrape errors:

E0717 23:45:19.725688       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.28.96:10250/metrics/resource\": remote error: tls: internal error" node="ip-192-168-28-96.us-west-2.compute.internal"

I think this is related to a recent issue posted in the official metrics-server repo, and an associated PR.

I'm not totally sure, though... Other HPAs in my EKS clusters seem to work fine...
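
To rule out metrics-server itself, a couple of generic checks for whether node metrics are being served at all (not specific to this issue):

# if metrics-server is healthy, both of these return data
kubectl top nodes
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq '.items[].metadata.name'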

Edit: To clarify, the HPA sees the target's data when I point it at the StatefulSet, but not when I point it at kind: IndexerCluster

nathan-bowman commented 1 month ago

Adding more info...

# kubectl get --raw /apis/enterprise.splunk.com/v4/ | jq '.resources[] | select(.name=="indexerclusters/scale")'
{
  "name": "indexerclusters/scale",
  "singularName": "",
  "namespaced": true,
  "group": "autoscaling",
  "version": "v1",
  "kind": "Scale",
  "verbs": [
    "get",
    "patch",
    "update"
  ]
}

Is that v1 pointing to autoscaling/v1? I'm using autoscaling/v2 in my HPA.
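
Edit: the scale subresource can also be fetched directly to see exactly what the HPA controller sees (namespace and CR name are from my manifests above):

# the Scale object is always served as autoscaling/v1 Scale, regardless
# of which apiVersion the HPA object itself uses
kubectl get --raw "/apis/enterprise.splunk.com/v4/namespaces/splunk-enterprise/indexerclusters/idx-cluster/scale" | jq .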

nathan-bowman commented 1 month ago

I tried autoscaling/v1 and have the same issue 👎

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: >-
      [{"type":"AbleToScale","status":"False","lastTransitionTime":"2024-07-18T19:46:25Z","reason":"FailedGetScale","message":"the
      HPA controller was unable to get the target's current scale: Internal
      error occurred: the spec replicas field \".spec.replicas\" does not
      exist"}]
  creationTimestamp: '2024-07-18T19:46:10Z'
  labels:
    app.kubernetes.io/instance: backend-staging-splunk-enterprise
  name: idx-cluster-autoscaler
  namespace: splunk-enterprise
  resourceVersion: '171496564'
  uid: e655b8c2-a04a-48c6-b882-1b4bd29fa2f4
spec:
  maxReplicas: 15
  minReplicas: 3
  scaleTargetRef:
    apiVersion: enterprise.splunk.com/v4
    kind: IndexerCluster
    name: idx-cluster
  targetCPUUtilizationPercentage: 25
status:
  currentReplicas: 0
  desiredReplicas: 0

nathan-bowman commented 1 week ago

I worked with Splunk support on this, and they say that, despite the Kubernetes docs recommending otherwise, you must hardcode .spec.replicas on the CR in order to get it working.
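
In other words, putting back the field I had commented out in the repro manifest:

spec:
  replicas: 3  # must exist on the CR so the scale subresource can resolve .spec.replicas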

Since I use ArgoCD, I had to set ignoreDifferences on the CR to stop it from showing up as out of sync.
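
Roughly like this in the Application spec (a sketch of just the relevant stanza; the Application name is hypothetical, taken from the instance label above):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: backend-staging-splunk-enterprise  # hypothetical name for illustration
spec:
  ignoreDifferences:
  - group: enterprise.splunk.com
    kind: IndexerCluster
    jsonPointers:
    - /spec/replicas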

akondur commented 1 week ago

CSPL-2819