vesoft-inc / nebula-operator

Operation utilities for Nebula Graph
https://vesoft-inc.github.io/nebula-operator
Apache License 2.0

[web hook] when a storaged scale-out is Pending because of insufficient resources, scale-in cannot be executed; it seems stuck #320

Open jinyingsunny opened 11 months ago

jinyingsunny commented 11 months ago

With the admission webhook enabled, I scaled out storaged, but the new pod failed to schedule because there was not enough CPU:


$ kubectl -n nebula describe pod nebulazone-storaged-9
Name:             nebulazone-storaged-9
Namespace:        nebula
Priority:         0
Service Account:  nebula-sa
Node:             <none>
Labels:           app.kubernetes.io/cluster=nebulazone
                  app.kubernetes.io/component=storaged
                  app.kubernetes.io/managed-by=nebula-operator
                  app.kubernetes.io/name=nebula-graph
                  controller-revision-hash=nebulazone-storaged-5b568d554c
                  statefulset.kubernetes.io/pod-name=nebulazone-storaged-9
Annotations:      cloud.google.com/cluster_autoscaler_unhelpable_since: 2023-10-09T09:58:34+0000
                  cloud.google.com/cluster_autoscaler_unhelpable_until: Inf
                  nebula-graph.io/cm-hash: 760645648930d20e
Status:           Pending
IP:
IPs:              <none>
Controlled By:    StatefulSet/nebulazone-storaged
Containers:
  storaged:
    Image:       asia-east2-docker.pkg.dev/nebula-cloud-test/poc/rc/nebula-storaged-ent:v3.5.0-sc
    Ports:       9779/TCP, 19789/TCP, 9778/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/sh
      -ecx
      exec /usr/local/nebula/bin/nebula-storaged --flagfile=/usr/local/nebula/etc/nebula-storaged.conf --meta_server_addrs=nebulazone-metad-0.nebulazone-metad-headless.nebula.svc.cluster.local:9559,nebulazone-metad-1.nebulazone-metad-headless.nebula.svc.cluster.local:9559,nebulazone-metad-2.nebulazone-metad-headless.nebula.svc.cluster.local:9559 --local_ip=$(hostname).nebulazone-storaged-headless.nebula.svc.cluster.local --ws_ip=$(hostname).nebulazone-storaged-headless.nebula.svc.cluster.local --daemonize=false --ws_http_port=19789
    Limits:
      cpu:     3
      memory:  16Gi
    Requests:
      cpu:        2
      memory:     8Gi
    Readiness:    http-get http://:19789/status delay=10s timeout=5s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /usr/local/nebula/data from storaged-data (rw,path="data")
      /usr/local/nebula/etc/nebula-storaged.conf from nebulazone-storaged (rw,path="nebula-storaged.conf")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j86r9 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  storaged-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  storaged-data-nebulazone-storaged-9
    ReadOnly:   false
  nebulazone-storaged:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nebulazone-storaged
    Optional:  false
  kube-api-access-j86r9:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/cluster=nebulazone,app.kubernetes.io/component=storaged,app.kubernetes.io/managed-by=nebula-operator,app.kubernetes.io/name=nebula-graph
Events:
  Type     Reason             Age   From                Message
  ----     ------             ----  ----                -------
  Warning  FailedScheduling   48s   nebula-scheduler    0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
  Warning  FailedScheduling   45s   nebula-scheduler    0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
  Normal   NotTriggerScaleUp  46s   cluster-autoscaler  pod didn't trigger scale-up:

Your Environments (required)

nebula-operator: snap1.19

Expected behavior

When the new pod is Pending because of insufficient resources, the scale-out should stop and the cluster should return to its previous state.
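
For reference, this is roughly the revert that appears stuck — a minimal sketch, assuming the NebulaCluster is named `nebulazone` (inferred from the pod names above) and that the previous replica count was 9; adjust both to your cluster:

```shell
# Scale storaged back down by lowering spec.storaged.replicas on the
# NebulaCluster resource. With the admission webhook enabled, this is the
# operation that gets rejected while the scale-out is still in progress.
kubectl -n nebula patch nebulacluster nebulazone --type merge \
  -p '{"spec":{"storaged":{"replicas":9}}}'
```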

jinyingsunny commented 11 months ago

I resolved the problem by editing the nebula-operator deployment and setting `--enable-admission-webhook=false` to stop the webhook.
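
The workaround above can be sketched as follows; the deployment name and namespace are assumptions typical of a Helm install, so adjust them to your setup:

```shell
# Open the operator deployment and set --enable-admission-webhook=false
# in the controller-manager container's args (name/namespace assumed).
kubectl -n nebula-operator-system edit deployment nebula-operator-controller-manager

# Then wait for the operator pod to roll out with the webhook disabled.
kubectl -n nebula-operator-system rollout status \
  deployment/nebula-operator-controller-manager
```

With the webhook disabled, the scale-in is no longer rejected and the stuck Pending pod can be removed by reverting the replica count.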


MegaByte875 commented 11 months ago

I think the insufficient-resource problem is not a bug; the admission webhook is there to prevent operations while the cluster is in an intermediate state.