vesoft-inc / nebula-operator

Operation utilities for Nebula Graph
https://vesoft-inc.github.io/nebula-operator
Apache License 2.0

[web hook] when a storaged scale-out is Pending because of insufficient resources, scale-in cannot be executed; it seems stuck #320

Open jinyingsunny opened 11 months ago

jinyingsunny commented 11 months ago

With the admission webhook enabled, I scaled out storaged, but the new pod failed to schedule because there was not enough CPU:


$ kubectl -n nebula describe pod nebulazone-storaged-9
Name:             nebulazone-storaged-9
Namespace:        nebula
Priority:         0
Service Account:  nebula-sa
Node:             <none>
Labels:           app.kubernetes.io/cluster=nebulazone
                  app.kubernetes.io/component=storaged
                  app.kubernetes.io/managed-by=nebula-operator
                  app.kubernetes.io/name=nebula-graph
                  controller-revision-hash=nebulazone-storaged-5b568d554c
                  statefulset.kubernetes.io/pod-name=nebulazone-storaged-9
Annotations:      cloud.google.com/cluster_autoscaler_unhelpable_since: 2023-10-09T09:58:34+0000
                  cloud.google.com/cluster_autoscaler_unhelpable_until: Inf
                  nebula-graph.io/cm-hash: 760645648930d20e
Status:           Pending
IP:
IPs:              <none>
Controlled By:    StatefulSet/nebulazone-storaged
Containers:
  storaged:
    Image:       asia-east2-docker.pkg.dev/nebula-cloud-test/poc/rc/nebula-storaged-ent:v3.5.0-sc
    Ports:       9779/TCP, 19789/TCP, 9778/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/sh
      -ecx
      exec /usr/local/nebula/bin/nebula-storaged --flagfile=/usr/local/nebula/etc/nebula-storaged.conf --meta_server_addrs=nebulazone-metad-0.nebulazone-metad-headless.nebula.svc.cluster.local:9559,nebulazone-metad-1.nebulazone-metad-headless.nebula.svc.cluster.local:9559,nebulazone-metad-2.nebulazone-metad-headless.nebula.svc.cluster.local:9559 --local_ip=$(hostname).nebulazone-storaged-headless.nebula.svc.cluster.local --ws_ip=$(hostname).nebulazone-storaged-headless.nebula.svc.cluster.local --daemonize=false --ws_http_port=19789
    Limits:
      cpu:     3
      memory:  16Gi
    Requests:
      cpu:        2
      memory:     8Gi
    Readiness:    http-get http://:19789/status delay=10s timeout=5s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /usr/local/nebula/data from storaged-data (rw,path="data")
      /usr/local/nebula/etc/nebula-storaged.conf from nebulazone-storaged (rw,path="nebula-storaged.conf")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j86r9 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  storaged-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  storaged-data-nebulazone-storaged-9
    ReadOnly:   false
  nebulazone-storaged:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nebulazone-storaged
    Optional:  false
  kube-api-access-j86r9:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/cluster=nebulazone,app.kubernetes.io/component=storaged,app.kubernetes.io/managed-by=nebula-operator,app.kubernetes.io/name=nebula-graph
Events:
  Type     Reason             Age   From                Message
  ----     ------             ----  ----                -------
  Warning  FailedScheduling   48s   nebula-scheduler    0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
  Warning  FailedScheduling   45s   nebula-scheduler    0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
  Normal   NotTriggerScaleUp  46s   cluster-autoscaler  pod didn't trigger scale-up:

Your Environments (required)

nebula-operator: snap1.19

Expected behavior

When the new pod is Pending because of insufficient resources, the scale-out should stop and the cluster should return to its previous state.
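
For reference, this is roughly the revert that appears stuck — a minimal sketch, assuming the NebulaCluster is named `nebulazone` (inferred from the pod names above) and that the previous replica count was 9; adjust both to your cluster:

```shell
# Scale storaged back down by lowering spec.storaged.replicas on the
# NebulaCluster resource. With the admission webhook enabled, this is the
# operation that gets rejected while the scale-out is still in progress.
kubectl -n nebula patch nebulacluster nebulazone --type merge \
  -p '{"spec":{"storaged":{"replicas":9}}}'
```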

jinyingsunny commented 11 months ago

I resolved the problem by editing the nebula-operator deployment and setting `--enable-admission-webhook=false` to stop the webhook.
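
The workaround above can be sketched as follows; the deployment name and namespace are assumptions typical of a Helm install, so adjust them to your setup:

```shell
# Open the operator deployment and set --enable-admission-webhook=false
# in the controller-manager container's args (name/namespace assumed).
kubectl -n nebula-operator-system edit deployment nebula-operator-controller-manager

# Then wait for the operator pod to roll out with the webhook disabled.
kubectl -n nebula-operator-system rollout status \
  deployment/nebula-operator-controller-manager
```

With the webhook disabled, the scale-in is no longer rejected and the stuck Pending pod can be removed by reverting the replica count.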


MegaByte875 commented 11 months ago

I think the insufficient-resource problem is not a bug; the admission webhook is there to prevent operations while the cluster is in an intermediate state.