kubewharf / godel-scheduler

a unified scheduler for online and offline tasks
Apache License 2.0

Quickstart - Basic Pod Scheduling run failed #38

Closed: katoomegumi closed this issue 2 months ago

katoomegumi commented 3 months ago

After running $ kubectl apply -f manifests/quickstart-feature-examples/basic-pod-scheduling/deployment.yaml, the pods stay Pending and never reach Running.

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
basic-858b79df7f-fd9jd   0/1     Pending   0          18m
basic-858b79df7f-msg5n   0/1     Pending   0          18m
basic-858b79df7f-xddfl   0/1     Pending   0          18m

There's no log output and no pod events.

$ kubectl logs basic-858b79df7f-xddfl nginx
$ kubectl describe pod
Name:             basic-858b79df7f-fd9jd
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app=basic
                  pod-template-hash=858b79df7f
Annotations:      godel.bytedance.com/pod-launcher: kubelet
                  godel.bytedance.com/pod-resource-type: guaranteed
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/basic-858b79df7f
Containers:
  nginx:
    Image:      nginx
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     1
      memory:  1Mi
    Requests:
      cpu:        1
      memory:     1Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6hc4b (ro)
Volumes:
  kube-api-access-6hc4b:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Name:             basic-858b79df7f-msg5n
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app=basic
                  pod-template-hash=858b79df7f
Annotations:      godel.bytedance.com/pod-launcher: kubelet
                  godel.bytedance.com/pod-resource-type: guaranteed
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/basic-858b79df7f
Containers:
  nginx:
    Image:      nginx
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     1
      memory:  1Mi
    Requests:
      cpu:        1
      memory:     1Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zmr86 (ro)
Volumes:
  kube-api-access-zmr86:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Name:             basic-858b79df7f-xddfl
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app=basic
                  pod-template-hash=858b79df7f
Annotations:      godel.bytedance.com/pod-launcher: kubelet
                  godel.bytedance.com/pod-resource-type: guaranteed
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/basic-858b79df7f
Containers:
  nginx:
    Image:      nginx
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     1
      memory:  1Mi
    Requests:
      cpu:        1
      memory:     1Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9njqr (ro)
Volumes:
  kube-api-access-9njqr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

I edited the godel-demo-default.yaml file, changing the image from kindest/node:v1.21.1 to kindest/node:v1.29.2.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: godel-demo-default
nodes:
  - role: control-plane
    image: kindest/node:v1.29.2
  - role: worker
    image: kindest/node:v1.29.2

Environment

$ uname -a
Linux sailg1-PowerEdge-T640 6.5.0-18-generic #18~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb  7 11:40:03 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

kubectl version: 1.29.2
docker version: 26.0.0
kind version: v0.22.0
go version: 1.22.0
kustomize version: v5.3.0

NickrenREN commented 3 months ago

cc @yuey002 PTAL

yuey002 commented 3 months ago

@katoomegumi Scheduling the basic pod workload works for me. Most likely the godel system was not installed correctly or ran into errors in your env. Would you mind running the commands below to check the status of the godel components?

  1. Check the godel dispatcher/scheduler/binder status. If any pod is not Running, use describe to check the details.

    kubectl get pods -n godel-system

  2. If all godel pods are running, check the logs (replace the pod name with your dispatcher pod).

    kubectl logs dispatcher-76bcfcb9d7-jtlcx -n godel-system | grep -i basic
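
The first check above can be scripted. This is a minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS, ...); the function name not_running is made up for illustration and is not part of godel-scheduler. It filters on the STATUS column rather than on the pod phase, because a pod whose container is crash-looping can still report the phase Running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# the name of every pod whose STATUS column is neither Running nor Completed.
not_running() {
  # NR > 1 skips the header row; $3 is the STATUS column, $1 the pod name.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage against a live cluster (commented out so the sketch runs without kubectl):
#   kubectl get pods -n godel-system | not_running
```

Any name it prints is a candidate for kubectl describe pod and kubectl logs in the godel-system namespace.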
katoomegumi commented 3 months ago

@yuey002 These pods are not running:

$ kubectl get pods -n godel-system
NAME                          READY   STATUS             RESTARTS          AGE
binder-556bcdcfdd-z9r79       0/1     CrashLoopBackOff   189 (2m19s ago)   15h
dispatcher-6f444dc587-dzn4h   0/1     CrashLoopBackOff   189 (74s ago)     15h
scheduler-7694d9dbdd-vmjlh    0/1     CrashLoopBackOff   189 (2m45s ago)   15h
$ kubectl describe pods -n godel-system
Name:             binder-556bcdcfdd-z9r79
Namespace:        godel-system
Priority:         0
Service Account:  godel
Node:             godel-demo-default-control-plane/172.19.0.3
Start Time:       Tue, 26 Mar 2024 00:48:24 +0800
Labels:           app=binder
                  pod-template-hash=556bcdcfdd
Annotations:      <none>
Status:           Running
IP:               10.244.0.5
IPs:
  IP:           10.244.0.5
Controlled By:  ReplicaSet/binder-556bcdcfdd
Containers:
  binder:
    Container ID:  containerd://16d9e144c58fef6e4ff0b2d123b878e3f7580eaca5b1bb64be9eb55f9cf52632
    Image:         godel-local:latest
    Image ID:      docker.io/library/import-2024-03-25@sha256:98605c312771b5ab50725c193db8c9212f3d7c1ab78269f98dde3046b40fe254
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/binder
    Args:
      --leader-elect=false
      --tracer=noop
      --v=5
      --config=/config/binder.config
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 26 Mar 2024 16:36:25 +0800
      Finished:     Tue, 26 Mar 2024 16:36:25 +0800
    Ready:          False
    Restart Count:  190
    Limits:
      cpu:     1
      memory:  1G
    Requests:
      cpu:        1
      memory:     1G
    Environment:  <none>
    Mounts:
      /config from binder-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b46np (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  binder-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      godel-binder-config
    Optional:  false
  kube-api-access-b46np:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              node-role.kubernetes.io/control-plane=
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  BackOff  2m47s (x4369 over 15h)  kubelet  Back-off restarting failed container binder in pod binder-556bcdcfdd-z9r79_godel-system(5db9f4cc-8e24-4712-8b4f-ba5e267263c7)

Name:             dispatcher-6f444dc587-dzn4h
Namespace:        godel-system
Priority:         0
Service Account:  godel
Node:             godel-demo-default-control-plane/172.19.0.3
Start Time:       Tue, 26 Mar 2024 00:48:24 +0800
Labels:           app=godel-dispatcher
                  pod-template-hash=6f444dc587
Annotations:      <none>
Status:           Running
IP:               10.244.0.6
IPs:
  IP:           10.244.0.6
Controlled By:  ReplicaSet/dispatcher-6f444dc587
Containers:
  dispatcher:
    Container ID:  containerd://2312785d6170d2239845259942210693a77e9d269777abd43462948f29126a49
    Image:         godel-local:latest
    Image ID:      docker.io/library/import-2024-03-25@sha256:98605c312771b5ab50725c193db8c9212f3d7c1ab78269f98dde3046b40fe254
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/dispatcher
    Args:
      --leader-elect=false
      --tracer=noop
      --v=5
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 26 Mar 2024 16:37:34 +0800
      Finished:     Tue, 26 Mar 2024 16:37:34 +0800
    Ready:          False
    Restart Count:  190
    Limits:
      cpu:     1
      memory:  1G
    Requests:
      cpu:        1
      memory:     1G
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2p59d (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  kube-api-access-2p59d:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              node-role.kubernetes.io/control-plane=
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  BackOff  2m56s (x4369 over 15h)  kubelet  Back-off restarting failed container dispatcher in pod dispatcher-6f444dc587-dzn4h_godel-system(eafebf02-eb7c-48c7-9e06-8841e41c9e80)

Name:             scheduler-7694d9dbdd-vmjlh
Namespace:        godel-system
Priority:         0
Service Account:  godel
Node:             godel-demo-default-control-plane/172.19.0.3
Start Time:       Tue, 26 Mar 2024 00:48:24 +0800
Labels:           app=godel-scheduler
                  pod-template-hash=7694d9dbdd
Annotations:      <none>
Status:           Running
IP:               10.244.0.7
IPs:
  IP:           10.244.0.7
Controlled By:  ReplicaSet/scheduler-7694d9dbdd
Containers:
  scheduler:
    Container ID:  containerd://717da0fc5ef5b8fb407af31ca48c5e103796b4142412e111c01f10a3f21ea8bb
    Image:         godel-local:latest
    Image ID:      docker.io/library/import-2024-03-25@sha256:98605c312771b5ab50725c193db8c9212f3d7c1ab78269f98dde3046b40fe254
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/scheduler
    Args:
      --leader-elect=false
      --tracer=noop
      --v=4
      --disable-preemption=false
      --config=/config/scheduler.config
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 26 Mar 2024 16:41:01 +0800
      Finished:     Tue, 26 Mar 2024 16:41:01 +0800
    Ready:          False
    Restart Count:  191
    Limits:
      cpu:     1
      memory:  1G
    Requests:
      cpu:        1
      memory:     1G
    Environment:  <none>
    Mounts:
      /config from scheduler-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c58tz (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  scheduler-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      godel-scheduler-config
    Optional:  false
  kube-api-access-c58tz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              node-role.kubernetes.io/control-plane=
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Warning  BackOff  2m47s (x4369 over 15h)  kubelet  Back-off restarting failed container scheduler in pod scheduler-7694d9dbdd-vmjlh_godel-system(a13880a0-5efe-4b07-b003-5371d361a712)
yuey002 commented 3 months ago

@katoomegumi Thank you! Could you also show me the logs?

kubectl logs scheduler-7694d9dbdd-vmjlh -n godel-system

kubectl logs dispatcher-6f444dc587-dzn4h -n godel-system

kubectl logs binder-556bcdcfdd-z9r79 -n godel-system
katoomegumi commented 3 months ago

@yuey002 Sorry, I tried, but the commands only produced the following output.

/usr/local/bin/scheduler: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /usr/local/bin/scheduler)
/usr/local/bin/scheduler: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /usr/local/bin/scheduler)
/usr/local/bin/dispatcher: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /usr/local/bin/dispatcher)
/usr/local/bin/dispatcher: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /usr/local/bin/dispatcher)
/usr/local/bin/binder: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /usr/local/bin/binder)
/usr/local/bin/binder: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /usr/local/bin/binder)

The errors say that GLIBC_2.32 and GLIBC_2.34 were not found, but I can find them:

strings /lib/x86_64-linux-gnu/libc.so.6 |grep GLIBC_
GLIBC_2.2.5
GLIBC_2.2.6
GLIBC_2.3
GLIBC_2.3.2
GLIBC_2.3.3
GLIBC_2.3.4
GLIBC_2.4
GLIBC_2.5
GLIBC_2.6
GLIBC_2.7
GLIBC_2.8
GLIBC_2.9
GLIBC_2.10
GLIBC_2.11
GLIBC_2.12
GLIBC_2.13
GLIBC_2.14
GLIBC_2.15
GLIBC_2.16
GLIBC_2.17
GLIBC_2.18
GLIBC_2.22
GLIBC_2.23
GLIBC_2.24
GLIBC_2.25
GLIBC_2.26
GLIBC_2.27
GLIBC_2.28
GLIBC_2.29
GLIBC_2.30
GLIBC_2.31
GLIBC_2.32
GLIBC_2.33
GLIBC_2.34
GLIBC_2.35
GLIBC_PRIVATE
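
A caveat on this comparison, offered as a sketch rather than a diagnosis: the dynamic loader reports "version ... not found" against the libc it actually loads, which for a pod is the libc inside the container image. If the listing above was taken outside the container, it may not reflect what the binary sees. The helper below, with the made-up name max_glibc, reduces such a listing to its highest version so that two environments can be compared at a glance. It assumes GNU sort for the -V (version sort) flag.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads lines such as GLIBC_2.34 on stdin and prints the
# highest numeric version. Non-numeric entries like GLIBC_PRIVATE are ignored.
max_glibc() {
  sed -n 's/^GLIBC_\([0-9][0-9.]*\)$/\1/p' | sort -V | tail -n 1
}

# Usage (commented out; the path is the one used in the listing above):
#   strings /lib/x86_64-linux-gnu/libc.so.6 | grep GLIBC_ | max_glibc
```

Running the same pipeline in each environment and comparing the two results shows whether one libc is older than the versions the binaries require.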
yuey002 commented 3 months ago

@katoomegumi Thanks for the info. I believe there are some compatibility issues between the local env and the Docker image. I made a quick fix that moves the build process into the Dockerfile.

https://github.com/yuey002/godel-scheduler/tree/dev/yuey002/fix-dockerfile

When you get a chance, could you clone my fork, check out 'dev/yuey002/fix-dockerfile', and go through the quick start again to see whether it fixes your issue? I have verified it in my env, but would like to see whether it works in yours too. Thank you!

katoomegumi commented 3 months ago

@yuey002 Sorry, I cloned the 'fix-dockerfile' branch, but the error is still the same.

$ kubectl get pods -n godel-system
NAME                          READY   STATUS             RESTARTS      AGE
binder-8b46dbd65-cblps        0/1     CrashLoopBackOff   3 (32s ago)   74s
dispatcher-69f7d646b8-kl52j   0/1     CrashLoopBackOff   3 (32s ago)   74s
scheduler-59cbb6c57-w2f44     0/1     CrashLoopBackOff   3 (34s ago)   74s
$ kubectl describe pods -n godel-system
Name:             binder-8b46dbd65-cblps
Namespace:        godel-system
Priority:         0
Service Account:  godel
Node:             godel-demo-default-control-plane/172.19.0.2
Start Time:       Thu, 28 Mar 2024 16:22:58 +0800
Labels:           app=binder
                  pod-template-hash=8b46dbd65
Annotations:      <none>
Status:           Running
IP:               10.244.0.6
IPs:
  IP:           10.244.0.6
Controlled By:  ReplicaSet/binder-8b46dbd65
Containers:
  binder:
    Container ID:  containerd://7112861a74be037a524e833eee8770de5820d9dea455c2d522568c919de3eb90
    Image:         godel-local:latest
    Image ID:      docker.io/library/import-2024-03-28@sha256:a6c52c60e2e7e7b847aa3be08c40fb1cb89031f569dad9c77f8ee1dd6677dba3
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/binder
    Args:
      --leader-elect=false
      --tracer=noop
      --v=5
      --config=/config/binder.config
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 28 Mar 2024 16:23:40 +0800
      Finished:     Thu, 28 Mar 2024 16:23:40 +0800
    Ready:          False
    Restart Count:  3
    Limits:
      cpu:     1
      memory:  1G
    Requests:
      cpu:        1
      memory:     1G
    Environment:  <none>
    Mounts:
      /config from binder-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sshtl (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  binder-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      godel-binder-config
    Optional:  false
  kube-api-access-sshtl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              node-role.kubernetes.io/control-plane=
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  89s                default-scheduler  Successfully assigned godel-system/binder-8b46dbd65-cblps to godel-demo-default-control-plane
  Normal   Pulled     47s (x4 over 88s)  kubelet            Container image "godel-local:latest" already present on machine
  Normal   Created    47s (x4 over 88s)  kubelet            Created container binder
  Normal   Started    47s (x4 over 88s)  kubelet            Started container binder
  Warning  BackOff    9s (x7 over 85s)   kubelet            Back-off restarting failed container binder in pod binder-8b46dbd65-cblps_godel-system(53fd92af-2045-46af-b298-fea2a7a64dae)

Name:             dispatcher-69f7d646b8-kl52j
Namespace:        godel-system
Priority:         0
Service Account:  godel
Node:             godel-demo-default-control-plane/172.19.0.2
Start Time:       Thu, 28 Mar 2024 16:22:58 +0800
Labels:           app=godel-dispatcher
                  pod-template-hash=69f7d646b8
Annotations:      <none>
Status:           Running
IP:               10.244.0.5
IPs:
  IP:           10.244.0.5
Controlled By:  ReplicaSet/dispatcher-69f7d646b8
Containers:
  dispatcher:
    Container ID:  containerd://c690936e9a194622d9dbd6526323b5ec8943a2779b44c2f748ef04f48e28a13c
    Image:         godel-local:latest
    Image ID:      docker.io/library/import-2024-03-28@sha256:a6c52c60e2e7e7b847aa3be08c40fb1cb89031f569dad9c77f8ee1dd6677dba3
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/dispatcher
    Args:
      --leader-elect=false
      --tracer=noop
      --v=5
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 28 Mar 2024 16:24:22 +0800
      Finished:     Thu, 28 Mar 2024 16:24:22 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 28 Mar 2024 16:23:40 +0800
      Finished:     Thu, 28 Mar 2024 16:23:40 +0800
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     1
      memory:  1G
    Requests:
      cpu:        1
      memory:     1G
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wdrzz (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  kube-api-access-wdrzz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              node-role.kubernetes.io/control-plane=
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  89s               default-scheduler  Successfully assigned godel-system/dispatcher-69f7d646b8-kl52j to godel-demo-default-control-plane
  Normal   Pulled     5s (x5 over 88s)  kubelet            Container image "godel-local:latest" already present on machine
  Normal   Created    5s (x5 over 88s)  kubelet            Created container dispatcher
  Normal   Started    5s (x5 over 88s)  kubelet            Started container dispatcher
  Warning  BackOff    4s (x7 over 85s)  kubelet            Back-off restarting failed container dispatcher in pod dispatcher-69f7d646b8-kl52j_godel-system(4e2362a1-7079-4389-ba1e-8b7197e3e6ac)

Name:             scheduler-59cbb6c57-w2f44
Namespace:        godel-system
Priority:         0
Service Account:  godel
Node:             godel-demo-default-control-plane/172.19.0.2
Start Time:       Thu, 28 Mar 2024 16:22:58 +0800
Labels:           app=godel-scheduler
                  pod-template-hash=59cbb6c57
Annotations:      <none>
Status:           Running
IP:               10.244.0.7
IPs:
  IP:           10.244.0.7
Controlled By:  ReplicaSet/scheduler-59cbb6c57
Containers:
  scheduler:
    Container ID:  containerd://b2fdffd6c0cb1451684b708f91f1005133cdc602289d4c406ddd4714a1689ac2
    Image:         godel-local:latest
    Image ID:      docker.io/library/import-2024-03-28@sha256:a6c52c60e2e7e7b847aa3be08c40fb1cb89031f569dad9c77f8ee1dd6677dba3
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/scheduler
    Args:
      --leader-elect=false
      --tracer=noop
      --v=4
      --disable-preemption=false
      --config=/config/scheduler.config
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 28 Mar 2024 16:24:21 +0800
      Finished:     Thu, 28 Mar 2024 16:24:21 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 28 Mar 2024 16:23:38 +0800
      Finished:     Thu, 28 Mar 2024 16:23:38 +0800
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     1
      memory:  1G
    Requests:
      cpu:        1
      memory:     1G
    Environment:  <none>
    Mounts:
      /config from scheduler-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vz24r (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  scheduler-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      godel-scheduler-config
    Optional:  false
  kube-api-access-vz24r:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              node-role.kubernetes.io/control-plane=
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  89s               default-scheduler  Successfully assigned godel-system/scheduler-59cbb6c57-w2f44 to godel-demo-default-control-plane
  Normal   Pulled     6s (x5 over 88s)  kubelet            Container image "godel-local:latest" already present on machine
  Normal   Created    6s (x5 over 88s)  kubelet            Created container scheduler
  Normal   Started    6s (x5 over 88s)  kubelet            Started container scheduler
  Warning  BackOff    5s (x7 over 85s)  kubelet            Back-off restarting failed container scheduler in pod scheduler-59cbb6c57-w2f44_godel-system(c8e33ae7-6cd5-4ca6-93a2-d67726d33c47)
$ kubectl logs scheduler-59cbb6c57-w2f44 -n godel-system
/usr/local/bin/scheduler: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /usr/local/bin/scheduler)
/usr/local/bin/scheduler: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /usr/local/bin/scheduler)
yuey002 commented 3 months ago

@katoomegumi Thanks for retrying and sharing the details! This error looks odd to me, since we are running the same Dockerfile, which includes the whole build process... I think the old godel-local image may not have been cleaned up properly.

I made a few additional changes to the branch 'dev/yuey002/fix-dockerfile' in my fork, https://github.com/yuey002/godel-scheduler/tree/dev/yuey002/fix-dockerfile. Specifically, I switched the base image to debian:latest, which has GLIBC versions up to 2.36. When you get a chance, could you quickly try the following?

  1. Check out my branch.

    git checkout dev/yuey002/fix-dockerfile

  2. Set up the local cluster env. Could you please also paste the output of 'make local-up'?

    make local-up

  3. Check the status of the godel component pods.

    kubectl get po -n godel-system

  4. If the error is still the same, open a shell in the pod and check the glibc version.

    kubectl exec -it scheduler-77cfcb585d-ldzqp -n godel-system -- /bin/bash

    root@scheduler-77cfcb585d-ldzqp:~# ldd --version

Thanks.
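
The glibc check in the last step boils down to comparing the version that ldd reports with the highest version the binaries ask for (GLIBC_2.34 in the errors above). This is a rough sketch, assuming GNU ldd and GNU sort; the function names are illustrative, and the sample line in the comment only shows the usual format. It is not output captured in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the version number from the first line of
# `ldd --version`, which typically looks like: ldd (Debian GLIBC 2.36-9) 2.36
glibc_from_ldd() {
  head -n 1 | awk '{ print $NF }'
}

# Hypothetical helper: succeed (exit 0) when version $1 is at least version $2.
# Relies on GNU sort -V to order version strings such as 2.4 < 2.34.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage inside the container (commented out so the sketch has no side effects):
#   have="$(ldd --version | glibc_from_ldd)"
#   version_at_least "$have" 2.34 && echo "glibc $have is new enough"
```

If version_at_least fails inside the pod, the image's libc is older than what the binaries were linked against.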

katoomegumi commented 3 months ago

@yuey002 Thanks for your reply! I did as you suggested: I cloned the branch again and checked it out.

$ git checkout dev/yuey002/fix-dockerfile
M       manifests/quickstart-feature-examples/godel-demo-default.yaml
Already on 'dev/yuey002/fix-dockerfile'
Your branch is up to date with 'origin/dev/yuey002/fix-dockerfile'.

Because my kubectl version is 1.29.2, I changed 'manifests/quickstart-feature-examples/godel-demo-default.yaml' as follows. I don't think this change causes the error.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: godel-demo-default
nodes:
  - role: control-plane
    image: kindest/node:v1.29.2
  - role: worker
    image: kindest/node:v1.29.2

Here is the output of make local-up:

$ make local-up
find: ‘build’: No such file or directory
dirname: missing operand
Try 'dirname --help' for more information.
./hack/make-rules/build-images.sh
Building docker image(s) for ...
Total reclaimed space: 0B
Total reclaimed space: 0B
Untagged: godel-local:5d8c5de
Untagged: godel-local:latest
Deleted: sha256:883a86e804057fedd3538351e951ebbbdc68ab60d047aa441d8f54e3895558ba
Error response from daemon: No such image: 883a86e80405:latest
[+] Building 16.1s (21/21) FINISHED                                                                                                                                             docker:default
 => [internal] load build definition from godel-local.Dockerfile                                                                                                                          0.0s
 => => transferring dockerfile: 661B                                                                                                                                                      0.0s
 => [internal] load metadata for docker.io/library/debian:latest                                                                                                                         15.3s
 => [internal] load metadata for docker.io/library/golang:1.21                                                                                                                            0.9s
 => [internal] load .dockerignore                                                                                                                                                         0.0s
 => => transferring context: 2B                                                                                                                                                           0.0s
 => [builder  1/11] FROM docker.io/library/golang:1.21@sha256:856073656d1a517517792e6cdd2f7a5ef080d3ca2dff33e518c8412f140fdd2d                                                            0.0s
 => [internal] load build context                                                                                                                                                         0.7s
 => => transferring context: 634.51kB                                                                                                                                                     0.7s
 => [stage-1 1/4] FROM docker.io/library/debian:latest@sha256:2906804d2a64e8a13a434a1a127fe3f6a28bf7cf3696be4223b06276f32f1f2d                                                            0.0s
 => CACHED [stage-1 2/4] RUN apt-get update &&     apt-get install -y binutils &&     apt-get clean &&     ldd --version                                                                  0.0s
 => CACHED [stage-1 3/4] WORKDIR /root                                                                                                                                                    0.0s
 => CACHED [builder  2/11] WORKDIR /workspace                                                                                                                                             0.0s
 => CACHED [builder  3/11] COPY go.mod go.mod                                                                                                                                             0.0s
 => CACHED [builder  4/11] COPY go.sum go.sum                                                                                                                                             0.0s
 => CACHED [builder  5/11] COPY cmd/ cmd/                                                                                                                                                 0.0s
 => CACHED [builder  6/11] COPY pkg/ pkg/                                                                                                                                                 0.0s
 => CACHED [builder  7/11] COPY hack/ hack/                                                                                                                                               0.0s
 => CACHED [builder  8/11] COPY vendor/ vendor/                                                                                                                                           0.0s
 => CACHED [builder  9/11] COPY Makefile Makefile                                                                                                                                         0.0s
 => CACHED [builder 10/11] COPY Makefile.expansion Makefile.expansion                                                                                                                     0.0s
 => CACHED [builder 11/11] RUN export GO_BUILD_PLATFORMS=linux/amd64 && make build                                                                                                        0.0s
 => CACHED [stage-1 4/4] COPY --from=builder /workspace/bin/linux_amd64/* /usr/local/bin/                                                                                                 0.0s
 => exporting to image                                                                                                                                                                    0.0s
 => => exporting layers                                                                                                                                                                   0.0s
 => => writing image sha256:883a86e804057fedd3538351e951ebbbdc68ab60d047aa441d8f54e3895558ba                                                                                              0.0s
 => => naming to docker.io/library/godel-local:5d8c5de                                                                                                                                    0.0s
bash ./hack/make-rules/local-up.sh godel-demo-default
+++ dirname ./hack/make-rules/local-up.sh
++ cd ./hack/make-rules/../..
++ pwd -P
+ REPO_ROOT=/home/szp/godel-scheduler
+ CLUSTER_NAME=godel-demo-default
+ create_cluster /home/szp/godel-scheduler/manifests/quickstart-feature-examples/godel-demo-default.yaml
+ local cluster_config=/home/szp/godel-scheduler/manifests/quickstart-feature-examples/godel-demo-default.yaml
+ nohup kind delete cluster --name=godel-demo-default
+ kind create cluster --config=/home/szp/godel-scheduler/manifests/quickstart-feature-examples/godel-demo-default.yaml
Creating cluster "godel-demo-default" ...
 ✓ Ensuring node image (kindest/node:v1.29.2) 🖼
 ✓ Preparing nodes 📦 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✓ Joining worker nodes 🚜 
Set kubectl context to "kind-godel-demo-default"
You can now use your cluster with:

kubectl cluster-info --context kind-godel-demo-default

Thanks for using kind! 😊
+ kind load docker-image --nodes godel-demo-default-control-plane godel-local:latest --name godel-demo-default
Image: "godel-local:latest" with ID "sha256:883a86e804057fedd3538351e951ebbbdc68ab60d047aa441d8f54e3895558ba" not yet present on node "godel-demo-default-control-plane", loading...
+ kustomize build /home/szp/godel-scheduler/manifests/base
+ kubectl apply -f -
# Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
namespace/godel-system created
customresourcedefinition.apiextensions.k8s.io/customnoderesources.node.katalyst.kubewharf.io created
customresourcedefinition.apiextensions.k8s.io/nmnodes.node.godel.kubewharf.io created
customresourcedefinition.apiextensions.k8s.io/podgroups.scheduling.godel.kubewharf.io created
customresourcedefinition.apiextensions.k8s.io/schedulers.scheduling.godel.kubewharf.io created
serviceaccount/godel created
clusterrole.rbac.authorization.k8s.io/godel created
clusterrolebinding.rbac.authorization.k8s.io/godel created
configmap/godel-binder-config created
configmap/godel-scheduler-config created
deployment.apps/binder created
deployment.apps/dispatcher created
deployment.apps/scheduler created

And I can't connect to it. I checked the running containers and there is no container named 'scheduler'. Could the problem be in creating the scheduler container?

$  kubectl exec -it scheduler-59cbb6c57-pxj8t -n godel-system -- /bin/bash
error: unable to upgrade connection: container not found ("scheduler")

$  docker ps
CONTAINER ID   IMAGE                                                COMMAND                  CREATED          STATUS          PORTS                                                                                                                             NAMES
0979d0615a86   kindest/node:v1.29.2                                 "/usr/local/bin/entr…"   20 minutes ago   Up 20 minutes   127.0.0.1:41913->6443/tcp                                                                                                         godel-demo-default-control-plane
2c67ed533c3f   kindest/node:v1.29.2                                 "/usr/local/bin/entr…"   20 minutes ago   Up 20 minutes                                                                                                                                     godel-demo-default-worker
cb2cbd3be06e   deathstarbench/social-network-microservices:latest   "PostStorageService"     5 weeks ago      Up 8 days       0.0.0.0:10002->9090/tcp, :::10002->9090/tcp                                                                                       socialnetwork_post-storage-service_1
673b45159edf   deathstarbench/social-network-microservices:latest   "MediaService"           5 weeks ago      Up 8 days                                                                                                                                         socialnetwork_media-service_1
d7d4a2f122b0   deathstarbench/social-network-microservices:latest   "SocialGraphService"     5 weeks ago      Up 8 days                                                                                                                                         socialnetwork_social-graph-service_1
eaac26a830ae   deathstarbench/social-network-microservices:latest   "UserService"            5 weeks ago      Up 8 days                                                                                                                                         socialnetwork_user-service_1
61d63b4ca889   yg397/openresty-thrift:xenial                        "/usr/local/openrest…"   5 weeks ago      Up 8 days       0.0.0.0:8080->8080/tcp, :::8080->8080/tcp                                                                                         socialnetwork_nginx-thrift_1
81640a1c7722   yg397/media-frontend:xenial                          "/usr/local/openrest…"   5 weeks ago      Up 8 days       0.0.0.0:8081->8080/tcp, :::8081->8080/tcp                                                                                         socialnetwork_media-frontend_1
2918ab8242d0   deathstarbench/social-network-microservices:latest   "UserTimelineService"    5 weeks ago      Up 8 days                                                                                                                                         socialnetwork_user-timeline-service_1
3aa3d82dcb10   deathstarbench/social-network-microservices:latest   "UserMentionService"     5 weeks ago      Up 8 days                                                                                                                                         socialnetwork_user-mention-service_1
06389ac3c30c   deathstarbench/social-network-microservices:latest   "UniqueIdService"        5 weeks ago      Up 8 days                                                                                                                                         socialnetwork_unique-id-service_1
39fc3ee2d205   deathstarbench/social-network-microservices:latest   "HomeTimelineService"    5 weeks ago      Up 8 days                                                                                                                                         socialnetwork_home-timeline-service_1
c29b6c1a32bc   deathstarbench/social-network-microservices:latest   "ComposePostService"     5 weeks ago      Up 8 days                                                                                                                                         socialnetwork_compose-post-service_1
c073fdab9dfa   deathstarbench/social-network-microservices:latest   "UrlShortenService"      5 weeks ago      Up 8 days                                                                                                                                         socialnetwork_url-shorten-service_1
4e0aaa449d59   deathstarbench/social-network-microservices:latest   "TextService"            5 weeks ago      Up 8 days                                                                                                                                         socialnetwork_text-service_1
51a699339b56   jaegertracing/all-in-one:latest                      "/go/bin/all-in-one-…"   5 weeks ago      Up 8 days       4317-4318/tcp, 5775/udp, 5778/tcp, 9411/tcp, 14250/tcp, 14268/tcp, 6831-6832/udp, 0.0.0.0:16686->16686/tcp, :::16686->16686/tcp   socialnetwork_jaeger-agent_1
da52653b9488   mongo:4.4.6                                          "docker-entrypoint.s…"   5 weeks ago      Up 8 days       27017/tcp                                                                                                                         socialnetwork_url-shorten-mongodb_1
f1f5675ed65d   redis                                                "docker-entrypoint.s…"   5 weeks ago      Up 8 days       6379/tcp                                                                                                                          socialnetwork_social-graph-redis_1
c1a6e49ad8b6   mongo:4.4.6                                          "docker-entrypoint.s…"   5 weeks ago      Up 8 days       27017/tcp                                                                                                                         socialnetwork_user-timeline-mongodb_1
ed35bcf9acb0   mongo:4.4.6                                          "docker-entrypoint.s…"   5 weeks ago      Up 8 days       27017/tcp                                                                                                                         socialnetwork_social-graph-mongodb_1
5103515e1671   memcached                                            "docker-entrypoint.s…"   5 weeks ago      Up 8 days       11211/tcp                                                                                                                         socialnetwork_user-memcached_1
ed5a645dfdd0   redis                                                "docker-entrypoint.s…"   5 weeks ago      Up 8 days       6379/tcp                                                                                                                          socialnetwork_home-timeline-redis_1
8197de30b166   memcached                                            "docker-entrypoint.s…"   5 weeks ago      Up 8 days       11211/tcp                                                                                                                         socialnetwork_media-memcached_1
f91275667be7   mongo:4.4.6                                          "docker-entrypoint.s…"   5 weeks ago      Up 8 days       27017/tcp                                                                                                                         socialnetwork_media-mongodb_1
9273e04185ad   mongo:4.4.6                                          "docker-entrypoint.s…"   5 weeks ago      Up 8 days       27017/tcp                                                                                                                         socialnetwork_user-mongodb_1
3611ab173907   mongo:4.4.6                                          "docker-entrypoint.s…"   5 weeks ago      Up 8 days       27017/tcp                                                                                                                         socialnetwork_post-storage-mongodb_1
82d545d7f960   redis                                                "docker-entrypoint.s…"   5 weeks ago      Up 8 days       6379/tcp                                                                                                                          socialnetwork_user-timeline-redis_1
7deabbfba02a   memcached                                            "docker-entrypoint.s…"   5 weeks ago      Up 8 days       11211/tcp                                                                                                                         socialnetwork_post-storage-memcached_1
cbdf730f77f5   memcached                                            "docker-entrypoint.s…"   5 weeks ago      Up 8 days       11211/tcp                                                                                                                         socialnetwork_url-shorten-memcached_1
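
A note on the docker ps output above: in a kind cluster the pod containers run inside the node containers (godel-demo-default-control-plane and godel-demo-default-worker), so they do not appear in the host's docker ps. A minimal sketch for listing them from inside a node is below. It assumes crictl is available in the kindest/node image and that the scheduler pod landed on the control-plane node; `kind_node_containers` and the `DOCKER` override are hypothetical helpers for illustration, not part of godel-scheduler.

```shell
# Sketch: list containers (including exited ones) inside a kind node.
# Assumption: the node image ships crictl.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the command instead of running it.
kind_node_containers() {
  local node="$1" name_filter="$2"
  "${DOCKER:-docker}" exec "$node" crictl ps -a --name "$name_filter"
}

# Example (not executed here):
#   kind_node_containers godel-demo-default-control-plane scheduler
```

If the container shows up as exited rather than running, that would be consistent with the "container not found" error from kubectl exec.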
yuey002 commented 3 months ago

@katoomegumi Thanks for this follow-up. Does your scheduler-59cbb6c57-pxj8t pod still show the same error logs?

kubectl logs scheduler-59cbb6c57-pxj8t -n godel-system
katoomegumi commented 3 months ago

kubectl logs scheduler-59cbb6c57-pxj8t -n godel-system

yes.

/usr/local/bin/scheduler: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /usr/local/bin/scheduler)
/usr/local/bin/scheduler: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /usr/local/bin/scheduler)
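
One plausible reading of these two lines is that the scheduler binary was linked against a newer glibc than the one in the runtime image. A small sketch for pulling the highest required GLIBC version out of such messages follows; `max_glibc_required` is a hypothetical helper name, and `sort -V` assumes GNU coreutils.

```shell
# Sketch: print the highest GLIBC symbol version mentioned in text on stdin.
# Works on loader errors like the ones above.
max_glibc_required() {
  grep -oE 'GLIBC_[0-9]+(\.[0-9]+)+' | sed 's/^GLIBC_//' | sort -u -V | tail -n 1
}

# Examples (not executed here):
#   kubectl logs scheduler-59cbb6c57-pxj8t -n godel-system 2>&1 | max_glibc_required
#   docker run --rm debian:latest ldd --version | head -n 1   # glibc provided by the runtime image
```

If the version printed by the first command is higher than the one reported by ldd in the runtime image, the binary cannot start there.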
yuey002 commented 3 months ago

@katoomegumi Thanks! We'd need a bit more information about your local env to further triage this issue. Could you check the digest for debian:latest?

docker images --digests | grep -i debian | grep -i latest

The expected digest SHA is e97ee92bf1e11a2de654e9f3da827d8dce32b54e0490ac83bfc65c8706568116. If yours differs, the debian image is probably the root cause.
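
The comparison can also be scripted. This is only a sketch: `has_digest` is a hypothetical helper, and the expected value is the one quoted in this thread, which will change whenever the debian:latest tag is republished.

```shell
# Sketch: succeed if any line on stdin contains sha256:<expected>.
has_digest() {
  local expected="$1"
  grep -qF "sha256:${expected}"
}

# Example (not executed here):
#   docker images --digests debian \
#     | has_digest e97ee92bf1e11a2de654e9f3da827d8dce32b54e0490ac83bfc65c8706568116 \
#     && echo "digest matches" || echo "digest differs or image missing"
```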

katoomegumi commented 3 months ago

@yuey002 When I executed the command, it gave no output. It seems there was no image named debian. I pulled the debian image and got the result below. What should I do to fix it?

debian                                        latest      sha256:2906804d2a64e8a13a434a1a127fe3f6a28bf7cf3696be4223b06276f32f1f2d   6f4986d78878   2 years ago     124MB
yuey002 commented 2 months ago

@katoomegumi I see, it's most likely due to the older debian image then. One thing you could do is delete your local debian:latest image and then pull it again. For how to delete the image, see https://docs.docker.com/reference/cli/docker/image/rm/
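
As a rough sketch, the delete-and-pull step could look like this. `refresh_image` and the `DOCKER` override are hypothetical names used here for illustration.

```shell
# Sketch: drop the local copy of an image tag and pull it again.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the commands instead of running them.
refresh_image() {
  local image="$1"
  # Removal may fail if the image is still in use; the pull below still moves the tag.
  "${DOCKER:-docker}" image rm "$image" || true
  "${DOCKER:-docker}" pull "$image"
}

# Example (not executed here):
#   refresh_image debian:latest
```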

To make things easier, I modified my Dockerfile to pin a specific version of debian (debian:bookworm). If you pull the dev/yuey002/fix-dockerfile branch again you should see the change. Run make local-up again to see whether the three godel pods come up and run this time.
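
To check the result after make local-up, something like the following sketch can list any pods that are not yet Running. `not_running` is a hypothetical helper that reads the default kubectl get pods table (NAME, READY, STATUS, ...) from stdin.

```shell
# Sketch: print the names of pods whose STATUS column is not "Running".
# Skips the header row of `kubectl get pods` output.
not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Example (not executed here):
#   kubectl get pods -n godel-system | not_running
```

Empty output would mean the binder, dispatcher and scheduler pods all report Running.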

Below is a screenshot of my output. You can see debian:bookworm as well as the digest SHA of the image.

image

Thanks.

katoomegumi commented 2 months ago

@yuey002 Thanks very much, it works.

yuey002 commented 2 months ago

@NickrenREN Could you help take a look at https://github.com/kubewharf/godel-scheduler/pull/39 ? It contains some improvements to the local env set-up, so that we can prevent similar issues in the future. After it's merged I think we can close this issue. Thanks.

NickrenREN commented 2 months ago

@yuey002 Sure, get it merged, thanks for the fix.