radius-project / radius

Radius is a cloud-native, portable application platform that makes app development easier for teams building cloud-native apps.
https://radapp.io
Apache License 2.0

`dapr-placement-server` last state Terminated due to an error #6845

Open · ytimocin opened this issue 10 months ago

ytimocin commented 10 months ago

Bug information

I have seen the following error in at least 10 different functional test runs in `daprrp-tests-pod-states.log`. Here is the most recent run where I saw this error: https://github.com/radius-project/radius/actions/runs/6962417062.

I am trying to reproduce this locally.

```
Name:             dapr-placement-server-0
Namespace:        dapr-system
Priority:         0
Service Account:  dapr-placement
Node:             radius-control-plane/172.18.0.2
Start Time:       Wed, 22 Nov 2023 20:36:19 +0000
Labels:           app=dapr-placement-server
                  app.kubernetes.io/component=placement
                  app.kubernetes.io/managed-by=helm
                  app.kubernetes.io/name=dapr
                  app.kubernetes.io/part-of=dapr
                  app.kubernetes.io/version=1.11.0
                  controller-revision-hash=dapr-placement-server-64bbd94b58
                  statefulset.kubernetes.io/pod-name=dapr-placement-server-0
Annotations:      prometheus.io/path: /
                  prometheus.io/port: 9090
                  prometheus.io/scrape: true
Status:           Running
IP:               10.244.0.4
IPs:
  IP:           10.244.0.4
Controlled By:  StatefulSet/dapr-placement-server
Containers:
  dapr-placement-server:
    Container ID:  containerd://c6d1cb3b77758c37cede67f245c5b7daaf7fedb175465336a60364ab17cc2ab4
    Image:         docker.io/daprio/placement:1.11.0
    Image ID:      docker.io/daprio/placement@sha256:dc588e925ac77e5d002115781b126dbcff6473f676210ea7e61a5343e678f142
    Ports:         50005/TCP, 8201/TCP, 9090/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Command:
      /placement
    Args:
      --log-level
      info
      --enable-metrics
      --replicationFactor
      100
      --metrics-port
      9090
      --tls-enabled
    State:          Running
      Started:      Wed, 22 Nov 2023 20:36:39 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 22 Nov 2023 20:36:25 +0000
      Finished:     Wed, 22 Nov 2023 20:36:25 +0000
    Ready:          True
    Restart Count:  2
    Liveness:       http-get http://:8080/healthz delay=10s timeout=1s period=3s #success=1 #failure=5
    Readiness:      http-get http://:8080/healthz delay=3s timeout=1s period=3s #success=1 #failure=5
    Environment:
      PLACEMENT_ID:  dapr-placement-server-0 (v1:metadata.name)
      NAMESPACE:     dapr-system (v1:metadata.namespace)
    Mounts:
      /var/run/dapr/credentials from credentials (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6rt8v (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  dapr-trust-bundle
    Optional:    false
  kube-api-access-6rt8v:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  3m21s                  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Normal   Scheduled         3m19s                  default-scheduler  Successfully assigned dapr-system/dapr-placement-server-0 to radius-control-plane
  Normal   Pulling           3m18s                  kubelet            Pulling image "docker.io/daprio/placement:1.11.0"
  Normal   Pulled            3m14s                  kubelet            Successfully pulled image "docker.io/daprio/placement:1.11.0" in 1.931384876s (3.840101109s including waiting)
  Warning  BackOff           3m11s (x3 over 3m13s)  kubelet            Back-off restarting failed container dapr-placement-server in pod dapr-placement-server-0_dapr-system(deca8321-69bc-4ca5-a85c-9ca1b7ccbcc8)
  Normal   Created           2m59s (x3 over 3m14s)  kubelet            Created container dapr-placement-server
  Normal   Started           2m59s (x3 over 3m14s)  kubelet            Started container dapr-placement-server
  Normal   Pulled            2m59s (x2 over 3m14s)  kubelet            Container image "docker.io/daprio/placement:1.11.0" already present on machine
```
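For anyone investigating: the pod description above only shows the exit code, so the actual error behind the Terminated last state should be in the previous container instance's logs. A minimal sketch using standard kubectl (pod name and namespace match the output above):

```sh
# Describe the pod to see container states, restart count, and events
kubectl describe pod dapr-placement-server-0 -n dapr-system

# Fetch logs from the previous (terminated) container instance;
# this should contain the actual error behind exit code 1
kubectl logs dapr-placement-server-0 -n dapr-system --previous

# Watch for restarts across the dapr-system namespace while tests run
kubectl get pods -n dapr-system -w
```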

Steps to reproduce (required)

  1. I am trying to reproduce this locally, but the error shows up in most functional test runs.
  2. To see it locally: start with a fresh cluster and a fresh rad installation, then run the daprrp functional tests (a command sketch follows this list).
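A rough sketch of step 2 as commands. The kind cluster name and the make target for the daprrp tests are assumptions (check the repo Makefile for the exact target); the Dapr version matches the image in the pod spec above:

```sh
# Fresh local cluster (the name "radius" is arbitrary)
kind create cluster --name radius

# Install the Radius control plane into the cluster
rad install kubernetes

# Install Dapr 1.11.0 via Helm, matching the image version in the pod spec
helm repo add dapr https://dapr.github.io/helm-charts/
helm upgrade --install dapr dapr/dapr \
  --namespace dapr-system --create-namespace --version 1.11.0

# Run the Dapr RP functional tests from the radius repo root
# (target name is an assumption; check the Makefile)
make test-functional-daprrp
```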

Observed behavior (required)

The `dapr-placement-server-0` container's last state is Terminated with reason Error and exit code 1, and the pod has been restarted multiple times (Restart Count: 2 above).

Desired behavior (required)

All pods should start cleanly and stay healthy, with no terminated last states or restarts.
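As a concrete check, something like the following (standard kubectl; a sketch, not part of the test suite) should pass with all restart counts at zero:

```sh
# All pods in dapr-system should become Ready within a reasonable window
kubectl wait --for=condition=Ready pod --all -n dapr-system --timeout=120s

# Restart counts should be zero; nonzero counts indicate this bug
kubectl get pods -n dapr-system \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
```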

Workaround (optional)

Not blocking the functional tests for now.

System information

rad Version (required)

The rad version used by the scheduled and PR CI runs (seen in both).

Operating system (required)

The operating system used by the scheduled and PR CI runs (seen in both).

Additional context

daprrp_container_logs (4).zip

AB#10475

radius-triage-bot[bot] commented 10 months ago

:wave: @ytimocin Thanks for filing this bug report.

A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server.

For more information on our triage process, please visit our triage overview.

radius-triage-bot[bot] commented 10 months ago

:+1: We've reviewed this issue and have agreed to add it to our backlog. Please subscribe to this issue for notifications; we'll provide updates when we pick it up.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process, please visit our triage overview.