numaproj / numaplane

Control Plane for Numaproj
Apache License 2.0

MonoVertex Pod in crash loop in e2e test #371

Open juliev0 opened 2 days ago

juliev0 commented 2 days ago

I ran the e2e test (with PPND=true, though I'm not sure that matters) and saw the MonoVertex Pod in a crash loop.



dpadhiar commented 2 days ago

I am investigating this now. I have found that the MonoVertexRollout usually fails with only 3/4 replicas healthy. Below is the `kubectl describe` output for one of the pods, followed by pod logs, to help show what the issue may be.

kubectl describe po test-monovertex-rollout-0-mv-0-3ejmc
Name:             test-monovertex-rollout-0-mv-0-3ejmc
Namespace:        numaplane-system
Priority:         0
Service Account:  default
Node:             k3d-k3s-default-server-0/172.18.0.3
Start Time:       Tue, 29 Oct 2024 11:06:22 -0700
Labels:           app.kubernetes.io/component=mono-vertex
                  app.kubernetes.io/managed-by=mono-vertex-controller
                  app.kubernetes.io/name=test-monovertex-rollout-0
                  app.kubernetes.io/part-of=numaflow
                  numaflow.numaproj.io/mono-vertex-name=test-monovertex-rollout-0
Annotations:      kubectl.kubernetes.io/default-container: numa
                  numaflow.numaproj.io/hash: bf9d80c762c5db5ae97358f7e6c8570ec50e46f3c2c43cd0a7b0072687cd59f1
                  numaflow.numaproj.io/replica: 0
Status:           Running
IP:               10.42.0.188
IPs:
  IP:           10.42.0.188
Controlled By:  MonoVertex/test-monovertex-rollout-0
Containers:
  numa:
    Container ID:  containerd://0e3f50dfed809e37b7a4cafe7737d132356475d7801c7f4f616e5d8f5f7e857d
    Image:         quay.io/numaproj/numaflow:v1.3.3
    Image ID:      quay.io/numaproj/numaflow@sha256:82c46418188694f91e001009f49dadce380a435f8ad444bb024fab813b3e5845
    Port:          2469/TCP
    Host Port:     0/TCP
    Command:
      /bin/numaflow-rs
    Args:
      --monovertex
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 29 Oct 2024 11:06:54 -0700
      Finished:     Tue, 29 Oct 2024 11:06:54 -0700
    Ready:          False
    Restart Count:  2
    Requests:
      cpu:      100m
      memory:   128Mi
    Liveness:   http-get https://:2469/livez delay=20s timeout=30s period=60s #success=1 #failure=5
    Readiness:  http-get https://:2469/readyz delay=5s timeout=30s period=10s #success=1 #failure=6
    Environment:
      NUMAFLOW_MONO_VERTEX_OBJECT:  eyJtZXRhZGF0YSI6eyJuYW1lIjoidGVzdC1tb25vdmVydGV4LXJvbGxvdXQtMCIsIm5hbWVzcGFjZSI6Im51bWFwbGFuZS1zeXN0ZW0iLCJjcmVhdGlvblRpbWVzdGFtcCI6bnVsbH0sInNwZWMiOnsicmVwbGljYXMiOjAsInNvdXJjZSI6eyJ0cmFuc2Zvcm1lciI6eyJjb250YWluZXIiOnsiaW1hZ2UiOiJxdWF5LmlvL251bWFpby9udW1hZmxvdy1ycy9zb3VyY2UtdHJhbnNmb3JtZXItbm93OnN0YWJsZSIsInJlc291cmNlcyI6e319LCJidWlsdGluIjpudWxsfSwidWRzb3VyY2UiOnsiY29udGFpbmVyIjp7ImltYWdlIjoicXVheS5pby9udW1haW8vbnVtYWZsb3ctamF2YS9zb3VyY2Utc2ltcGxlLXNvdXJjZTpzdGFibGUiLCJyZXNvdXJjZXMiOnt9fX19LCJzaW5rIjp7InVkc2luayI6eyJjb250YWluZXIiOnsiaW1hZ2UiOiJxdWF5LmlvL251bWFpby9udW1hZmxvdy1qYXZhL3NpbXBsZS1zaW5rOnN0YWJsZSIsInJlc291cmNlcyI6e319fSwicmV0cnlTdHJhdGVneSI6e319LCJsaW1pdHMiOnsicmVhZEJhdGNoU2l6ZSI6NTAwLCJyZWFkVGltZW91dCI6IjFzIn0sInNjYWxlIjp7fSwidXBkYXRlU3RyYXRlZ3kiOnt9LCJsaWZlY3ljbGUiOnt9fSwic3RhdHVzIjp7InJlcGxpY2FzIjowLCJkZXNpcmVkUmVwbGljYXMiOjAsImxhc3RVcGRhdGVkIjpudWxsLCJsYXN0U2NhbGVkQXQiOm51bGx9fQ==
      NUMAFLOW_NAMESPACE:           numaplane-system (v1:metadata.namespace)
      NUMAFLOW_POD:                 test-monovertex-rollout-0-mv-0-3ejmc (v1:metadata.name)
      NUMAFLOW_REPLICA:              (v1:metadata.annotations['numaflow.numaproj.io/replica'])
      NUMAFLOW_MONO_VERTEX_NAME:    test-monovertex-rollout-0
    Mounts:
      /var/run/numaflow from var-run-numaflow (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-trpqg (ro)
  udsource:
    Container ID:   containerd://042bff128d7bceb75523c049b6f417f92c5d29dcad427830ae9e20471ad90761
    Image:          quay.io/numaio/numaflow-java/source-simple-source:stable
    Image ID:       quay.io/numaio/numaflow-java/source-simple-source@sha256:5f756dd0ec38e2c5cd7fbc41196c7364a71fed18c8eeb4b988bb198bcd72eee6
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 29 Oct 2024 11:06:23 -0700
    Ready:          True
    Restart Count:  0
    Liveness:       http-get https://:2469/sidecar-livez delay=30s timeout=30s period=60s #success=1 #failure=5
    Environment:
      NUMAFLOW_UD_CONTAINER_TYPE:  udsource
      NUMAFLOW_NAMESPACE:          numaplane-system (v1:metadata.namespace)
      NUMAFLOW_POD:                test-monovertex-rollout-0-mv-0-3ejmc (v1:metadata.name)
      NUMAFLOW_REPLICA:             (v1:metadata.annotations['numaflow.numaproj.io/replica'])
      NUMAFLOW_MONO_VERTEX_NAME:   test-monovertex-rollout-0
      NUMAFLOW_CPU_LIMIT:          node allocatable (limits.cpu)
      NUMAFLOW_CPU_REQUEST:        0 (requests.cpu)
      NUMAFLOW_MEMORY_LIMIT:       node allocatable (limits.memory)
      NUMAFLOW_MEMORY_REQUEST:     0 (requests.memory)
    Mounts:
      /var/run/numaflow from var-run-numaflow (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-trpqg (ro)
  transformer:
    Container ID:   containerd://cd11c976d325d7d6300ef68217fcb4d13a1748bce947ef7a7258990e15624531
    Image:          quay.io/numaio/numaflow-rs/source-transformer-now:stable
    Image ID:       quay.io/numaio/numaflow-rs/source-transformer-now@sha256:c0396a7390bc171f23bf870422026cde574802323922c7e1da09449d2098b649
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 29 Oct 2024 11:06:23 -0700
    Ready:          True
    Restart Count:  0
    Liveness:       http-get https://:2469/sidecar-livez delay=30s timeout=30s period=60s #success=1 #failure=5
    Environment:
      NUMAFLOW_UD_CONTAINER_TYPE:  transformer
      NUMAFLOW_NAMESPACE:          numaplane-system (v1:metadata.namespace)
      NUMAFLOW_POD:                test-monovertex-rollout-0-mv-0-3ejmc (v1:metadata.name)
      NUMAFLOW_REPLICA:             (v1:metadata.annotations['numaflow.numaproj.io/replica'])
      NUMAFLOW_MONO_VERTEX_NAME:   test-monovertex-rollout-0
      NUMAFLOW_CPU_LIMIT:          node allocatable (limits.cpu)
      NUMAFLOW_CPU_REQUEST:        0 (requests.cpu)
      NUMAFLOW_MEMORY_LIMIT:       node allocatable (limits.memory)
      NUMAFLOW_MEMORY_REQUEST:     0 (requests.memory)
    Mounts:
      /var/run/numaflow from var-run-numaflow (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-trpqg (ro)
  udsink:
    Container ID:   containerd://d488acf47b969cf0fc3b0314fbb661abef792bb606eba1dd999d7cef619b5c12
    Image:          quay.io/numaio/numaflow-java/simple-sink:stable
    Image ID:       quay.io/numaio/numaflow-java/simple-sink@sha256:e09b82d8ad753058ccc30784e6c8fad49f5c226b24e1e76abc3bd90ac1aa8c39
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 29 Oct 2024 11:06:23 -0700
    Ready:          True
    Restart Count:  0
    Liveness:       http-get https://:2469/sidecar-livez delay=30s timeout=30s period=60s #success=1 #failure=5
    Environment:
      NUMAFLOW_UD_CONTAINER_TYPE:  udsink
      NUMAFLOW_NAMESPACE:          numaplane-system (v1:metadata.namespace)
      NUMAFLOW_POD:                test-monovertex-rollout-0-mv-0-3ejmc (v1:metadata.name)
      NUMAFLOW_REPLICA:             (v1:metadata.annotations['numaflow.numaproj.io/replica'])
      NUMAFLOW_MONO_VERTEX_NAME:   test-monovertex-rollout-0
      NUMAFLOW_CPU_LIMIT:          node allocatable (limits.cpu)
      NUMAFLOW_CPU_REQUEST:        0 (requests.cpu)
      NUMAFLOW_MEMORY_LIMIT:       node allocatable (limits.memory)
      NUMAFLOW_MEMORY_REQUEST:     0 (requests.memory)
    Mounts:
      /var/run/numaflow from var-run-numaflow (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-trpqg (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  var-run-numaflow:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-trpqg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  46s                default-scheduler  Successfully assigned numaplane-system/test-monovertex-rollout-0-mv-0-3ejmc to k3d-k3s-default-server-0
  Normal   Pulled     46s                kubelet            Container image "quay.io/numaio/numaflow-java/source-simple-source:stable" already present on machine
  Normal   Created    46s                kubelet            Created container udsource
  Normal   Started    46s                kubelet            Started container udsource
  Normal   Pulled     46s                kubelet            Container image "quay.io/numaio/numaflow-rs/source-transformer-now:stable" already present on machine
  Normal   Created    46s                kubelet            Created container transformer
  Normal   Started    46s                kubelet            Started container transformer
  Normal   Pulled     46s                kubelet            Container image "quay.io/numaio/numaflow-java/simple-sink:stable" already present on machine
  Normal   Created    46s                kubelet            Created container udsink
  Normal   Started    46s                kubelet            Started container udsink
  Warning  Unhealthy  36s                kubelet            Readiness probe failed: Get "https://10.42.0.188:2469/readyz": dial tcp 10.42.0.188:2469: connect: connection refused
  Normal   Pulled     15s (x3 over 46s)  kubelet            Container image "quay.io/numaproj/numaflow:v1.3.3" already present on machine
  Normal   Created    15s (x3 over 46s)  kubelet            Created container numa
  Normal   Started    15s (x3 over 46s)  kubelet            Started container numa
  Warning  BackOff    6s (x4 over 35s)   kubelet            Back-off restarting failed container numa in pod test-monovertex-rollout-0-mv-0-3ejmc_numaplane-system(4a541a90-77c7-4777-8c16-c19dbe36991f)

kubectl logs -f test-monovertex-rollout-0-mv-0-rgdpn
2024-10-29T18:08:57.927762Z  INFO monovertex::server_info: Server info file: ServerInfo { protocol: "uds", language: "java", minimum_numaflow_version: "1.3.1-z", version: "0.8.0", metadata: Some({}) }
2024-10-29T18:08:57.927791Z  INFO monovertex::server_info: Version_info: VersionInfo { version: "latest+unknown", build_date: "1970-01-01T00:00:00Z", git_commit: "", git_tag: "", git_tree_state: "", go_version: "unknown", compiler: "", platform: "linux/aarch64" }
2024-10-29T18:08:57.928258Z  INFO monovertex::server_info: Server info file: ServerInfo { protocol: "uds", language: "java", minimum_numaflow_version: "1.3.1-z", version: "0.8.0", metadata: Some({}) }
2024-10-29T18:08:57.928266Z  INFO monovertex::server_info: Version_info: VersionInfo { version: "latest+unknown", build_date: "1970-01-01T00:00:00Z", git_commit: "", git_tag: "", git_tree_state: "", go_version: "unknown", compiler: "", platform: "linux/aarch64" }
2024-10-29T18:08:57.928472Z  INFO monovertex::server_info: Server info file: ServerInfo { protocol: "uds", language: "rust", minimum_numaflow_version: "1.3.1-z", version: "0.1.1", metadata: Some({}) }
2024-10-29T18:08:57.928489Z  INFO monovertex::server_info: Version_info: VersionInfo { version: "latest+unknown", build_date: "1970-01-01T00:00:00Z", git_commit: "", git_tag: "", git_tree_state: "", go_version: "unknown", compiler: "", platform: "linux/aarch64" }
2024-10-29T18:08:58.073575Z  INFO monovertex::metrics: Stopped the Lag-Reader Expose and Builder tasks
2024-10-29T18:08:58.073589Z ERROR monovertex: Application error: GRPCError("status: InvalidArgument, message: \"Handshake request not received\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\"} }")
2024-10-29T18:08:58.073592Z  INFO monovertex: Gracefully Exiting...

vigith commented 2 days ago

I believe this could be due to the wrong stable images in quay. Currently the :stable tag points to the latest changes of the SDK, which are not compatible with Numaflow 1.3. We might have to build out :v1.3 images.

@kohlisid could you please take a look into this? Relates to https://github.com/numaproj/numaflow/issues/2163

juliev0 commented 2 days ago

Reassigned to @kohlisid - thanks!