Open juliev0 opened 2 days ago
I'm currently investigating this. So far I have found that the MonoVertexRollout usually fails with only 3/4 replicas healthy. Below are the `kubectl describe` output for one of the pods and the logs from a crashing pod, to show what the issue may be.
kubectl describe po test-monovertex-rollout-0-mv-0-3ejmc
Name: test-monovertex-rollout-0-mv-0-3ejmc
Namespace: numaplane-system
Priority: 0
Service Account: default
Node: k3d-k3s-default-server-0/172.18.0.3
Start Time: Tue, 29 Oct 2024 11:06:22 -0700
Labels: app.kubernetes.io/component=mono-vertex
app.kubernetes.io/managed-by=mono-vertex-controller
app.kubernetes.io/name=test-monovertex-rollout-0
app.kubernetes.io/part-of=numaflow
numaflow.numaproj.io/mono-vertex-name=test-monovertex-rollout-0
Annotations: kubectl.kubernetes.io/default-container: numa
numaflow.numaproj.io/hash: bf9d80c762c5db5ae97358f7e6c8570ec50e46f3c2c43cd0a7b0072687cd59f1
numaflow.numaproj.io/replica: 0
Status: Running
IP: 10.42.0.188
IPs:
IP: 10.42.0.188
Controlled By: MonoVertex/test-monovertex-rollout-0
Containers:
numa:
Container ID: containerd://0e3f50dfed809e37b7a4cafe7737d132356475d7801c7f4f616e5d8f5f7e857d
Image: quay.io/numaproj/numaflow:v1.3.3
Image ID: quay.io/numaproj/numaflow@sha256:82c46418188694f91e001009f49dadce380a435f8ad444bb024fab813b3e5845
Port: 2469/TCP
Host Port: 0/TCP
Command:
/bin/numaflow-rs
Args:
--monovertex
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 29 Oct 2024 11:06:54 -0700
Finished: Tue, 29 Oct 2024 11:06:54 -0700
Ready: False
Restart Count: 2
Requests:
cpu: 100m
memory: 128Mi
Liveness: http-get https://:2469/livez delay=20s timeout=30s period=60s #success=1 #failure=5
Readiness: http-get https://:2469/readyz delay=5s timeout=30s period=10s #success=1 #failure=6
Environment:
NUMAFLOW_MONO_VERTEX_OBJECT: eyJtZXRhZGF0YSI6eyJuYW1lIjoidGVzdC1tb25vdmVydGV4LXJvbGxvdXQtMCIsIm5hbWVzcGFjZSI6Im51bWFwbGFuZS1zeXN0ZW0iLCJjcmVhdGlvblRpbWVzdGFtcCI6bnVsbH0sInNwZWMiOnsicmVwbGljYXMiOjAsInNvdXJjZSI6eyJ0cmFuc2Zvcm1lciI6eyJjb250YWluZXIiOnsiaW1hZ2UiOiJxdWF5LmlvL251bWFpby9udW1hZmxvdy1ycy9zb3VyY2UtdHJhbnNmb3JtZXItbm93OnN0YWJsZSIsInJlc291cmNlcyI6e319LCJidWlsdGluIjpudWxsfSwidWRzb3VyY2UiOnsiY29udGFpbmVyIjp7ImltYWdlIjoicXVheS5pby9udW1haW8vbnVtYWZsb3ctamF2YS9zb3VyY2Utc2ltcGxlLXNvdXJjZTpzdGFibGUiLCJyZXNvdXJjZXMiOnt9fX19LCJzaW5rIjp7InVkc2luayI6eyJjb250YWluZXIiOnsiaW1hZ2UiOiJxdWF5LmlvL251bWFpby9udW1hZmxvdy1qYXZhL3NpbXBsZS1zaW5rOnN0YWJsZSIsInJlc291cmNlcyI6e319fSwicmV0cnlTdHJhdGVneSI6e319LCJsaW1pdHMiOnsicmVhZEJhdGNoU2l6ZSI6NTAwLCJyZWFkVGltZW91dCI6IjFzIn0sInNjYWxlIjp7fSwidXBkYXRlU3RyYXRlZ3kiOnt9LCJsaWZlY3ljbGUiOnt9fSwic3RhdHVzIjp7InJlcGxpY2FzIjowLCJkZXNpcmVkUmVwbGljYXMiOjAsImxhc3RVcGRhdGVkIjpudWxsLCJsYXN0U2NhbGVkQXQiOm51bGx9fQ==
NUMAFLOW_NAMESPACE: numaplane-system (v1:metadata.namespace)
NUMAFLOW_POD: test-monovertex-rollout-0-mv-0-3ejmc (v1:metadata.name)
NUMAFLOW_REPLICA: (v1:metadata.annotations['numaflow.numaproj.io/replica'])
NUMAFLOW_MONO_VERTEX_NAME: test-monovertex-rollout-0
Mounts:
/var/run/numaflow from var-run-numaflow (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-trpqg (ro)
udsource:
Container ID: containerd://042bff128d7bceb75523c049b6f417f92c5d29dcad427830ae9e20471ad90761
Image: quay.io/numaio/numaflow-java/source-simple-source:stable
Image ID: quay.io/numaio/numaflow-java/source-simple-source@sha256:5f756dd0ec38e2c5cd7fbc41196c7364a71fed18c8eeb4b988bb198bcd72eee6
Port: <none>
Host Port: <none>
State: Running
Started: Tue, 29 Oct 2024 11:06:23 -0700
Ready: True
Restart Count: 0
Liveness: http-get https://:2469/sidecar-livez delay=30s timeout=30s period=60s #success=1 #failure=5
Environment:
NUMAFLOW_UD_CONTAINER_TYPE: udsource
NUMAFLOW_NAMESPACE: numaplane-system (v1:metadata.namespace)
NUMAFLOW_POD: test-monovertex-rollout-0-mv-0-3ejmc (v1:metadata.name)
NUMAFLOW_REPLICA: (v1:metadata.annotations['numaflow.numaproj.io/replica'])
NUMAFLOW_MONO_VERTEX_NAME: test-monovertex-rollout-0
NUMAFLOW_CPU_LIMIT: node allocatable (limits.cpu)
NUMAFLOW_CPU_REQUEST: 0 (requests.cpu)
NUMAFLOW_MEMORY_LIMIT: node allocatable (limits.memory)
NUMAFLOW_MEMORY_REQUEST: 0 (requests.memory)
Mounts:
/var/run/numaflow from var-run-numaflow (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-trpqg (ro)
transformer:
Container ID: containerd://cd11c976d325d7d6300ef68217fcb4d13a1748bce947ef7a7258990e15624531
Image: quay.io/numaio/numaflow-rs/source-transformer-now:stable
Image ID: quay.io/numaio/numaflow-rs/source-transformer-now@sha256:c0396a7390bc171f23bf870422026cde574802323922c7e1da09449d2098b649
Port: <none>
Host Port: <none>
State: Running
Started: Tue, 29 Oct 2024 11:06:23 -0700
Ready: True
Restart Count: 0
Liveness: http-get https://:2469/sidecar-livez delay=30s timeout=30s period=60s #success=1 #failure=5
Environment:
NUMAFLOW_UD_CONTAINER_TYPE: transformer
NUMAFLOW_NAMESPACE: numaplane-system (v1:metadata.namespace)
NUMAFLOW_POD: test-monovertex-rollout-0-mv-0-3ejmc (v1:metadata.name)
NUMAFLOW_REPLICA: (v1:metadata.annotations['numaflow.numaproj.io/replica'])
NUMAFLOW_MONO_VERTEX_NAME: test-monovertex-rollout-0
NUMAFLOW_CPU_LIMIT: node allocatable (limits.cpu)
NUMAFLOW_CPU_REQUEST: 0 (requests.cpu)
NUMAFLOW_MEMORY_LIMIT: node allocatable (limits.memory)
NUMAFLOW_MEMORY_REQUEST: 0 (requests.memory)
Mounts:
/var/run/numaflow from var-run-numaflow (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-trpqg (ro)
udsink:
Container ID: containerd://d488acf47b969cf0fc3b0314fbb661abef792bb606eba1dd999d7cef619b5c12
Image: quay.io/numaio/numaflow-java/simple-sink:stable
Image ID: quay.io/numaio/numaflow-java/simple-sink@sha256:e09b82d8ad753058ccc30784e6c8fad49f5c226b24e1e76abc3bd90ac1aa8c39
Port: <none>
Host Port: <none>
State: Running
Started: Tue, 29 Oct 2024 11:06:23 -0700
Ready: True
Restart Count: 0
Liveness: http-get https://:2469/sidecar-livez delay=30s timeout=30s period=60s #success=1 #failure=5
Environment:
NUMAFLOW_UD_CONTAINER_TYPE: udsink
NUMAFLOW_NAMESPACE: numaplane-system (v1:metadata.namespace)
NUMAFLOW_POD: test-monovertex-rollout-0-mv-0-3ejmc (v1:metadata.name)
NUMAFLOW_REPLICA: (v1:metadata.annotations['numaflow.numaproj.io/replica'])
NUMAFLOW_MONO_VERTEX_NAME: test-monovertex-rollout-0
NUMAFLOW_CPU_LIMIT: node allocatable (limits.cpu)
NUMAFLOW_CPU_REQUEST: 0 (requests.cpu)
NUMAFLOW_MEMORY_LIMIT: node allocatable (limits.memory)
NUMAFLOW_MEMORY_REQUEST: 0 (requests.memory)
Mounts:
/var/run/numaflow from var-run-numaflow (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-trpqg (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
var-run-numaflow:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
kube-api-access-trpqg:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 46s default-scheduler Successfully assigned numaplane-system/test-monovertex-rollout-0-mv-0-3ejmc to k3d-k3s-default-server-0
Normal Pulled 46s kubelet Container image "quay.io/numaio/numaflow-java/source-simple-source:stable" already present on machine
Normal Created 46s kubelet Created container udsource
Normal Started 46s kubelet Started container udsource
Normal Pulled 46s kubelet Container image "quay.io/numaio/numaflow-rs/source-transformer-now:stable" already present on machine
Normal Created 46s kubelet Created container transformer
Normal Started 46s kubelet Started container transformer
Normal Pulled 46s kubelet Container image "quay.io/numaio/numaflow-java/simple-sink:stable" already present on machine
Normal Created 46s kubelet Created container udsink
Normal Started 46s kubelet Started container udsink
Warning Unhealthy 36s kubelet Readiness probe failed: Get "https://10.42.0.188:2469/readyz": dial tcp 10.42.0.188:2469: connect: connection refused
Normal Pulled 15s (x3 over 46s) kubelet Container image "quay.io/numaproj/numaflow:v1.3.3" already present on machine
Normal Created 15s (x3 over 46s) kubelet Created container numa
Normal Started 15s (x3 over 46s) kubelet Started container numa
Warning BackOff 6s (x4 over 35s) kubelet Back-off restarting failed container numa in pod test-monovertex-rollout-0-mv-0-3ejmc_numaplane-system(4a541a90-77c7-4777-8c16-c19dbe36991f)
kubectl logs -f test-monovertex-rollout-0-mv-0-rgdpn
2024-10-29T18:08:57.927762Z INFO monovertex::server_info: Server info file: ServerInfo { protocol: "uds", language: "java", minimum_numaflow_version: "1.3.1-z", version: "0.8.0", metadata: Some({}) }
2024-10-29T18:08:57.927791Z INFO monovertex::server_info: Version_info: VersionInfo { version: "latest+unknown", build_date: "1970-01-01T00:00:00Z", git_commit: "", git_tag: "", git_tree_state: "", go_version: "unknown", compiler: "", platform: "linux/aarch64" }
2024-10-29T18:08:57.928258Z INFO monovertex::server_info: Server info file: ServerInfo { protocol: "uds", language: "java", minimum_numaflow_version: "1.3.1-z", version: "0.8.0", metadata: Some({}) }
2024-10-29T18:08:57.928266Z INFO monovertex::server_info: Version_info: VersionInfo { version: "latest+unknown", build_date: "1970-01-01T00:00:00Z", git_commit: "", git_tag: "", git_tree_state: "", go_version: "unknown", compiler: "", platform: "linux/aarch64" }
2024-10-29T18:08:57.928472Z INFO monovertex::server_info: Server info file: ServerInfo { protocol: "uds", language: "rust", minimum_numaflow_version: "1.3.1-z", version: "0.1.1", metadata: Some({}) }
2024-10-29T18:08:57.928489Z INFO monovertex::server_info: Version_info: VersionInfo { version: "latest+unknown", build_date: "1970-01-01T00:00:00Z", git_commit: "", git_tag: "", git_tree_state: "", go_version: "unknown", compiler: "", platform: "linux/aarch64" }
2024-10-29T18:08:58.073575Z INFO monovertex::metrics: Stopped the Lag-Reader Expose and Builder tasks
2024-10-29T18:08:58.073589Z ERROR monovertex: Application error: GRPCError("status: InvalidArgument, message: \"Handshake request not received\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\"} }")
2024-10-29T18:08:58.073592Z INFO monovertex: Gracefully Exiting...
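To make the versions in the logs above easier to compare against the Numaflow image tag, here is a rough Python sketch that pulls the SDK language, SDK version, and `minimum_numaflow_version` out of the `ServerInfo` lines. The regex is only based on the log format shown above, so it is an assumption that other Numaflow versions print the same layout. The function name `sdk_infos` is made up for illustration.

```python
import re

# Pattern modelled on the "ServerInfo { ... }" lines in the logs above.
# Assumption: field order and quoting match what this Numaflow build prints.
SERVER_INFO = re.compile(
    r'ServerInfo \{ protocol: "(?P<protocol>[^"]*)", '
    r'language: "(?P<language>[^"]*)", '
    r'minimum_numaflow_version: "(?P<min_version>[^"]*)", '
    r'version: "(?P<sdk_version>[^"]*)"'
)


def sdk_infos(log_text):
    """Return one dict per ServerInfo entry found in the given log text."""
    return [match.groupdict() for match in SERVER_INFO.finditer(log_text)]
```

Feeding it the saved output of `kubectl logs` gives one entry per sidecar. It only extracts what the sidecars report; it does not decide whether a given SDK is compatible with the running Numaflow version.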
I believe this could be due to the wrong stable images in quay. Currently the `:stable` tag points to the latest changes of the SDK, which are not compatible with Numaflow 1.3. We might have to build out `:v1.3` images.
@kohlisid could you please take a look into this? Relates to https://github.com/numaproj/numaflow/issues/2163
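Since the suspicion is about `:stable` tags, a quick way to list which images the MonoVertex spec pulls by a floating tag is to decode the `NUMAFLOW_MONO_VERTEX_OBJECT` value from the describe output, which is base64-encoded JSON. The sketch below is Python; the field paths (`spec.source.udsource`, `spec.source.transformer`, `spec.sink.udsink`) are taken from what that value decodes to here, and the helper names are hypothetical. Treating only `stable` and `latest` as floating tags is my own assumption.

```python
import base64
import json


def images_from_encoded_spec(encoded):
    """Decode a base64 JSON MonoVertex object and map container name -> image."""
    obj = json.loads(base64.b64decode(encoded))
    spec = obj.get("spec", {})
    source = spec.get("source") or {}
    sink = spec.get("sink") or {}

    # Assumed layout: each user-defined component has a "container" with an "image".
    candidates = {
        "udsource": source.get("udsource"),
        "transformer": source.get("transformer"),
        "udsink": sink.get("udsink"),
    }
    images = {}
    for name, component in candidates.items():
        container = (component or {}).get("container") or {}
        if "image" in container:
            images[name] = container["image"]
    return images


def floating_tags(images, floating=("stable", "latest")):
    """Keep only the images whose tag is in the assumed set of floating tags."""
    return {
        name: image
        for name, image in images.items()
        if image.rsplit(":", 1)[-1] in floating
    }
```

Passing the environment value through `images_from_encoded_spec` and then `floating_tags` shows which sidecars would pick up new SDK changes without a spec change.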
Reassigned to @kohlisid - thanks!
Ran the e2e test (with PPND=true, though I'm not sure it matters) and saw the MonoVertex Pod in a crash loop.