Open DecathectZero opened 3 weeks ago
Hi,
Looking at the error, this seems to be an issue with kubectl port-forward (at the API level):
2024-11-06T18:42:14.387598Z WARN mirrord_kube::api::kubernetes::portforwarder: finished retry strategy, closing connection, connect_info: AgentKubernetesConnectInfo { pod_name: "mirrord-agent-wk5eyjv6hc-vllzt", agent_port: 44128, namespace: Some("dev-tyler"), agent_version: Some("3.122.1") }
at mirrord/kube/src/api/kubernetes/portforwarder.rs:172
We know that kubectl port-forward tends to be flaky in some cases, but it's largely out of our control (we even do retries, as seen in the logs, to handle it).
You can check whether your kube-apiserver restarted or logged any errors, but other than that I don't think this is something we can solve from our side.
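For illustration, here's a rough sketch (not mirrord's actual code) of the kind of retry loop the portforwarder applies, written against the kube crate's port-forward API (requires its "ws" feature). The names `MAX_RETRIES` and `connect_with_retries` are invented for the sketch, and it assumes tokio, anyhow, and tracing as dependencies:

```rust
// Illustrative only: a simplified retry loop around the kube crate's
// port-forward API, roughly the shape of what mirrord's portforwarder does.
use std::time::Duration;

use k8s_openapi::api::core::v1::Pod;
use kube::{api::Api, Client};
use tokio::io::{AsyncRead, AsyncWrite};

const MAX_RETRIES: u32 = 5;

async fn connect_with_retries(
    client: Client,
    namespace: &str,
    pod_name: &str,
    agent_port: u16,
) -> anyhow::Result<impl AsyncRead + AsyncWrite + Unpin> {
    let pods: Api<Pod> = Api::namespaced(client, namespace);

    let mut attempt = 0;
    loop {
        match pods.portforward(pod_name, &[agent_port]).await {
            Ok(mut forwarder) => {
                if let Some(stream) = forwarder.take_stream(agent_port) {
                    return Ok(stream);
                }
            }
            Err(error) => {
                tracing::warn!(%error, attempt, "port-forward attempt failed");
            }
        }

        attempt += 1;
        if attempt >= MAX_RETRIES {
            // Corresponds to the "finished retry strategy, closing connection" warning above.
            anyhow::bail!("finished retry strategy, closing connection");
        }
        // Back off a bit before the next attempt.
        tokio::time::sleep(Duration::from_secs(1 << attempt)).await;
    }
}
```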
P.S. - The operator doesn't use kubectl port-forward (it implements its own API), and we haven't seen reports of this issue from users of it - so perhaps that can be a solution.
It seems like the agent just dies when this happens. Retried again, but with mirrord 3.124.0. Same thing in mirrord-intproxy:
2024-11-06T19:27:15.175003Z WARN mirrord_kube::api::kubernetes::portforwarder: finished retry strategy, closing connection, connect_info: AgentKubernetesConnectInfo { pod_name: "mirrord-agent-nxavbqhb4z-6strh", agent_port: 32170, namespace: Some("dev-tyler"), agent_version: Some("3.122.1") }
at mirrord/kube/src/api/kubernetes/portforwarder.rs:172
2024-11-06T19:27:15.190863Z ERROR mirrord_intproxy::agent_conn: failed to receive message from the agent, inner task down
at mirrord/intproxy/src/agent_conn.rs:226
2024-11-06T19:27:15.191048Z ERROR mirrord_intproxy: task AGENT_CONNECTION failed: agent unexpectedly closed connection
at mirrord/intproxy/src/lib.rs:243
2024-11-06T19:27:15.193633Z ERROR mirrord::internal_proxy: Internal proxy encountered an error, exiting, error: Main internal proxy logic failed: agent unexpectedly closed connection
at mirrord/cli/src/internal_proxy.rs:146
The agent dying when the connection ends (because the port-forward died) makes sense - that's the desired logic. If the agent crashed, I would've expected it to send a "bye" message. If you'd like to make sure it's not the agent dying for some reason, I'd tail the agent's logs and see if you get any error besides "no connections alive, shutting down".
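If it helps, here's a hedged sketch of doing that check programmatically with the kube crate (plain kubectl logs works just as well). The pod and namespace names are copied from the log excerpt above, and `check_agent_exit` is an invented helper name for the example:

```rust
// Hedged sketch: pull the tail of the agent pod's logs and flag anything at
// ERROR level. The expected graceful message ("no connections alive, shutting
// down") is not an error, so any ERROR line suggests the agent actually crashed.
use k8s_openapi::api::core::v1::Pod;
use kube::{
    api::{Api, LogParams},
    Client,
};

async fn check_agent_exit(client: Client) -> anyhow::Result<()> {
    // Pod/namespace taken from the log excerpt above; adjust for your run.
    let pods: Api<Pod> = Api::namespaced(client, "dev-tyler");

    let params = LogParams {
        tail_lines: Some(200),
        // Use `previous: true` instead if the agent pod has already restarted.
        ..LogParams::default()
    };
    let logs = pods.logs("mirrord-agent-nxavbqhb4z-6strh", &params).await?;

    for line in logs.lines().filter(|line| line.contains("ERROR")) {
        println!("unexpected agent error: {line}");
    }

    Ok(())
}
```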
Yup, after investigating more, port-forward is definitely the culprit. Is there any good way around this at all? https://github.com/kubernetes/kubectl/issues/1525#issuecomment-2196452123
We can look into the operator, since it's not using port-forward.
Reading the issue you mentioned, I don't fully understand what's going on, since the kube crate we use already speaks WebSocket - but maybe Kubernetes translates WebSocket -> SPDY inside the apiserver? In any case it looks like a longer-term thing (we'd need the kube crate to support the new connection type if a change is needed), and it might already be using it, tbh.
This is still happening quite frequently on our end.
Before, on Kubernetes 1.27, it wasn't noticeable, but after upgrading to Kubernetes 1.30 it happens quite often and has been disruptive.
I've enabled logs for the Kubernetes apiserver via AWS EKS, but nothing significant has appeared there. I can send them over to you via a private channel if you want.
Is there anything in particular that you are looking for in the kube-apiserver logs?
We found out that the user was running the Calico CNI with eBPF - need to retest.
Bug Description
After a properly established connection, sometimes I notice the following error pop up out of nowhere (full logs attached as /tmp/mirrod.logs).
Steps to Reproduce
N/A - hard to reproduce, but it happens randomly.
Backtrace
No response
mirrord layer logs
No response
mirrord intproxy logs
No response
mirrord agent logs
No response
mirrord config
No response
mirrord CLI version
mirrord 3.122.1
mirrord-agent version
ghcr.io/metalbear-co/mirrord:3.122.1
mirrord-operator version (if relevant)
No response
plugin kind and version (if relevant)
No response
Your operating system and version
macOS Sonoma 14.4.1
Local process
nodeJS
Local process version
nodeJS v20.11.1
Additional Info
No response