metalbear-co / mirrord

Connect your local process and your cloud environment, and run local code in cloud conditions.
https://mirrord.dev
MIT License

Proxy randomly closes during connection #2901

Open · DecathectZero opened this issue 3 weeks ago

DecathectZero commented 3 weeks ago

Bug Description

After a connection has been properly established, I sometimes notice the following error pop up out of nowhere:

Proxy error, connectivity issue or a bug.
                    Please report it to us on https://github.com/metalbear-co/mirrord/issues/new?assignees=&labels=bug&projects=&template=bug_report.yml
                    You can find the `mirrord-intproxy` logs in /tmp/mirrord.log.
                    connection closed

/tmp/mirrord.log:

  2024-11-06T18:35:27.624023Z  INFO mirrord_intproxy::layer_initializer: new session, process_info: ProcessInfo { pid: 73153, name: "node", cmdline: ["/Users/tyler/.volta/tools/image/node/20.11.1/bin/node", "--enable-source-maps", "--inspect", "/Users/tyler/voiceflow/platform/apps/runtime/build/main"], loaded: true }
    at mirrord/intproxy/src/layer_initializer.rs:63

  2024-11-06T18:42:14.387598Z  WARN mirrord_kube::api::kubernetes::portforwarder: finished retry strategy, closing connection, connect_info: AgentKubernetesConnectInfo { pod_name: "mirrord-agent-wk5eyjv6hc-vllzt", agent_port: 44128, namespace: Some("dev-tyler"), agent_version: Some("3.122.1") }
    at mirrord/kube/src/api/kubernetes/portforwarder.rs:172

  2024-11-06T18:42:14.395494Z ERROR mirrord_intproxy::agent_conn: failed to receive message from the agent, inner task down
    at mirrord/intproxy/src/agent_conn.rs:226

  2024-11-06T18:42:14.396023Z ERROR mirrord_intproxy: task AGENT_CONNECTION failed: agent unexpectedly closed connection
    at mirrord/intproxy/src/lib.rs:243

  2024-11-06T18:42:14.396986Z ERROR mirrord::internal_proxy: Internal proxy encountered an error, exiting, error: Main internal proxy logic failed: agent unexpectedly closed connection
    at mirrord/cli/src/internal_proxy.rs:146

Steps to Reproduce

N/A; it's hard to reproduce and happens randomly.

Backtrace

No response

mirrord layer logs

No response

mirrord intproxy logs

No response

mirrord agent logs

No response

mirrord config

No response

mirrord CLI version

mirrord 3.122.1

mirrord-agent version

ghcr.io/metalbear-co/mirrord:3.122.1

mirrord-operator version (if relevant)

No response

plugin kind and version (if relevant)

No response

Your operating system and version

macOS Sonoma 14.4.1

Local process

Node.js

Local process version

Node.js v20.11.1

Additional Info

No response

linear[bot] commented 3 weeks ago

MBE-510 Proxy randomly closes during connection

aviramha commented 3 weeks ago

Hi, looking at the error, this seems to be an issue with kubectl port-forward (at the API level):

  2024-11-06T18:42:14.387598Z  WARN mirrord_kube::api::kubernetes::portforwarder: finished retry strategy, closing connection, connect_info: AgentKubernetesConnectInfo { pod_name: "mirrord-agent-wk5eyjv6hc-vllzt", agent_port: 44128, namespace: Some("dev-tyler"), agent_version: Some("3.122.1") }
    at mirrord/kube/src/api/kubernetes/portforwarder.rs:172

We know that kubectl port-forward tends to be flaky in some cases, but it's largely out of our control (as you can see in the logs, we even retry to handle it).

You can check whether your kube-apiserver restarted or logged any errors, but other than that I don't think it's something we can solve from our side.
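
As a rough sketch of that check (assuming a kubeadm-style, self-managed control plane where the apiserver runs as static pods; on a managed cluster you'd look at the provider's control-plane logs instead):

```sh
# Check whether the apiserver pods have restarted recently (kubeadm-style clusters)
kubectl get pods -n kube-system -l component=kube-apiserver

# Recent cluster-wide events can also hint at control-plane hiccups
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
```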

P.S. - The operator doesn't use kubectl port-forward (it implements its own API), and we haven't seen reports of this issue from operator users, so perhaps that could be a solution.
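
For reference, a minimal sketch of a mirrord config that switches to the operator (assuming the operator is installed in the cluster; the `target` values below are illustrative, with the namespace taken from the logs above):

```json
{
  "operator": true,
  "target": {
    "path": "deployment/runtime",
    "namespace": "dev-tyler"
  }
}
```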

DecathectZero commented 3 weeks ago

It seems like the agent just dies when this happens. I retried with mirrord 3.124.0.

Same thing in mirrord-intproxy:

  2024-11-06T19:27:15.175003Z  WARN mirrord_kube::api::kubernetes::portforwarder: finished retry strategy, closing connection, connect_info: AgentKubernetesConnectInfo { pod_name: "mirrord-agent-nxavbqhb4z-6strh", agent_port: 32170, namespace: Some("dev-tyler"), agent_version: Some("3.122.1") }
    at mirrord/kube/src/api/kubernetes/portforwarder.rs:172

  2024-11-06T19:27:15.190863Z ERROR mirrord_intproxy::agent_conn: failed to receive message from the agent, inner task down
    at mirrord/intproxy/src/agent_conn.rs:226

  2024-11-06T19:27:15.191048Z ERROR mirrord_intproxy: task AGENT_CONNECTION failed: agent unexpectedly closed connection
    at mirrord/intproxy/src/lib.rs:243

  2024-11-06T19:27:15.193633Z ERROR mirrord::internal_proxy: Internal proxy encountered an error, exiting, error: Main internal proxy logic failed: agent unexpectedly closed connection
    at mirrord/cli/src/internal_proxy.rs:146

aviramha commented 3 weeks ago

The agent dying when the connection ends (because the port forward died) makes sense; that's the desired logic. If the agent had crashed, I would've expected it to send a "bye" message. If you'd like to make sure the agent isn't dying for some other reason, I'd tail the agent's logs and check for any error besides "no connections alive, shutting down".
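
For example, tailing the agent's logs (pod name and namespace taken from the intproxy log above) could look like:

```sh
# Follow the mirrord agent's logs and watch for anything other than the
# normal "no connections alive, shutting down" message
kubectl logs -f -n dev-tyler mirrord-agent-wk5eyjv6hc-vllzt
```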

DecathectZero commented 3 weeks ago

Yup, after investigating more, port forward is definitely the culprit. Is there any good way around this at all? https://github.com/kubernetes/kubectl/issues/1525#issuecomment-2196452123

We can look into the operator, since it's not using port-forward

aviramha commented 3 weeks ago

> Yup, after investigating more, port forward is definitely the culprit. Is there any good way around this at all? kubernetes/kubectl#1525 (comment)
>
> We can look into the operator, since it's not using port-forward

Reading the issue you mentioned, I don't really understand what's going on, since the kube crate we use already uses WebSocket. Maybe k8s does WebSocket -> SPDY in the webserver? In any case it looks like a longer-term thing (we'd need the kube crate to support the new connection type if a change is needed), and it might be using it already, to be honest.

DecathectZero commented 2 weeks ago

This is still happening quite frequently on our end.

Before, on Kubernetes 1.27, it was not noticeable, but after upgrading to Kubernetes 1.30 it's happening quite often and has been disruptive.

I've enabled logs for the Kubernetes apiserver via AWS EKS, but nothing significant has appeared there. I can send them over to you via a private channel if you want. Is there anything in particular that you are looking for in the kube-apiserver logs?
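
For reference, API server logging on EKS is typically enabled with something along these lines (cluster name is a placeholder); the logs then show up in CloudWatch under `/aws/eks/<cluster-name>/cluster`:

```sh
# Enable control-plane API server (and audit) logging for an EKS cluster
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit"],"enabled":true}]}'
```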

aviramha commented 4 hours ago

We found out that the user was using the Calico CNI with eBPF; we need to retest.