Unexpected WebSocketConnectionClosedException when streaming long running commands, missing pings?

hterik commented 2 years ago

What happened (please include outputs or screenshots): After exactly 300sec of idle, exec commands streamed with stream loose their connection, even though

pod is still running,
The process that was execd in container is still running (checked with kubectl exec POD_NAME ps aux)
Network otherwise appears to be healthy.
No warnings logged by kube-apiserver

Following stack trace is given

  File "kubernetes/stream/ws_client.py", line 141, in readline_stderr
    return self.readline_channel(STDERR_CHANNEL, timeout=timeout)
  File "kubernetes/stream/ws_client.py", line 104, in readline_channel
    self.update(timeout=(timeout - time.time() + start))
  File "kubernetes/stream/ws_client.py", line 192, in update
    op_code, frame = self.sock.recv_data_frame(True)
  File "websocket/_core.py", line 401, in recv_data_frame
    frame = self.recv_frame()
  File "websocket/_core.py", line 440, in recv_frame
    return self.frame_buffer.recv_frame()
  File "websocket/_abnf.py", line 338, in recv_frame
    self.recv_header()
  File "websocket/_abnf.py", line 294, in recv_header
    header = self.recv_strict(2)
  File "websocket/_abnf.py", line 373, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "websocket/_core.py", line 524, in _recv
    return recv(self.sock, bufsize)
  File "websocket/_socket.py", line 122, in recv
    raise WebSocketConnectionClosedException(
websocket._exceptions.WebSocketConnectionClosedException: Connection to remote host was lost.

What you expected to happen:

Connection is not lost
If it is lost, connection is automatically reconnected (up to some configured timeout limit) Reading https://github.com/websocket-client/websocket-client#long-lived-connection, there is a description on how to use a built in way from the websocket library to do this. It appears that the stream command is not using this method.

How to reproduce it (as minimally and precisely as possible): (Mostly inspired by https://github.com/kubernetes-client/python/blob/master/examples/pod_exec.py)

from kubernetes.stream import stream
...
...
s = stream(
    _client.connect_get_namespaced_pod_exec,
    _pod.metadata.name,
    _pod.metadata.namespace,
    command=["sleep", "100000"],           
    async_req=False,
    stderr=True,
    stdin=False,
    stdout=True,
    tty=False,
    _preload_content=False,
)

def tail_logs():
    if (ret_stdout := s.readline_stdout(timeout=1)) is not None:
        print(ret_stdout)

while (retcode := s.returncode) is None:
    tail_logs(s)
tail_logs(s)

Anything else we need to know?: I've experimented by adding s.sock.ping() in each iteration of the tail_logs function and it appears to improve the situation but some times it still get stuck inside readline_stdout, despite the timeout=1 parameter being set. :raised_eyebrow: It get stuck so long that ping never has chance to run within the next 300sec timeout. Does one need to run ping in a separate thread for this to work? That's how it is done inside the WebSocketApp, as used by the long lived connection example from websocket library. Can one use WebSocketapp together with stream or do i need to implement similar ping-feature myself when using stream?

I'm also curious where 300sec come from, is there a way to find this time through connection handshake or is it a setting in my cluster/azure?

Environment:

Kubernetes version: Azure AKS v1.23.5
Python version 3.9.7
Python client version 23.6.0

brendandburns commented 2 years ago

You need to have TCP keep alives or some small amount of TCP traffic to keep the connection alive.

However, it's also worth noting that exec was never really meant for long-running connections, there are much better ways to achieve the same thing via other types of APIs.

brendandburns commented 2 years ago

More discussion here: https://stackoverflow.com/questions/50706483/what-azure-kubernetes-aks-time-out-happens-to-disconnect-connections-in-out

hterik commented 2 years ago

Thank you. Where does one enable TCP keep alive? This is normally a configuration on the socket from what i know, but such options are not exposed all the way up to the kubernetes_client apis. I can see that the WebSocket constructor takes sockopt argument as input, but the kubernetes.stream.ws_client.create_websocket does not set this argument.

Also curios about which alternative APIs you suggest that are better than exec? (I understand it's better that pods isolate their own workloads, but we have some scenarios where where we have to initiate longer operations via external triggers on already running containers, that take very long to load their dataset and re-initialize otherwise. Building our own application-level API in the container is something we want to avoid, and not really possible since the application is from third-party.) If exec is not supported, it would be good if the limitations using it was explained in docs.

Can add that we managed to get a stable solution by adding pings. We also had to replace the readline_stdout(timeout=1) with update() + peek_stdout() + read_stdout() instead, so exactly as the example in https://github.com/kubernetes-client/python/blob/master/examples/pod_exec.py, plus pings. While this works, it's very intricate and easy to get wrong, an easier solution would be good.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

hterik commented 2 years ago

/remove-lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

hterik commented 1 year ago

/remove-lifecycle rotten

jaredbischof commented 1 year ago

@hterik Are you able to share your solution? I am facing the same issue.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

hterik commented 1 year ago

/remove-lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

hterik commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

hterik commented 9 months ago

/remove-lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 4 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-client/python/issues/1841#issuecomment-2263301528): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue with `/reopen` >- Mark this issue as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close not-planned > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

hterik commented 3 months ago

/reopen

k8s-ci-robot commented 3 months ago

@hterik: Reopened this issue.

In response to [this](https://github.com/kubernetes-client/python/issues/1841#issuecomment-2283397390): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-client/python/issues/1841#issuecomment-2343010564): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue with `/reopen` >- Mark this issue as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close not-planned > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

hterik commented 2 months ago

/reopen /remove-lifecycle rotten

k8s-ci-robot commented 2 months ago

@hterik: Reopened this issue.

In response to [this](https://github.com/kubernetes-client/python/issues/1841#issuecomment-2343070978): >/reopen >/remove-lifecycle rotten Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

kubernetes-client / python

Unexpected WebSocketConnectionClosedException when streaming long running commands, missing pings? #1841