kubernetes / kubectl

Issue tracker and mirror of kubectl code
Apache License 2.0

kubectl logs --follow shouldn't timeout or exit non-zero otherwise #1373

Closed dkarlovi closed 1 year ago

dkarlovi commented 1 year ago

What happened?

kubectl logs -f pod

With a pod that stays silent for a while, say 10 minutes, the stream eventually gets disconnected.

What did you expect to happen?

The logs are streamed indefinitely.

Alternatively, if a timeout is the cause of the interruption, exit with non-zero status.

How can we reproduce it (as minimally and precisely as possible)?

Run a pod which does something like

```shell
date
sleep 3600
date
```

and watch its logs with logs -f
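Packaged as a manifest, the reproduction might look like the following sketch (the pod name `silent-logger` is made up; `busybox` mirrors the image used in a later comment in this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: silent-logger
spec:
  restartPolicy: Never
  containers:
  - name: main
    image: busybox
    command: ["/bin/sh", "-c", "date; sleep 3600; date"]
```

Apply it and follow the stream with `kubectl logs -f silent-logger`; nothing is written between the two `date` lines, so the connection stays idle for an hour.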

Anything else we need to know?

No response

Kubernetes version

```console
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.6", GitCommit:"ff2c119726cc1f8926fb0585c74b25921e866a28", GitTreeState:"archive", BuildDate:"2023-01-19T00:00:00Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.5", GitCommit:"804d6167111f6858541cef440ccc53887fbbc96a", GitTreeState:"clean", BuildDate:"2022-12-11T00:17:11Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}
```

Cloud provider

Azure AKS

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
dkarlovi commented 1 year ago

/sig cli

ardaguclu commented 1 year ago

/transfer kubectl

mpuckett159 commented 1 year ago

/triage accepted

We would definitely like to implement a non-zero exit code on a timeout/failure, and we will investigate the possibility of ignoring timeouts due to inactivity.

XDRAGON2002 commented 1 year ago

Hey, would this be a good issue for a new contributor? If so, I would like to try and give it a shot.

mpuckett159 commented 1 year ago

So I actually think the root of the error here is that we have an --ignore-errors flag, but it defaults to true all the time. I haven't checked beyond quickly skimming the logs code, but if we look here we see the definition of the flag itself, and I'm pretty sure it defaults to true, meaning the only way to not ignore errors would be passing --ignore-errors=false. This needs more investigation, but it should hopefully be relatively straightforward if you want to give it a shot.

XDRAGON2002 commented 1 year ago

@mpuckett159 sure, I'll give it a try, thanks!

I'll look into changing the default to false instead of true.

/assign

dkarlovi commented 1 year ago

@mpuckett159 do you still think it's worthwhile to research whether it's possible to work around the timeouts?

mpuckett159 commented 1 year ago

Personally, yes, but it will likely be very low on the priority list. Honestly, the specific root cause will be very difficult to diagnose because of the sheer number of "things" that could be generating the disconnection.

dkarlovi commented 1 year ago

@mpuckett159 sounds good. Maybe we can reword this issue to focus on the exit code and open a new one focused on the timeout, which may or may not be fixed in the future.

XDRAGON2002 commented 1 year ago

Hey @mpuckett159, I had a small query: should I open the PR with the fix in this repo or in the main k8s one?

mpuckett159 commented 1 year ago

Yes, all PRs fixing kubectl code should be opened in the main kubernetes/kubernetes repo, with the code changes made in the staging directory. Also be sure to sign the CLA. You can find more details on contributing and your first contribution here.

XDRAGON2002 commented 1 year ago

Thanks @mpuckett159 for letting me know! I've opened a PR with the fix.

mpuckett159 commented 1 year ago

Excellent, I'll take a look.

brianpursley commented 1 year ago

By the way, I don't think --ignore-errors actually defaults to true, unless I am reading this wrong:

```console
$ kubectl logs --help | grep -A1 ignore-errors
    --ignore-errors=false:
	If watching / following pod logs, allow for any errors that occur to be non-fatal
```

I'm pretty sure cobra uses whatever the current value of the variable is as the default, and o.IgnoreLogErrors appears to be an uninitialized bool when this runs:

```go
cmd.Flags().BoolVar(&o.IgnoreLogErrors, "ignore-errors", o.IgnoreLogErrors, "If watching / following pod logs, allow for any errors that occur to be non-fatal")
```
brianpursley commented 1 year ago

@dkarlovi, can you provide more details on this?

I am trying to reproduce and am unable. I am using my own cluster and not Azure AKS, so maybe there is something provider-specific. I am also using a newer client and server, so that could be a factor too.

Here is what I've tried:

```console
$ kubectl run bb --image=busybox --command -- /bin/sh -c "date; sleep 3600; date"
$ kubectl logs -f bb
Fri May 12 23:20:56 UTC 2023
```

... and it runs for over 15 minutes. I finally decided to make it exit, so I ran kubectl edit pod bb and changed the image to busybox:1.35; my kubectl logs command then exited with code 0. It terminated normally because the pod stopped (restarted).


Now, if I force request timeout to something short, I get a timeout error and a nonzero exit code...

```console
$ kubectl logs -f bb --request-timeout=5
Fri May 12 23:37:28 UTC 2023
error: net/http: request canceled (Client.Timeout or context cancellation while reading body)
$ echo $?
1
```

...but I don't think this is what this issue is referring to.

Any additional info would be welcome.

dkarlovi commented 1 year ago

@brianpursley maybe something in a newer release fixed it in the meantime. Let me retry and update this issue; I'll get back to you here on Monday. Thanks!

dkarlovi commented 1 year ago

@brianpursley you're right, I can't reproduce it any longer either: it streamed the empty log correctly for 1h. Closing, thanks for looking into this!

dkarlovi commented 10 months ago

BTW, if anyone ends up finding this issue: kubectl 1.25.0 did have this problem for me, but 1.29.0 did not, everything else being equal. It was fixed at some point between 1.25.0 and 1.29.0; I don't know how.