kubernetes / client-go

Go client for Kubernetes.
Apache License 2.0

Watcher emits StatusInternalServerError when API server closes a watch gracefully with 200 #1340

Closed · mimowo closed this issue 6 months ago

mimowo commented 6 months ago

For context, we observed this issue in Kueue, where we create a watcher and, every 30 minutes (based on the min-request-timeout parameter), an error was logged because the API server would close the watch. The error message logged by Kueue looked like this:

{"level":"Level(-3)","ts":"2024-03-13T15:12:27.160910077Z","caller":"multikueue/multikueuecluster.go:204","msg":"Watch error","clusterName":"multikueue-test-worker1","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","status":"Failure","message":"an error on the server (\"unable to decode an event from the watch stream: context canceled\") has prevented the request from succeeding","reason":"InternalError"}

The API server closes the watch with a 200 response; here is an example log line:

1 httplog.go:132] "HTTP" verb="WATCH" URI="/apis/jobset.x-k8s.io/v1alpha2/jobsets?labelSelector=kueue.x-k8s.io%2Fmultikueue-origin%3Dmultikueue&watch=true" latency="34.525929114s" userAgent="manager/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="55d611cc-e3b5-4643-b5b1-92182703a262" srcIP="192.168.8.1:36440" apf_pl="workload-low" apf_fs="service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_init_latency="320.89µs" apf_execution_time="323.26µs" resp=200

However, the watcher then presents this as an instance of StatusInternalServerError, which is wrong, because there was no server error.
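
For illustration, here is a minimal sketch of what a consumer of a vanilla watcher sees when this happens. It assumes an in-cluster config and a Pods watch purely for brevity (Kueue watches JobSets, but the behavior is the same):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A plain (vanilla) watch, no RetryWatcher around it.
	w, err := clientset.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		if event.Type == watch.Error {
			// When the API server closes the watch after min-request-timeout,
			// the Error event carries a *metav1.Status with Code=500 and
			// Reason=InternalError even though the HTTP response was 200.
			if status, ok := event.Object.(*metav1.Status); ok {
				fmt.Printf("watch error: code=%d reason=%s message=%s\n",
					status.Code, status.Reason, status.Message)
			}
			continue
		}
		fmt.Printf("event: %v\n", event.Type)
	}
}
```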

I think I traced it down (static code analysis):

  1. the client creates a new watcher with InternalError (500) as the default: https://github.com/kubernetes/kubernetes/blob/03ce04584437624840ad78edac1b772e47e78dc2/staging/src/k8s.io/client-go/rest/request.go#L789-L791
  2. the watcher wraps the closed connection with the error that we see in the logs: https://github.com/kubernetes/kubernetes/blob/03ce04584437624840ad78edac1b772e47e78dc2/staging/src/k8s.io/apimachinery/pkg/watch/streamwatcher.go#L120
  3. this is further wrapped by the AsObject function which delegates to NewGenericServerResponse: https://github.com/kubernetes/kubernetes/blob/03ce04584437624840ad78edac1b772e47e78dc2/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L843-L857
  4. NewGenericServerResponse does the final wrapping (because 500 is the default; see the sketch after this list): https://github.com/kubernetes/kubernetes/blob/03ce04584437624840ad78edac1b772e47e78dc2/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L487-L490
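
To make step 4 concrete, here is a small sketch (not taken from the Kueue code; it just calls the apimachinery helper directly) showing how the watcher's default 500 code plus the decode error produce the status seen in the Kueue log:

```go
package main

import (
	"fmt"
	"net/http"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// The watcher is constructed with http.StatusInternalServerError as its
	// default code (step 1), so the decode error ends up wrapped like this.
	statusErr := apierrors.NewGenericServerResponse(
		http.StatusInternalServerError, // default code from rest/request.go
		"GET",                          // verb
		schema.GroupResource{Group: "jobset.x-k8s.io", Resource: "jobsets"},
		"", // name
		"unable to decode an event from the watch stream: context canceled", // server message
		0,    // retryAfterSeconds
		true, // isUnexpectedResponse
	)

	status := statusErr.Status()
	fmt.Printf("reason=%s\ncode=%d\nmessage=%s\n", status.Reason, status.Code, status.Message)
	// Prints reason=InternalError and the "an error on the server (...) has
	// prevented the request from succeeding" message from the Kueue log above.
}
```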

Here is the analogous summary in the Kueue project: https://github.com/kubernetes-sigs/kueue/pull/1823#issuecomment-2022454868.

This seems to be covered by the RetryWatcher, which retries the StatusInternalServerError without leaving a log: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/watch/retrywatcher.go#L211-L214. So using RetryWatcher might mitigate it, but emitting StatusInternalServerError events for gracefully closed connections is still misleading and problematic for consumers of vanilla watchers.
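
For example, a RetryWatcher-based mitigation could look roughly like this (a sketch only; the namespace, resource, and the initial resourceVersion "1" are placeholders, and in practice the resourceVersion would come from a preceding List call):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// cache.ListWatch implements the cache.Watcher interface RetryWatcher expects.
	lw := &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return clientset.CoreV1().Pods("default").Watch(context.Background(), options)
		},
	}

	// "1" is a placeholder initial resourceVersion for this sketch.
	rw, err := watchtools.NewRetryWatcher("1", lw)
	if err != nil {
		panic(err)
	}
	defer rw.Stop()

	for event := range rw.ResultChan() {
		// InternalError status events from gracefully closed connections are
		// handled internally by RetryWatcher (the watch is re-established),
		// so they do not reach this loop.
		fmt.Printf("event: %v %T\n", event.Type, event.Object)
	}
}
```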

mimowo commented 6 months ago

Duplicate of https://github.com/kubernetes/kubernetes/issues/124098; I wasn't sure where to open the issue.

alculquicondor commented 6 months ago

/close

k8s-ci-robot commented 6 months ago

@alculquicondor: Closing this issue.

In response to [this](https://github.com/kubernetes/client-go/issues/1340#issuecomment-2025421614):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.