For context, we observed this issue in Kueue, where we create a watcher and every 30 minutes (based on the `min-request-timeout` parameter) an error was logged because the API server would close the watch. The error message logged by Kueue would look like:
{"level":"Level(-3)","ts":"2024-03-13T15:12:27.160910077Z","caller":"multikueue/multikueuecluster.go:204","msg":"Watch error","clusterName":"multikueue-test-worker1","watchKind":"jobset.x-k8s.io/v1alpha2, Kind=JobSet","status":"Failure","message":"an error on the server (\"unable to decode an event from the watch stream: context canceled\") has prevented the request from succeeding","reason":"InternalError"}
The API server closes the watch with a 200 response; here is an example log line:
In response to [this](https://github.com/kubernetes/client-go/issues/1340#issuecomment-2025421614):
>/close
However, this is then presented by the watcher as an instance of `StatusInternalServerError`, which is wrong, because there was no "server error". I think I traced it down (static code analysis):
- the `AsObject` function, which delegates to `NewGenericServerResponse`: https://github.com/kubernetes/kubernetes/blob/03ce04584437624840ad78edac1b772e47e78dc2/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L843-L857
- `NewGenericServerResponse` does the final wrapping (because 500 is the default): https://github.com/kubernetes/kubernetes/blob/03ce04584437624840ad78edac1b772e47e78dc2/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L487-L490

Here is the analogous summary in the Kueue project: https://github.com/kubernetes-sigs/kueue/pull/1823#issuecomment-2022454868.
This seems to be covered by the `RetryWatcher`, which retries on `StatusInternalServerError` without leaving a log: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/watch/retrywatcher.go#L211-L214. So using `RetryWatcher` might mitigate it, but emitting `StatusInternalServerError` events for gracefully closed connections is still misleading and problematic for consumers of the vanilla watchers.
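For consumers that cannot switch to `RetryWatcher`, the mitigation it applies can be approximated by hand: inspect the `Status` carried by error events from a vanilla watcher and silently re-establish the watch on a 500 instead of logging it. A minimal stdlib-only sketch of that decision (the `Status` struct stands in for `metav1.Status`, and `shouldSilentlyRetry` is a hypothetical helper, not a client-go API):

```go
package main

import (
	"fmt"
	"net/http"
)

// Status is a stripped-down stand-in for metav1.Status.
type Status struct {
	Code   int
	Reason string
}

// shouldSilentlyRetry mirrors the behavior attributed to RetryWatcher
// above: a Status with code 500 (which, as traced earlier, is what a
// gracefully closed watch gets wrapped into) is retried without emitting
// a log line; anything else is surfaced to the caller.
func shouldSilentlyRetry(s Status) bool {
	return s.Code == http.StatusInternalServerError
}

func main() {
	closedWatch := Status{Code: http.StatusInternalServerError, Reason: "InternalError"}
	gone := Status{Code: http.StatusGone, Reason: "Expired"}
	fmt.Println(shouldSilentlyRetry(closedWatch)) // watch closed by server: retry quietly
	fmt.Println(shouldSilentlyRetry(gone))        // resourceVersion too old: caller must relist
}
```

A 410 Gone is deliberately not retried here, since an expired `resourceVersion` requires a fresh list before watching again.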