kube-rs / kube

Rust Kubernetes client and controller runtime
https://kube.rs
Apache License 2.0
3.06k stars 320 forks source link

`kube::runtime::watcher` exits with `The resourceVersion for the provided watch is too old` #1615

Closed yorik closed 1 month ago

yorik commented 1 month ago

Current and expected behavior

When I'm trying to watch k8s events, watcher exits with error ErrorResponse { status: "Failure", message: "The resourceVersion for the provided watch is too old.", reason: "Expired", code: 410 } instead of re-trying connection.

Looks like this happens only for Event and probably when there is nothing going on in a namespace. The namespace from the example has bunch of deployments, but all of them with 0 replicas.

Possible solution

No response

Additional context

Code to prove the bug:

use anyhow::Result;
use futures::{StreamExt, TryStreamExt};
use k8s_openapi::api::core::v1::Event;
use kube::Client;
use kube::{
    api::Api,
    runtime::{watcher, WatchStreamExt},
};

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .with_target(false)
        .init();

    let client = Client::try_default().await?;
    let namespaces = vec!["prover-blue"];
    let mut tasks = vec![];

    for ns in namespaces {
        let client_clone = client.clone();
        let task = tokio::spawn(async move {
            let events: Api<Event> = Api::namespaced(client_clone, ns);

            let mut stream = watcher(events, Default::default())
                .applied_objects()
                .boxed();

            while let Some(event) = stream.try_next().await.unwrap() {
                println!(
                    "Event in namespace {}: reason: {:?}, message: {:?}",
                    ns, event.reason, event.message
                );
            }

            Ok::<_, anyhow::Error>(())
        });

        tasks.push(task);
    }

    futures::future::try_join_all(tasks).await?;
    Ok(())
}

Log before exits:

2024-10-24T16:13:57.172693Z DEBUG service.ready=true processing request
2024-10-24T16:13:57.172726Z DEBUG service.ready=true message=processing request    
2024-10-24T16:13:57.172914Z DEBUG ++ HTTP; http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client" span=274877906945    
2024-10-24T16:13:57.172934Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: requesting
2024-10-24T16:13:57.172962Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: requesting    
2024-10-24T16:13:57.173036Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: reuse idle connection for ("https", 10.0.0.1)
2024-10-24T16:13:57.173056Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: reuse idle connection for ("https", 10.0.0.1)    
2024-10-24T16:13:57.294799Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client" http.status_code=200}: HTTP; http.status_code=200 span=274877906945    
2024-10-24T16:18:47.297228Z DEBUG pooling idle connection for ("https", 10.0.0.1)
2024-10-24T16:18:47.297350Z DEBUG pooling idle connection for ("https", 10.0.0.1)    
2024-10-24T16:18:47.297887Z DEBUG service.ready=true processing request
2024-10-24T16:18:47.298043Z DEBUG service.ready=true message=processing request    
2024-10-24T16:18:47.298409Z DEBUG ++ HTTP; http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client" span=549755813889    
2024-10-24T16:18:47.298448Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: requesting
2024-10-24T16:18:47.298492Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: requesting    
2024-10-24T16:18:47.298641Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: reuse idle connection for ("https", 10.0.0.1)
2024-10-24T16:18:47.298684Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client"}: reuse idle connection for ("https", 10.0.0.1)    
2024-10-24T16:18:47.459567Z DEBUG HTTP{http.method=GET http.url=https://10.0.0.1/api/v1/namespaces/prover-blue/events?&watch=true&timeoutSeconds=290&allowWatchBookmarks=true&resourceVersion=67611575 otel.name="watch" otel.kind="client" http.status_code=200}: HTTP; http.status_code=200 span=549755813889    
2024-10-24T16:18:47.527182Z DEBUG error watchevent error: ErrorResponse { status: "Failure", message: "The resourceVersion for the provided watch is too old.", reason: "Expired", code: 410 }
2024-10-24T16:18:47.527297Z DEBUG error watchevent error: ErrorResponse { status: "Failure", message: "The resourceVersion for the provided watch is too old.", reason: "Expired", code: 410 }    
thread 'tokio-runtime-worker' panicked at src/main.rs:153:61:
called `Result::unwrap()` on an `Err` value: WatchError(ErrorResponse { status: "Failure", message: "The resourceVersion for the provided watch is too old.", reason: "Expired", code: 410 })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-10-24T16:18:47.528049Z DEBUG pooling idle connection for ("https", 10.0.0.1)
2024-10-24T16:18:47.528115Z DEBUG pooling idle connection for ("https", 10.0.0.1)    
Error: task 14 panicked with message "called `Result::unwrap()` on an `Err` value: WatchError(ErrorResponse { status: \"Failure\", message: \"The resourceVersion for the provided watch is too old.\", reason: \"Expired\", code: 410 })"

Environment

$ kubectl version        
Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.31.1-gke.1000000

GKE Debian sid

Configuration and features

k8s-openapi = { version = "0.23.0", features = ["latest"] }
kube = { version = "0.96.0", features = ["runtime"] }

Affected crates

kube-runtime

Would you like to work on fixing this bug?

maybe

clux commented 1 month ago

Perhaps this is a bit counter-intuitive, but watcher::Error is an error type that we mostly propagate for logging and our own recovery. The errors may occur, but the errors in that enum are generally recoverable.

The problem is your unwrap(). You can run it with a for_each (as e.g. this) and it should recover (the watcher will relist from rv=0).

nightkr commented 1 month ago

Hey, there are really several separate issues going on here:

  1. Panicking on errors: stream.try_next().await.unwrap()

This will panic on the first error received. watcher is designed to retry in the face of errors, but will still propagate them to you as well. However, unwrap() will cause it to panic (and implicitly terminate the watch due to unwinding).

Instead, you could do something like:

            while let Some(event) = stream.try_next().await {
                match event {
                    Ok(event) => println!(
                        "Event in namespace {}: reason: {:?}, message: {:?}",
                        ns, event.reason, event.message
                    ),
                    Err(err) => eprintln!("Error during watch: {err:?}"),
                }
            }
  1. The connection cannot be resumed

When reconnecting, watcher will try to pick up the stream based on the timestamp (resourceVersion) of last message it received (resourceVersion).

However, Kubernetes only keeps the replay cache for so long, and will reject attempts to reconnect if the resourceVersion is too old. This can happen if the newest message is older than the size of the replay cache. kube tries to work around this using BOOKMARK messages (https://docs.rs/kube/latest/kube/runtime/watcher/struct.Config.html#structfield.bookmarks), which requests that the API server should send periodic empty messages with just the latest timestamp during idle periods. It may be worth investigating whether these bookmark messages are not being sent/interpreted for some reason.

yorik commented 1 month ago

I didn't expect that I should just continue stream.try_next(), but it works! Thank you so much for the explanation!

Maybe https://github.com/kube-rs/kube/blob/main/examples/event_watcher.rs#L48 should be updated with the match?

nightkr commented 1 month ago

Updated the example in #1616.

yorik commented 1 month ago

Thank you! Looks good to me.