kr8s-org / kr8s

A batteries-included Python client library for Kubernetes that feels familiar for folks who already know how to use kubectl
https://kr8s.org
BSD 3-Clause "New" or "Revised" License

410 response while waiting for jobs #476

Closed florianvazelle closed 3 weeks ago

florianvazelle commented 1 month ago

Which project are you reporting a bug for?

kr8s

What happened?

I'm experiencing an issue when waiting for jobs with kr8s, like this:

import logging

import box
import kr8s
from kr8s.objects import Job

logger = logging.getLogger(__name__)

def wait(label_selector, namespace):
    try:
        # Retrieve the jobs matching the label selector
        jobs = kr8s.get(Job.kind, label_selector=label_selector, namespace=namespace)
        # Wait for each job to reach a terminal condition
        for job in jobs:
            job.wait(["condition=Complete", "condition=Failed"])
        # Check job completion
        for job in jobs:
            if job.status.conditions[0].type == "Complete" and job.status.conditions[0].status == "True":
                logger.info(f"Job {job.name} completed")
    except box.exceptions.BoxValueError as e:
        # job.raw is logged here; in the failing case it holds the 410 Status payload
        logger.error(f"Failed to wait job completion {job.raw}: {repr(e)}")

Sometimes the code crashes and logs:

Failed to wait job completion {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 546533173 (546535534)', 'reason': 'Expired', 'code': 410}: BoxValueError('Cannot extrapolate Box from string')

I can easily avoid the crash, but I wonder if it's expected to have a 410 response here.

I think it comes from the async_watch method, which replaces the raw attribute with this error response dict:

{'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 546533173 (546535534)', 'reason': 'Expired', 'code': 410}
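As a stopgap on the caller side, the error payload quoted above can be recognised before it is treated as a Job. The following is only a minimal sketch: `is_expired_status` is a hypothetical helper, not part of kr8s, and it only inspects the fields visible in the logged dict.

```python
from typing import Any, Mapping


def is_expired_status(raw: Mapping[str, Any]) -> bool:
    """Return True if ``raw`` looks like a Kubernetes 410 failure Status object.

    Hypothetical helper (not a kr8s API): it checks only the fields that
    appear in the error payload from the report.
    """
    return (
        raw.get("kind") == "Status"
        and raw.get("status") == "Failure"
        and raw.get("code") == 410
    )


# The payload from the report, and an ordinary Job-shaped dict for contrast.
expired_payload = {
    "kind": "Status",
    "apiVersion": "v1",
    "metadata": {},
    "status": "Failure",
    "message": "too old resource version: 546533173 (546535534)",
    "reason": "Expired",
    "code": 410,
}
job_like = {"kind": "Job", "apiVersion": "batch/v1", "metadata": {"name": "example"}}

print(is_expired_status(expired_payload))
print(is_expired_status(job_like))
```

A check like this could guard the access to `job.status.conditions` in the snippet above, though it works around the symptom and does not restart the watch.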

I don't know whether the resourceVersion sent when we call async_watch is correctly updated for the next call.

Anything else?

I use kr8s 0.15.0

Some references from the Kubernetes documentation:

jacobtomlinson commented 1 month ago

Thanks for raising this. It looks like we should be catching the 410 and restarting the watch with the latest resourceVersion.

When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a new get or list operation, and starting the watch from the resourceVersion that was returned.

https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
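The recovery described in that passage can be sketched independently of any client library. This is an illustration, not how kr8s implements it: `watch_with_relist`, `list_fn`, `watch_fn`, and `max_restarts` are hypothetical names. `list_fn` is assumed to return the current objects plus the list's resourceVersion, and `watch_fn` is assumed to yield watch events as dicts with an `object` key.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Event = Dict[str, Any]
# Hypothetical callables standing in for a real client's list and watch calls.
ListFn = Callable[[], Tuple[List[Dict[str, Any]], str]]
WatchFn = Callable[[str], Iterable[Event]]


def watch_with_relist(
    list_fn: ListFn, watch_fn: WatchFn, max_restarts: int = 3
) -> Iterator[Event]:
    """Yield watch events, re-listing and restarting the watch on 410 Gone."""
    restarts = 0
    while True:
        # A fresh list gives a resourceVersion the server still remembers.
        # A real client would also rebuild its local cache from _items here.
        _items, resource_version = list_fn()
        expired = False
        for event in watch_fn(resource_version):
            obj = event.get("object", {})
            if obj.get("kind") == "Status" and obj.get("code") == 410:
                # The resourceVersion is too old: stop this watch and re-list.
                expired = True
                break
            yield event
        if not expired:
            # The watch ended normally, so there is nothing to restart.
            return
        restarts += 1
        if restarts > max_restarts:
            raise RuntimeError("watch expired too many times")
```

The restart cap is a design choice for the sketch, so that a server which keeps answering 410 cannot cause an endless loop. A real client might back off and retry instead.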