nolar / kopf

A Python framework to write Kubernetes operators in just a few lines of code
https://kopf.readthedocs.io/

peering depends on the 'Credentials retriever' task, but on shutdown that task is stopped sooner #1033

Open asteven opened 1 year ago

asteven commented 1 year ago

Long story short

When using a ConnectionInfo with an expiration, which is required to work with expiring tokens, the operator does not properly leave the peering and does not exit.

It hangs because updating/leaving the peering needs valid credentials, but the vault ends up in a need_reauth state with no 'Credentials retriever' task left to populate it with new credentials.
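The pattern can be reproduced with a plain-asyncio sketch. This is an illustration only, not kopf's actual internals; all names below are made up. A task that must do cleanup work waits for a refresh that only another task can provide, but that other task has already been cancelled during shutdown:

import asyncio


async def credentials_retriever(refreshed: asyncio.Event) -> None:
    # Stand-in for kopf's 'Credentials retriever': in kopf, it would re-run
    # the login handlers and re-populate the vault with fresh credentials.
    while True:
        await asyncio.sleep(10)
        refreshed.set()


async def peering_keepalive(refreshed: asyncio.Event) -> None:
    # Stand-in for the 'peering keep-alive' task: on exit it must call the
    # Kubernetes API to leave the peering, which needs valid credentials.
    try:
        await asyncio.sleep(3600)
    finally:
        # The credentials have expired; wait for a refresh before leaving.
        # If the retriever is already cancelled, this wait never ends.
        await refreshed.wait()


async def shutdown() -> None:
    refreshed = asyncio.Event()
    retriever = asyncio.create_task(credentials_retriever(refreshed))
    keepalive = asyncio.create_task(peering_keepalive(refreshed))
    await asyncio.sleep(0.1)

    retriever.cancel()  # cancelled first, as in the logs below
    keepalive.cancel()  # its cleanup now waits for credentials forever
    done, pending = await asyncio.wait({retriever, keepalive}, timeout=2)
    print(f"tasks left: {pending}")  # the keep-alive task never finishes


asyncio.run(shutdown())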

Kopf version

main

Kubernetes version

any

Python version

3.10.10

Code

import datetime
from typing import Any, Optional, Sequence

import kopf
import kubernetes_asyncio


@kopf.on.login()
async def authenticate(
        *,
        logger: kopf.Logger,
        **_: Any,
) -> Optional[kopf.ConnectionInfo]:

    try:
        kubernetes_asyncio.config.load_incluster_config()  # cluster env vars
        logger.debug("Async client is configured in cluster with service account.")
    except kubernetes_asyncio.config.ConfigException:
        try:
            await kubernetes_asyncio.config.load_kube_config()  # developer's config files
            logger.debug("Async client is configured via kubeconfig file.")
        except kubernetes_asyncio.config.ConfigException as e:
            raise kopf.LoginError("Cannot authenticate the async client library "
                                  "neither in-cluster, nor via kubeconfig.") from e

    # We do not even try to understand how it works and why. Just load it, and extract the results.
    # For kubernetes client >= 12.0.0 use the new 'get_default_copy' method
    if callable(getattr(kubernetes_asyncio.client.Configuration, 'get_default_copy', None)):
        config = kubernetes_asyncio.client.Configuration.get_default_copy()
    else:
        config = kubernetes_asyncio.client.Configuration()

    # For auth-providers, this method is monkey-patched with the auth-provider's one.
    # We need the actual auth-provider's token, so we call it instead of accessing api_key.
    # Other keys (token, tokenFile) also end up being retrieved via this method.
    header: Optional[str] = config.get_api_key_with_prefix('BearerToken')
    parts: Sequence[str] = header.split(' ', 1) if header else []
    scheme, token = ((None, None) if len(parts) == 0 else
                     (None, parts[0]) if len(parts) == 1 else
                     (parts[0], parts[1]))  # RFC-7235, Appendix C.

    #expiration = datetime.datetime.utcnow() + datetime.timedelta(minutes=1)
    expiration = datetime.datetime.utcnow() + datetime.timedelta(seconds=10)
    #expiration = None
    return kopf.ConnectionInfo(
        server=config.host,
        ca_path=config.ssl_ca_cert,  # can be a temporary file
        insecure=not config.verify_ssl,
        username=config.username or None,  # an empty string when not defined
        password=config.password or None,  # an empty string when not defined
        scheme=scheme,
        token=token,
        certificate_path=config.cert_file,  # can be a temporary file
        private_key_path=config.key_file,  # can be a temporary file
        priority=1,
        expiration=expiration
    )

Logs

^C[2023-06-27 21:58:49,358] kopf._core.reactor.r [INFO    ] Signal SIGINT is received. Operator is stopping.
[2023-06-27 21:58:49,358] kopf._core.reactor.r [DEBUG   ] Admission mutating configuration manager is cancelled.
[2023-06-27 21:58:49,359] kopf._core.reactor.r [DEBUG   ] Admission insights chain is cancelled.
[2023-06-27 21:58:49,359] kopf._core.reactor.r [DEBUG   ] Namespace observer is cancelled.
[2023-06-27 21:58:49,359] kopf._core.reactor.r [DEBUG   ] Credentials retriever is cancelled.
[2023-06-27 21:58:49,359] kopf._core.reactor.r [DEBUG   ] Admission webhook server is cancelled.
[2023-06-27 21:58:49,359] kopf._core.reactor.r [DEBUG   ] Admission validating configuration manager is cancelled.
[2023-06-27 21:58:49,360] kopf._core.reactor.r [DEBUG   ] Poster of events is cancelled.
[2023-06-27 21:58:49,361] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for customresourcedefinitions.v1.apiextensions.k8s.io cluster-wide.
[2023-06-27 21:58:49,361] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for clusterkopfpeerings.v1.kopf.dev cluster-wide.
[2023-06-27 21:58:49,363] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for netcenterips.v1alpha1.netcenter.hpc.ethz.ch cluster-wide.
[2023-06-27 21:58:49,363] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for services.v1 cluster-wide.
[2023-06-27 21:58:49,363] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for ingresses.v1.networking.k8s.io cluster-wide.
[2023-06-27 21:58:49,363] kopf._core.reactor.r [DEBUG   ] Daemon killer is cancelled.
[2023-06-27 21:58:49,363] kopf._core.reactor.r [DEBUG   ] Resource observer is cancelled.
[2023-06-27 21:58:59,370] kopf._core.reactor.o [DEBUG   ] Streaming tasks are not stopped: finishing normally; tasks left: {<Task pending name='peering keep-alive for default@None' coro=<guard() running at ./kopf/kopf/_cogs/aiokits/aiotasks.py:108> wait_for=<Future pending cb=[shield.<locals>._outer_done_callback() at /usr/lib/python3.10/asyncio/tasks.py:864, Task.task_wakeup()]>>}
[2023-06-27 21:59:09,379] kopf._core.reactor.o [DEBUG   ] Streaming tasks are not stopped: finishing normally; tasks left: {<Task pending name='peering keep-alive for default@None' coro=<guard() running at ./kopf/kopf/_cogs/aiokits/aiotasks.py:108> wait_for=<Future pending cb=[shield.<locals>._outer_done_callback() at /usr/lib/python3.10/asyncio/tasks.py:864, Task.task_wakeup()]>>}
[2023-06-27 21:59:19,386] kopf._core.reactor.o [DEBUG   ] Streaming tasks are not stopped: finishing normally; tasks left: {<Task pending name='peering keep-alive for default@None' coro=<guard() running at ./kopf/kopf/_cogs/aiokits/aiotasks.py:108> wait_for=<Future pending cb=[shield.<locals>._outer_done_callback() at /usr/lib/python3.10/asyncio/tasks.py:864, Task.task_wakeup()]>>}
... for a long time until finally
./run: line 8: 2849365 Killed                  kopf run --all-namespaces $@ ./handlers.py

Additional information

This only happens with peering enabled and when using a ConnectionInfo with an expiration set.
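A possible workaround until the shutdown ordering is fixed, assuming it is acceptable to run without peering (since peering is what triggers the hang), is to run the operator in standalone mode, either with 'kopf run --standalone ...' or via the operator settings:

import kopf


@kopf.on.startup()
async def configure(settings: kopf.OperatorSettings, **_: object) -> None:
    # Standalone mode: no peering object is managed, so no 'peering keep-alive'
    # task exists that would need fresh credentials during shutdown.
    settings.peering.standalone = True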

asteven commented 1 year ago

Fixed by considering dependencies when shutting down tasks.

https://github.com/nolar/kopf/compare/main...asteven:kopf:cleaner_shutdown
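For illustration only (a simplified sketch, not the code in the branch above): a dependency-aware shutdown can cancel and await dependent tasks first, and only afterwards the tasks they rely on, e.g. the peering keep-alive before the 'Credentials retriever':

import asyncio
from typing import Iterable, Set


async def stop_in_order(groups: Iterable[Set[asyncio.Task]]) -> None:
    # Cancel and fully await one group before touching the next one, e.g.
    # ({peering_keepalive_task}, {credentials_retriever_task}): the keep-alive
    # can still re-authenticate while it updates/leaves the peering.
    for group in groups:
        for task in group:
            task.cancel()
        await asyncio.gather(*group, return_exceptions=True)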