nolar / kopf

A Python framework to write Kubernetes operators in just a few lines of code
https://kopf.readthedocs.io/
MIT License
2.09k stars 158 forks source link

Watchers on Custom Resources throw RuntimeError("Session is closed") and permanently die #980

Open jake-delorme opened 1 year ago

jake-delorme commented 1 year ago

Long story short

Operator starts up and runs normally as expected. After for running for some time some of the watch streams will throw a RuntimeError("Session is closed"). Once this happens that watch stream will never restart until the operator is restarted. This only appears to happen with custom resources (configmaps are fine).

Kopf version

1.35.6

Kubernetes version

v1.23.13 eks

Python version

3.9.13

Code

import kopf
import kubernetes

@kopf.on.login()
def login_fn(**kwargs):
    return kopf.login_via_client(**kwargs)

@kopf.on.startup()
def operator_configuration(settings: kopf.OperatorSettings, **kwargs):
    settings.watching.server_timeout = 300
    settings.watching.client_timeout = 300
    settings.peering.name = "operator"
    settings.peering.mandatory = True
    settings.peering.stealth = True

@kopf.on.create("MyCustomResource")
@kopf.on.update("MyCustomResource")
def mycustomresource(**_):
    print("Im Alive!")

Logs

[01:55:09,761] asyncio              [DEBUG   ] Getting address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32> took 14.625ms: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('1.1.1.1', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2.2.2.2', 443))]
[01:55:09,842] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b06945e0> starts SSL handshake
[01:55:10,002] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b06945e0>: SSL handshake took 159.6 ms
[01:55:10,003] asyncio              [DEBUG   ] <asyncio.TransportSocket fd=8, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.10.10.10', 41046), raddr=('1.1.1.1', 443)> connected to 1.1.1.1:443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7f13b072b6a0>, <aiohttp.client_proto.ResponseHandler object at 0x7f13b079ae80>)
[01:55:36,674] kopf._core.engines.a [INFO    ] Re-authentication has been initiated.
[01:55:36,674] kopf.activities.auth [DEBUG   ] Activity 'login_fn' is invoked.
[01:55:36,751] asyncio              [DEBUG   ] <_SelectorSocketTransport fd=6 read=polling write=<idle, bufsize=0>> received EOF
[01:55:36,752] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b06f4820> received EOF
[01:55:36,754] kopf._cogs.clients.w [ERROR   ] Request attempt #1/9 failed; will retry: GET https://mycluster.eks.amazonaws.com/apis/kopf.dev/v1/namespaces/mynamespace/kopfpeerings?watch=true&resourceVersion=13763208&timeoutSeconds=300 -> ServerDisconnectedError('Server disconnected')
[01:55:37,484] kopf.activities.auth [DEBUG   ] Client is configured via kubeconfig file.
[01:55:37,485] kopf.activities.auth [INFO    ] Activity 'login_fn' succeeded.
[01:55:37,485] kopf._core.engines.a [INFO    ] Re-authentication has finished.
[01:55:37,489] asyncio              [DEBUG   ] Get address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32>
[01:55:37,490] asyncio              [DEBUG   ] Getting address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32> took 0.790ms: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('1.1.1.1', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2.2.2.2', 443))]
[01:55:37,571] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c57f0> starts SSL handshake
[01:55:37,730] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c57f0>: SSL handshake took 159.4 ms
[01:55:37,731] asyncio              [DEBUG   ] <asyncio.TransportSocket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.10.10.10', 44038), raddr=('1.1.1.1', 443)> connected to 1.1.1.1:443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7f13b078d460>, <aiohttp.client_proto.ResponseHandler object at 0x7f13b059a640>)
[01:55:37,756] kopf._cogs.clients.w [DEBUG   ] Request attempt #2/9: GET https://mycluster.eks.amazonaws.com/apis/kopf.dev/v1/namespaces/mynamespace/kopfpeerings?watch=true&resourceVersion=13763208&timeoutSeconds=300
[01:55:37,756] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for kopfpeerings.v1.kopf.dev in 'mynamespace'.
[01:55:37,759] kopf._core.reactor.o [ERROR   ] Peering observer for operator@mynamespace has failed: Session is closed
Traceback (most recent call last):
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/venv/lib/python3.9/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 82, in infinite_watch
    async for raw_event in stream:
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 186, in continuous_watch
    async for raw_input in stream:
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 251, in watch_objs
    async for raw_input in api.stream(
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/api.py", line 200, in stream
    response = await request(
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/api.py", line 78, in request
    response = await context.session.request(
  File "/venv/lib/python3.9/site-packages/aiohttp/client.py", line 400, in _request
    raise RuntimeError("Session is closed")
RuntimeError: Session is closed
[01:56:04,118] asyncio              [DEBUG   ] Get address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32>
[01:56:04,119] asyncio              [DEBUG   ] Getting address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32> took 0.731ms: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('1.1.1.1', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2.2.2.2', 443))]
[01:56:04,200] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c58b0> starts SSL handshake
[01:56:04,359] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c58b0>: SSL handshake took 158.7 ms
[01:56:04,360] asyncio              [DEBUG   ] <asyncio.TransportSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.10.10.10', 47442), raddr=('1.1.1.1', 443)> connected to 1.1.1.1:443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7f13b07622e0>, <aiohttp.client_proto.ResponseHandler object at 0x7f13b06a14c0>)
[01:56:56,485] asyncio              [DEBUG   ] Get address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32>
[01:56:56,489] asyncio              [DEBUG   ] Getting address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32> took 3.814ms: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('1.1.1.1', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2.2.2.2', 443))]
[01:56:56,570] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c5f10> starts SSL handshake
[01:56:56,730] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c5f10>: SSL handshake took 159.4 ms
[01:56:56,730] asyncio              [DEBUG   ] <asyncio.TransportSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.10.10.10', 53874), raddr=('1.1.1.1', 443)> connected to 1.1.1.1:443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7f13b06a1b20>, <aiohttp.client_proto.ResponseHandler object at 0x7f13b07624c0>)

Additional information

This is an EKS cluster and is using aws eks get-token in the kubeconfig to authenticate.

Using aiohttp version 3.8.3 using kubernetes client version: 24.2.0 aws-cli/2.2.43 Python/3.8.8 (used by kubeconfig)

Not all operators watch streams die at the same time. This is not running in a container on the cluster but on a server outside of aws.

alexnagy commented 1 year ago

Noticing the same behaviour on EKS with Python 3.11.2, aiohttp 3.8.4, kopf 1.36.0. In my case I am monitoring delete events for Job resources.

What seems to happen is: request fails because of token expiry -> goes to sleep because of backoff -> while sleeping the session gets closed by vault when invalidating previous credentials -> upon waking up and retrying the request it fails with RuntimeError: Session is closed because it still references the closed session.

Spacca commented 1 year ago

We are also noticing the same issues but in AKS.

Anybody found a workaround or solution for this problem? Maybe a way to include this in the probes?

python version 3.11.3 kopf version 1.36.1

nolar commented 1 year ago

What seems to happen is: request fails because of token expiry -> goes to sleep because of backoff -> while sleeping the session gets closed by vault when invalidating previous credentials -> upon waking up and retrying the request it fails with RuntimeError: Session is closed because it still references the closed session.

Thanks for this hint! Indeed, this is a possible scenario in some (rare?) cases of API failing several concurrent requests.

Can you please try a fix from https://github.com/nolar/kopf/pull/1031 — the branch is session-closed-in-reauth.

Depending on your packaging system, that can be installed like this (pip or requirements.txt; different for Poetry):

pip install git+https://github.com/nolar/kopf.git@session-closed-in-reauth

That would be Kopf 1.36.1 plus this fix only.

Spacca commented 1 year ago

Hi @nolar

I think the patch doesn't work, I have introduced a couple of probes to check whether the watcher is alive and it seems to always fail.

This happens to us regularly, maybe depending on cluster setup.

In particular i think it all started when we configured: automountServiceAccountToken=false

And started mounting the token using volumen projection with a duration of 3600s.

Seems like kopf doesn't handle the refresh well.

below my logs:

[2023-06-13 11:08:41,259] aiohttp.access       [INFO    ] 10.251.132.1 [13/Jun/2023:11:08:41 +0000] "GET /healthz HTTP/1.1" 200 229 "-" "kube-probe/1.26"
[2023-06-13 11:09:41,216] kopf.activities.prob [INFO    ] Retrieving all key for probe
[2023-06-13 11:09:41,238] kopf.activities.prob [INFO    ] Activity 'key' succeeded.
[2023-06-13 11:09:41,238] kopf.activities.prob [INFO    ] Retrieving all trust for probe
[2023-06-13 11:09:41,262] kopf.activities.prob [INFO    ] Activity 'trust' succeeded.
[2023-06-13 11:09:41,262] aiohttp.access       [INFO    ] 10.251.132.1 [13/Jun/2023:11:09:41 +0000] "GET /healthz HTTP/1.1" 200 229 "-" "kube-probe/1.26"
[2023-06-13 11:10:21,045] kopf._core.engines.a [INFO    ] Re-authentication has been initiated.
[2023-06-13 11:10:21,046] kopf._cogs.clients.w [ERROR   ] Request attempt #1/9 failed; will retry: GET https://10.251.0.1:443/apis/com.acme/v1/trusts?watch=true&resourceVersion=1773126&timeoutSeconds=60 -> ClientOSError(1, '[SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2702)')
[2023-06-13 11:10:21,046] kopf._cogs.clients.w [ERROR   ] Request attempt #1/9 failed; will retry: GET https://10.251.0.1:443/apis/com.acme/v1/keys?watch=true&resourceVersion=1773126&timeoutSeconds=60 -> ClientOSError(1, '[SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2702)')
[2023-06-13 11:10:21,048] kopf.activities.auth [INFO    ] Activity 'login_via_client' succeeded.
[2023-06-13 11:10:21,048] kopf._core.engines.a [INFO    ] Re-authentication has finished.
[2023-06-13 11:10:22,048] kopf._cogs.clients.w [ERROR   ] Request attempt #2/9 failed; will try re-authenticating: GET https://10.251.0.1:443/apis/com.acme/v1/trusts?watch=true&resourceVersion=1773126&timeoutSeconds=60
[2023-06-13 11:10:22,048] kopf._cogs.clients.w [ERROR   ] Request attempt #2/9 failed; will try re-authenticating: GET https://10.251.0.1:443/apis/com.acme/v1/keys?watch=true&resourceVersion=1773126&timeoutSeconds=60
[2023-06-13 11:10:22,049] kopf._core.engines.a [INFO    ] Re-authentication has been initiated.
[2023-06-13 11:10:22,050] kopf.activities.auth [INFO    ] Activity 'login_via_client' succeeded.
[2023-06-13 11:10:22,050] kopf._core.engines.a [INFO    ] Re-authentication has finished.
[2023-06-13 11:10:22,051] kopf._core.reactor.o [ERROR   ] Watcher for keys.v1.com.acme@none has failed: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 78, in request
    response = await context.session.request(
  File "/opt/venv/lib/python3.10/site-packages/aiohttp/client.py", line 400, in _request
    raise RuntimeError("Session is closed")
RuntimeError: Session is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 94, in request
    raise errors.APISessionClosed("Session is closed.") from e
kopf._cogs.clients.errors.APISessionClosed: Session is closed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
    async for raw_input in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
    async for raw_input in api.stream(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 210, in stream
    response = await request(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 47, in wrapper
    await vault.invalidate(key, exc=e)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 291, in invalidate
    raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
[2023-06-13 11:10:22,053] kopf._core.reactor.o [ERROR   ] Watcher for trusts.v1.com.acme@none has failed: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 78, in request
    response = await context.session.request(
  File "/opt/venv/lib/python3.10/site-packages/aiohttp/client.py", line 400, in _request
    raise RuntimeError("Session is closed")
RuntimeError: Session is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 94, in request
    raise errors.APISessionClosed("Session is closed.") from e
kopf._cogs.clients.errors.APISessionClosed: Session is closed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
    async for raw_input in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
    async for raw_input in api.stream(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 210, in stream
    response = await request(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 47, in wrapper
    await vault.invalidate(key, exc=e)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 291, in invalidate
    raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
[2023-06-13 11:10:41,216] kopf.activities.prob [INFO    ] Retrieving all key for probe
[2023-06-13 11:10:41,216] kopf.activities.prob [ERROR   ] Activity 'key' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/src/ctl/kopf_operator.py", line 166, in get_keys
    await check_tasks("keys")
  File "/src/ctl/kopf_operator.py", line 187, in check_tasks
    raise GenericException(f'Missing watcher for {resource_type} resource type')
utils.exceptions.GenericException: Missing watcher for keys resource type
[2023-06-13 11:10:41,217] kopf.activities.prob [INFO    ] Retrieving all trust for probe
[2023-06-13 11:10:41,217] kopf.activities.prob [ERROR   ] Activity 'trust' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/src/ctl/kopf_operator.py", line 175, in get_trusts
    await check_tasks("trusts")
  File "/src/ctl/kopf_operator.py", line 187, in check_tasks
    raise GenericException(f'Missing watcher for {resource_type} resource type')
utils.exceptions.GenericException: Missing watcher for trusts resource type
[2023-06-13 11:11:41,218] kopf.activities.prob [INFO    ] Retrieving all key for probe
[2023-06-13 11:11:41,218] kopf.activities.prob [ERROR   ] Activity 'key' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/src/ctl/kopf_operator.py", line 166, in get_keys
    await check_tasks("keys")
  File "/src/ctl/kopf_operator.py", line 187, in check_tasks
    raise GenericException(f'Missing watcher for {resource_type} resource type')
utils.exceptions.GenericException: Missing watcher for keys resource type
[2023-06-13 11:11:41,218] kopf.activities.prob [INFO    ] Retrieving all trust for probe
[2023-06-13 11:11:41,218] kopf.activities.prob [ERROR   ] Activity 'trust' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/src/ctl/kopf_operator.py", line 175, in get_trusts
    await check_tasks("trusts")
  File "/src/ctl/kopf_operator.py", line 187, in check_tasks
    raise GenericException(f'Missing watcher for {resource_type} resource type')
utils.exceptions.GenericException: Missing watcher for trusts resource type
[2023-06-13 11:11:51,842] kopf._core.reactor.r [ERROR   ] Resource observer has failed: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/observation.py", line 127, in resource_observer
    await queueing.watcher(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
    async for raw_input in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
    async for raw_input in api.stream(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 210, in stream
    response = await request(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 43, in wrapper
    async for key, info, context in vault.extended(APIContext, 'contexts'):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 158, in extended
    async for key, item in self._items():
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 195, in _items
    yielded_key, yielded_item = self.select()
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 214, in select
    raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
Traceback (most recent call last):
  File "/opt/venv/bin/kopf", line 8, in <module>
    sys.exit(main())
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/kopf/cli.py", line 60, in wrapper
    return fn(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/click/decorators.py", line 84, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/kopf/cli.py", line 109, in run
    return running.run(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/running.py", line 81, in run
    asyncio.run(coro)
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/running.py", line 138, in operator
    await run_tasks(operator_tasks, ignored=existing_tasks)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/running.py", line 419, in run_tasks
    await aiotasks.reraise(root_done | root_cancelled | hung_done | hung_cancelled)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 238, in reraise
    task.result()  # can raise the regular (non-cancellation) exceptions.
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/observation.py", line 127, in resource_observer
    await queueing.watcher(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
    async for raw_input in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
    async for raw_input in api.stream(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 210, in stream
    response = await request(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 43, in wrapper
    async for key, info, context in vault.extended(APIContext, 'contexts'):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 158, in extended
    async for key, item in self._items():
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 195, in _items
    yielded_key, yielded_item = self.select()
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 214, in select
    raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/

These are my dependencies:

typing_extensions == 4.5.0
requests == 2.28.2
click == 8.1.3
jsonpickle == 3.0.1
pyinstaller == 5.8.0
git+https://github.com/nolar/kopf.git@session-closed-in-reauth
kubernetes == 26.1.0
pyOpenSSL == 23.0.0
pyjks == 20.0.0
pytest == 7.2.1
pytest-mock == 3.10.0
requests-mock == 1.10.0
importlib_resources == 5.12.0
dmitrytretyakov commented 1 month ago

Faced it as well, it's a big problem for us

@Spacca did you find some workaround or just live with that?

jake-delorme commented 4 weeks ago

We ended up loading the re-authentication hook with a very rudimentary check to force the operator to restart. For us this was every 10 mins. Mileage may vary.

@kopf.on.login()
async def update_credentials(logger, memo: kopf.Memo, **_):
    # This is a dirty hack to try and get the operator to restart when resource observers die
    count = 0
    intrestings_types = ("peering observer", "watcher for", "peering keep-alive", "resource observer", "pause-waiter for")
    for task in asyncio.all_tasks():
        task_type = " ".join(task.get_name().split()[:2])
        if task_type in intrestings_types:
            count += 1
    if memo.get("task_count", 0) <= count:
        logger.debug(f"Watching for {count} intresting tasks. previously {memo.get('task_count')}")
        memo["task_count"] = count
    else:
        logger.error(f"An intresting task went away (were {memo.get('task_count')} now {count}). Exiting the operator")
        exit(1)