vmware-tanzu-labs / educates-training-platform

A platform for hosting interactive workshop environments in Kubernetes, or on top of a local container runtime.
https://docs.educates.dev
Apache License 2.0
72 stars 18 forks

DNS not ready causes operator crash loop backoff. #359

Closed GrahamDumpleton closed 4 months ago

GrahamDumpleton commented 4 months ago

Describe the bug

When the session manager operator starts up on a freshly started node, cluster DNS resolution may not be ready yet. The DNS lookup performed while setting up configuration can then fail, causing the pod to fail. If the DNS server takes a little while to become available, this can result in the operator going into crash loop backoff.

Traceback (most recent call last):
  File "/opt/app-root/src/main.py", line 16, in <module>
    from handlers import workshopenvironment
  File "/opt/app-root/src/handlers/workshopenvironment.py", line 10, in <module>
    from .objects import create_from_dict, Workshop, SecretCopier
  File "/opt/app-root/src/handlers/objects.py", line 4, in <module>
    from .operator_config import OPERATOR_API_GROUP
  File "/opt/app-root/src/handlers/operator_config.py", line 44, in <module>
    CLUSTER_DOMAIN = socket.getaddrinfo("kubernetes.default.svc", 0, flags=socket.AI_CANONNAME)[0][3]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/socket.py", line 963, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution

It would be better if the startup code checked that DNS resolution is working before continuing, and only gave up and failed the pod if DNS is still not ready after 60 seconds.
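A minimal sketch of what such a check could look like, not the actual fix applied in the project. It retries the same `socket.getaddrinfo` lookup shown in the traceback until it succeeds or 60 seconds have passed. The function name `wait_for_canonical_name`, the one-second retry interval, and the injectable `resolver`, `sleep` and `clock` parameters are assumptions made for illustration.

```python
import socket
import time


def wait_for_canonical_name(
    host="kubernetes.default.svc",
    timeout=60.0,      # give up after this many seconds, as suggested in the issue
    interval=1.0,      # assumed pause between attempts
    resolver=socket.getaddrinfo,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Return the canonical name for host, retrying while DNS is unavailable."""
    deadline = clock() + timeout
    while True:
        try:
            # Same lookup as in operator_config.py: element [0][3] is the
            # canonical name of the first result.
            return resolver(host, 0, flags=socket.AI_CANONNAME)[0][3]
        except socket.gaierror:
            # Re-raise the last error once the deadline has passed so the
            # pod still fails if DNS never becomes ready.
            if clock() >= deadline:
                raise
            sleep(interval)
```

With something along these lines, the module-level assignment could call the helper instead of `socket.getaddrinfo` directly, so a short DNS outage at startup delays the operator rather than crashing it.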

Additional information

No response