k8ssandra / k8ssandra-operator

The Kubernetes operator for K8ssandra
https://k8ssandra.io/
Apache License 2.0

Cassandra pod stuck instead of CrashLoopBackOff when Medusa config loading fails #1406

Closed c3-clement closed 1 week ago

c3-clement commented 2 weeks ago

What happened?

I deployed a K8ssandraCluster with 96 replicas and Medusa enabled, and one of the pods never became ready:

k get sts cs-95f5cdf50d-cs-95f5cdf50d-default-sts -n platform
NAME                                      READY   AGE
cs-95f5cdf50d-cs-95f5cdf50d-default-sts   95/96

I identified the faulty pod. It never became ready because of its medusa container: the Medusa gRPC server did not start because load_config() failed (see logs below), and with no gRPC server running, the readiness probe could not succeed.

The medusa container was "blocked" and did not attempt to restart the gRPC server. I restarted the pod manually by deleting it, and the medusa gRPC server started successfully.

Did you expect to see something different?

I expect the pod to restart and end up in CrashLoopBackOff if an uncaught exception is raised by the Medusa Python process, instead of blocking indefinitely.
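
To illustrate the fail-fast behavior I have in mind, here is a minimal sketch, not Medusa's actual entrypoint. `start_server` and `run` are hypothetical names, and `config_ok` only simulates whether config loading succeeds. The idea is that any startup exception becomes a non-zero exit code, so the kubelet restarts the container according to the pod's restartPolicy and repeated failures show up as CrashLoopBackOff.

```python
import asyncio
import logging


async def start_server(config_ok: bool) -> None:
    """Hypothetical stand-in for the gRPC server startup (not Medusa code)."""
    if not config_ok:
        # Simulates load_config() raising during startup.
        raise RuntimeError("config loading failed")


def run(config_ok: bool) -> int:
    """Return a process exit code: 0 on success, 1 if startup raised."""
    try:
        asyncio.run(start_server(config_ok))
    except Exception:
        # Log the traceback, then report failure instead of blocking.
        logging.exception("startup failed; exiting so the container restarts")
        return 1
    return 0
```

A real entrypoint would end with `sys.exit(run(...))`, and any wrapper script around it would need to propagate that exit code rather than keep the container alive.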

I believe this behavior was introduced by the following change: https://github.com/thelastpickle/cassandra-medusa/pull/731

How to reproduce it (as minimally and precisely as possible): Start the medusa container with an invalid configuration

Environment

Medusa logs

MEDUSA_MODE = GRPC
sleeping for 0 sec
Starting Medusa gRPC service
WARNING:root:The CQL_USERNAME environment variable is deprecated and has been replaced by the MEDUSA_CQL_USERNAME variable
WARNING:root:The CQL_PASSWORD environment variable is deprecated and has been replaced by the MEDUSA_CQL_PASSWORD variable
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 424, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 419, in main
    server = Server(config_file_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 53, in __init__
    self.medusa_config = self.create_config()
                         ^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/service/grpc/server.py", line 88, in create_config
    return load_config(args, config_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/config.py", line 315, in load_config
    config = parse_config(args, config_file)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/config.py", line 280, in parse_config
    config.set('storage', 'fqdn', hostname_resolver.resolve_fqdn())
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/network/hostname_resolver.py", line 48, in resolve_fqdn
    hostname = self.compute_k8s_hostname(ip_address_to_resolve)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/medusa/network/hostname_resolver.py", line 56, in compute_k8s_hostname
    fqdns = dns.resolver.resolve(reverse_name, 'PTR')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/dns/resolver.py", line 1565, in resolve
    return get_default_resolver().resolve(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/dns/resolver.py", line 1307, in resolve
    (request, answer) = resolution.next_request()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cassandra/.venv/lib/python3.11/site-packages/dns/resolver.py", line 749, in next_request
    raise NXDOMAIN(qnames=self.qnames_to_try, responses=self.nxdomain_responses)
dns.resolver.NXDOMAIN: The DNS query name does not exist: 92.49.20.172.in-addr.arpa.
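
Since deleting the pod was enough to get past the error, the missing PTR record was probably transient. One possible mitigation, independent of the restart behavior, is to retry the lookup a few times before giving up. The sketch below is only an assumption about how that could look: `resolve_with_retry` and all of its parameters are hypothetical and not part of Medusa or dnspython. The lookup is passed in as a callable, so in Medusa's case it would wrap the `dns.resolver.resolve(reverse_name, 'PTR')` call and retry on `dns.resolver.NXDOMAIN`.

```python
import time
from typing import Callable, Tuple, Type


def resolve_with_retry(
    resolve: Callable[[], str],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call resolve() until it succeeds or the attempts are exhausted.

    Hypothetical helper: waits delay_seconds, doubling after each failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return resolve()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: re-raise so the process fails loudly.
                raise
            sleep(delay_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

If every attempt fails, the last exception is re-raised, so the process still needs to exit on it for the container to be restarted.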


c3-clement commented 1 week ago

Resolved with https://github.com/thelastpickle/cassandra-medusa/issues/805