No Management Proxy Node: Coordinator randomly goes down

DiscordJim commented 3 months ago

Affected Stackable version

24.3

Affected Apache Druid version

28.0.1

Current and expected behavior

After roughly 3-4 days of uptime, the router displays "No Management Proxy Node." From our testing, the cause appears to be that the router can no longer connect to the coordinator. However, the logs of all services look healthy, and the panel shows no clear errors or error codes.

This is difficult to debug precisely because nothing reports an error.
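
Since the logs are clean, one way to narrow this down is to probe the processes over HTTP while the error is showing. The following is a minimal sketch, not something taken from our cluster: it assumes Druid's standard `/status/health` and `/druid/coordinator/v1/leader` endpoints and the router's explicit `/proxy/coordinator` management proxy prefix. The base URLs passed in are placeholders, and TLS and authentication (we run druid-basic-security) are left out.

```python
import urllib.request
from typing import Callable, Dict


def probe_urls(router: str, coordinator: str) -> Dict[str, str]:
    """Build the URLs worth checking when the console shows the error.

    `router` and `coordinator` are base URLs such as "http://host:port";
    the values you pass are deployment-specific (assumption).
    """
    return {
        # Is the coordinator process itself up?
        "coordinator_health": f"{coordinator}/status/health",
        # Which coordinator does the cluster currently consider the leader?
        "coordinator_leader": f"{coordinator}/druid/coordinator/v1/leader",
        # Can the router reach the coordinator through its management proxy?
        "router_proxy_to_coordinator": f"{router}/proxy/coordinator/status",
    }


def http_get(url: str) -> str:
    """Plain unauthenticated GET; add auth headers or TLS settings as needed."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def run_probes(
    urls: Dict[str, str], fetch: Callable[[str], str] = http_get
) -> Dict[str, str]:
    """Fetch every URL and record either a short response prefix or the error."""
    results = {}
    for name, url in urls.items():
        try:
            results[name] = "ok: " + fetch(url)[:80]
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            results[name] = f"failed: {exc}"
    return results
```

Calling `run_probes(probe_urls(router_url, coordinator_url))` with your own service addresses returns one entry per probe. If the two direct coordinator probes succeed while the router proxy probe fails, that would be consistent with the router having lost track of the coordinator leader rather than the coordinator being down.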

Possible solution

The only way we have found to recover from this state is to restart all services.

Additional context

Extensions: '["druid-kafka-indexing-service", "druid-datasketches", "prometheus-emitter", "druid-basic-security", "druid-opa-authorizer", "postgresql-metadata-storage", "druid-hdfs-storage", "druid-stats"]'
Deep Storage: HDFS
Metadata Store: Postgres

Environment

AKS

Would you like to work on fixing this bug?

None

DiscordJim commented 3 months ago

The fix is to run multiple replicas of your coordinator node or, if you are using an overlord node, multiple replicas of that instead.

lfrancke commented 1 month ago

I'd like to reopen this issue, if that's okay with you, as we should either document this or have the operator validate and warn about this scenario.
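
To make the "validate and warn" idea concrete, here is a rough sketch of the kind of check meant. It is purely illustrative and not the operator's actual code: the function name, the role names, and the wording of the warning are all hypothetical, and it only encodes the workaround reported in this thread (more than one replica for the coordinator or overlord).

```python
from typing import Dict, List

# Roles for which this thread reports that a single replica can end in
# "No Management Proxy Node". The role names are hypothetical placeholders.
ROLES_NEEDING_REDUNDANCY = ("coordinator", "overlord")


def replica_warnings(replicas_by_role: Dict[str, int]) -> List[str]:
    """Return one warning per listed role that is configured with fewer than two replicas.

    Roles that are absent from the mapping are skipped, since not every
    deployment runs a separate overlord.
    """
    warnings = []
    for role in ROLES_NEEDING_REDUNDANCY:
        count = replicas_by_role.get(role)
        if count is not None and count < 2:
            warnings.append(
                f"role '{role}' has {count} replica(s); a single replica has "
                "been reported to lead to 'No Management Proxy Node' on the "
                "router, consider running at least 2"
            )
    return warnings
```

For example, `replica_warnings({"coordinator": 1, "router": 1})` yields a single warning for the coordinator, while a cluster with two coordinator replicas yields none. Whether this should be a log line, a Kubernetes event, or a status condition is left open here.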

DiscordJim commented 1 month ago

Sure, can you give me an example of one of these warnings? I would not mind opening a PR.

stackabletech / druid-operator