Open DiscordJim opened 3 months ago
The fix is to run multiple replicas of your coordinator node or, if you are running a separate overlord node, multiple replicas of that instead.
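For illustration, the replica change described above might look like the fragment below. This is only a sketch: it assumes the Stackable `DruidCluster` resource layout, where each role defines `roleGroups` with a `replicas` field. The role group name `default` is an assumption, and every other required field of the manifest is left out.

```yaml
# Fragment of a DruidCluster manifest; not a complete resource.
# "default" is an assumed role group name; use the one your cluster defines.
spec:
  coordinators:
    roleGroups:
      default:
        replicas: 2  # more than one coordinator replica, as suggested above
```

Whether two replicas are enough depends on your availability requirements; the point is only that the coordinator role should not run as a single replica.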
I'd like to reopen this issue, if that's okay with you, as we should either document this or have the operator validate the configuration and warn about this scenario.
Sure, can you give me an example of one of these warnings? I would not mind opening a PR.
Affected Stackable version
24.3
Affected Apache Druid version
28.0.1
Current and expected behavior
After roughly 3-4 days, the router displays "No Management Proxy Node." Testing suggests that the router cannot connect to the coordinator. However, all services show healthy logs, and there are no clear errors or error codes in the panel.
The absence of any errors is what makes this difficult to debug.
Possible solution
The only way we have to recover from this state is to restart all services.
Additional context
'["druid-kafka-indexing-service", "druid-datasketches", "prometheus-emitter", "druid-basic-security", "druid-opa-authorizer", "postgresql-metadata-storage", "druid-hdfs-storage", "druid-stats"]'
Environment
AKS
Would you like to work on fixing this bug?
None