rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

Revert "rabbit_feature_flags: Retry after erpc:call() fails with `noconnection`" #11507

Closed by dumbbell 2 weeks ago

dumbbell commented 2 weeks ago

This reverts commit 8749c605f5d37112529df565201507f1bd4b19ae.

Why

The patch was supposed to solve an issue that we didn't understand and that was likely a network/DNS problem outside of RabbitMQ. We know it didn't solve that issue because it was reported again 6 months after the initial pull request (#8411).

What we are sure of, however, is that it significantly increased the time it takes to test RabbitMQ, because the code loops for 10+ minutes if the remote node is not running.

The Feature flags subsystem was not the right place for the retry either. The `noconnection` error is visible there because that subsystem runs early during RabbitMQ startup. But retrying there won't magically solve a network issue.
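To illustrate why a retry loop is costly when the remote node is simply not running, here is a minimal, language-neutral sketch in Python. It is not the reverted Erlang code: `NoConnection`, `call_with_retry` and `worst_case_wait` are hypothetical names, and the attempt count and delay are placeholders, not the values used in RabbitMQ.

```python
class NoConnection(Exception):
    """Hypothetical stand-in for a `noconnection` failure of a remote call."""


def call_with_retry(remote_call, attempts, delay_s, sleep):
    """Call `remote_call`, retrying on NoConnection up to `attempts` times.

    `sleep` is injected so the waiting can be observed or faked.
    Returns (result, seconds_spent_waiting); re-raises NoConnection
    once all attempts are exhausted.
    """
    waited = 0.0
    for attempt in range(1, attempts + 1):
        try:
            return remote_call(), waited
        except NoConnection:
            if attempt == attempts:
                raise  # give up: the caller still sees the original error
            sleep(delay_s)
            waited += delay_s


def worst_case_wait(attempts, delay_s):
    """Time spent waiting when the remote node never answers."""
    return (attempts - 1) * delay_s
```

If the node is down for good, the caller gets the same error as without the retry, only `worst_case_wait(attempts, delay_s)` seconds later. The retry only helps when the failure is transient, which is the case peer discovery already handles.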

There are two ways to create a cluster:

  1. using peer discovery, in which case that subsystem takes care of retries if necessary and appropriate
  2. manually using the CLI, in which case the user is responsible for starting RabbitMQ nodes and clustering them

Let's revert it until the root cause is really understood.

kjnilsson commented 2 weeks ago

Additional observations:

  1. A clearly wrong command like `rabbitmqctl join_node this-is-not-a-node@argh` takes over a minute to return. This is bad UX.
  2. It makes an already slow test suite, clustering_management_SUITE, take longer than it needs to. Two tests exercise this functionality and each takes over a minute to complete.