rabbitmq / discussions

Please use the RabbitMQ mailing list for questions. Issues that are questions, discussions, or lack the details necessary to investigate them are moved to this repository.

Unable to forget_cluster_node with RabbitMQ 3.8.16 on Erlang 23.3.1 {:no_exists, :rabbit_node_maintenance_states} #183

Closed: unipolar closed this issue 2 years ago

unipolar commented 3 years ago

In our test scenarios we resize a RabbitMQ cluster from 1 to 3 nodes and back to 1. The cluster is downsized with rabbitmqctl stop_app/reset/forget_cluster_node. Rarely, with RabbitMQ 3.8.16 on Erlang 23.3.1, we end up in a strange inconsistent state that does not allow nodes to be forgotten. The cluster looks quite normal before stop_app/reset is executed on the 2 RabbitMQ nodes being removed.
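For reference, the downsize sequence is roughly the following (the exact node targeting in our automation may differ slightly; the node name below is just one of the two being removed):

# on each of the two nodes being removed
rabbitmqctl stop_app
rabbitmqctl reset

# then, on the remaining node
rabbitmqctl forget_cluster_node rabbit@management.1bfcae44be1a4192.nodes.svc.vstoragedomain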

# # all commands were executed on the working node rabbit@management.179230f3680546ff.nodes.svc.vstoragedomain

# rabbitmqctl cluster_status  
Cluster status of node rabbit@management.179230f3680546ff.nodes.svc.vstoragedomain ...
Basics

Cluster name: hci

Disk Nodes

rabbit@management.179230f3680546ff.nodes.svc.vstoragedomain
rabbit@management.1bfcae44be1a4192.nodes.svc.vstoragedomain
rabbit@management.2e95ed3152724758.nodes.svc.vstoragedomain

Running Nodes

rabbit@management.179230f3680546ff.nodes.svc.vstoragedomain

Versions

rabbit@management.179230f3680546ff.nodes.svc.vstoragedomain: RabbitMQ 3.8.16 on Erlang 23.3.1

Maintenance status

Node: rabbit@management.179230f3680546ff.nodes.svc.vstoragedomain, status: not under maintenance

Alarms

(none)

Network Partitions

(none)

Listeners

Node: rabbit@management.179230f3680546ff.nodes.svc.vstoragedomain, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@management.179230f3680546ff.nodes.svc.vstoragedomain, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@management.179230f3680546ff.nodes.svc.vstoragedomain, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@management.179230f3680546ff.nodes.svc.vstoragedomain, interface: 10.10.116.113, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0

Feature flags

Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: maintenance_mode_status, state: enabled
Flag: quorum_queue, state: enabled
Flag: user_limits, state: enabled
Flag: virtual_host_metadata, state: enabled

# rabbitmqctl forget_cluster_node rabbit@management.1bfcae44be1a4192.nodes.svc.vstoragedomain                                                                     

Removing node rabbit@management.1bfcae44be1a4192.nodes.svc.vstoragedomain from the cluster
Error:
{:failed_to_remove_node, :"rabbit@management.1bfcae44be1a4192.nodes.svc.vstoragedomain", {:no_exists, :rabbit_node_maintenance_states}}

We haven't seen this with RabbitMQ 3.7.14. Is this a known issue?

johanrhodin commented 2 years ago

We (CloudAMQP) have seen the same thing (RabbitMQ 3.8.21, Erlang 24.0.4). I did a bit of spelunking and it looked like it had something to do with how the feature flags are communicated between nodes, but I couldn't reproduce it.

NB: Discussions have moved to: https://github.com/rabbitmq/rabbitmq-server/discussions/

johanrhodin commented 2 years ago

A root cause was reported on the email list: https://groups.google.com/g/rabbitmq-users/c/ICy8b8UzOr0/m/ylI6EaVVCQAJ

TheGOro commented 2 years ago

I would like to know the actual rationale or consideration behind the current implementation of the maintenance status table storage, particularly in a cluster environment. As of now, the table is held only on a single node as a RAM copy, so it is not replicated and its transactions are not written to disk. Records are lost when that node restarts, and the whole table vanishes when Mnesia is reset on the node that holds the RAM copy. All of this makes the maintenance feature fragile, and you have to introduce additional (undocumented) measures to ensure that the functionality remains operational over a long period of time.
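This is easy to confirm on a running cluster; the following are plain Mnesia calls issued through rabbitmqctl eval (the exact output shape of course depends on the cluster):

# nodes holding RAM and disc copies of the maintenance table
rabbitmqctl eval 'mnesia:table_info(rabbit_node_maintenance_states, ram_copies).'
rabbitmqctl eval 'mnesia:table_info(rabbit_node_maintenance_states, disc_copies).'

If the table behaves as described above, the first call returns a single node and the second an empty list.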

michaelklishin commented 2 years ago

Node maintenance state is inherently transient and is intentionally reset on node boot (and cannot be determined when the node is stopped). So if the contents are inherently transient, it doesn't matter much whether the table itself is durable or not. Once 4.0 ships and the schema database becomes much easier to introduce changes to, this state can use a durable table, but that won't change any of the transient semantics.

Every node that has the feature flag enabled gets a copy of that table.

I don't see how this makes the feature "fragile". It was meant to be used in only a couple of ways: to mark a node for maintenance before it is shut down by a rolling upgrade mechanism, and to unmark it if that mechanism cannot proceed. It's not meant to do much.
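In CLI terms that is the drain/revive pair, e.g.:

# put the node into maintenance mode before shutting it down for an upgrade
rabbitmq-upgrade drain
# take it out of maintenance mode if the upgrade cannot proceed
rabbitmq-upgrade revive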

What are those "undocumented measures"?

TheGOro commented 2 years ago

Node maintenance state is inherently transient and is intentionally reset on node boot (and cannot be determined when the node is stopped). So if the contents are inherently transient, it doesn't matter much whether the table itself is durable or not. Once 4.0 ships and the schema database becomes much easier to introduce changes to, this state can use a durable table, but that won't change any of the transient semantics.

These transient semantics are all fine and would be in line with the intention to use ram_copies only, but ensure_table_copy contradicts this, as it tries to create a disc copy if the given node holds no other type of copy of that table.

Every node that has the feature flag enabled gets a copy of that table.

That's not really the case: the "enable" clause of the migration function is called only once, when the table itself does not exist. Thus it is created as a RAM copy on the first node in the cluster (here ensure_table_copy is a no-op, because you cannot change the storage type this way), and the rest of the nodes happily comply at join.
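To illustrate the no-op point with plain Mnesia (these are standard mnesia calls, not RabbitMQ code, and can be run via rabbitmqctl eval):

%% on a node that already holds a RAM copy of the table, adding a disc copy fails;
%% this typically returns {aborted, {already_exists, ...}}
mnesia:add_table_copy(rabbit_node_maintenance_states, node(), disc_copies).

%% actually changing the storage type would require a different call
mnesia:change_table_copy_type(rabbit_node_maintenance_states, node(), disc_copies).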

I don't see how this makes the feature "fragile". It was meant to be used in only a couple of ways: to mark a node for maintenance before it is shut down by a rolling upgrade mechanism, and to unmark it if that mechanism cannot proceed. It's not meant to do much.

We use maintenance mode during our rolling upgrades, and that is where we encountered issues: after the rabbitmq-upgrade drain invocation we monitor whether the transition is confirmed by the instance in its status information. Something like: rabbitmqctl eval '{_,V} = lists:keyfind(is_under_maintenance, 1, rabbit:status()),V.'

What are those "undocumented measures"?

As a mitigation, we execute the rabbitmqctl eval 'case rabbit_core_ff:maintenance_mode_status_migration(maintenance_mode_status_migration, nil, is_enabled) of false -> rabbit_core_ff:maintenance_mode_status_migration(maintenance_mode_status_migration, nil, enable); _ -> ok end.' command before committing the drainage. To eliminate this workaround, I'm looking into options for hooking in something similar when a given node goes through a supervised cluster join, to ensure that the table is recreated.
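(For completeness, a bare existence check for the table, without going through the migration function, is just standard Mnesia as well:

rabbitmqctl eval 'lists:member(rabbit_node_maintenance_states, mnesia:system_info(tables)).'

but the migration-function workaround above has the advantage of also recreating the table when it is missing.)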

michaelklishin commented 2 years ago

@TheGOro we never recommended those "undocumented measures", this is something that you decided to do at your own risk via arbitrary code evaluation.

Do you have a specific suggestion as to what can be improved? If it's safe for upgrades, we would consider it.

Somehow VMware's products do not use any of the eval workarounds. They also don't verify whether a node is under maintenance because that's usually not required. The goal is to preemptively make nodes offload their clients and queue replicas first and foremost. The design we have is good enough for that.

TheGOro commented 2 years ago

The maintenance check itself seems superfluous on our side, considering that the drain is not async and, if it returns successfully within the set timeout, the node can be brought down safely. Due to how the feature flag registry behaves, and because it consolidates the state of the features in the runtime-compiled module, it is actually possible to use the maintenance module and the drain function even if the table is missing.

As for ideas and suggestions, I don't have anything concrete in mind, just different possibilities for approaching the problem from various levels. Actually, one of the reasons I decided to chime in on this thread was to gather some information about the intended design, so that I can handle it at the cluster resource management level. However, I can go further and do some experimental changes at the rabbit module level.

michaelklishin commented 2 years ago

One reason why the current design works perfectly fine for some VMware products is that they deal with one node at a time.

We by no means consider the current design perfect; it's just good enough. We can switch to a durable replicated table (in particular once Khepri ships with RabbitMQ 3.11 as an emerging subsystem that only a few features use) or, even easier, make every node re-create the table if it detects that the feature flag is enabled on boot.

TheGOro commented 2 years ago

or, even easier, make every node re-create the table if it detects that the feature flag is enabled on boot.

That's what I'm thinking about too, and I consider it a quick and not-that-dirty solution to start experimenting with. But even having a RAM copy on all the nodes wouldn't really hurt anyone, considering the crazy amount of transactions and the extent of data such a table will receive during its lifetime.
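A rough sketch of what I have in mind, to be run on boot once the feature flag state is known; the module name, record name and attribute list here are my guesses rather than the actual definitions from the rabbit source:

-module(maintenance_table_boot).   %% illustrative module name
-export([maybe_recreate_maintenance_table/0]).

%% Re-create the table as a local RAM copy if the feature flag is enabled
%% and the table is not known to this node yet.
maybe_recreate_maintenance_table() ->
    case rabbit_feature_flags:is_enabled(maintenance_mode_status) of
        true ->
            case lists:member(rabbit_node_maintenance_states,
                              mnesia:system_info(tables)) of
                true ->
                    ok;
                false ->
                    {atomic, ok} =
                        mnesia:create_table(
                          rabbit_node_maintenance_states,
                          [{ram_copies,  [node()]},
                           {record_name, node_maintenance_state}, %% guessed
                           {attributes,  [node, status]}]),       %% guessed
                    ok
            end;
        false ->
            ok
    end.

(A real implementation would also have to add a local copy when the table already exists but only on other nodes.)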

michaelklishin commented 2 years ago

The table obviously will be very lightly loaded and store very few rows :) but every distributed table complicates upgrades, and can break mixed version cluster support easily if you are not careful about how it is introduced. Hence the feature flag.

michaelklishin commented 2 years ago

I'd be happy to make the change if we agree that it would be an improvement. At least it would make the design more logical. But the tooling that automates deployments would still have to use it with every node individually, one at a time.

TheGOro commented 2 years ago

That would absolutely be an improvement. Just notify me if you have something and I can check how it behaves in our fault reproduction test case to give feedback.

michaelklishin commented 2 years ago

@TheGOro filed https://github.com/rabbitmq/rabbitmq-server/issues/4755.