Protection of virtual hosts from "fat finger" deletion

michaelklishin commented 1 day ago

Problem Definition

Every so often we see users delete virtual hosts by accident. This can happen both over the HTTP API and with rabbitmqctl delete_vhost.

When this happen, certain kinds of virtual host data can be restored relatively easily. The definitions data set is small (compared to tens or hundreds of GiBs a stream can contain, for example). Unfortunately, for messages stored in queues (not streams), the backup and restore problem is a very different in scope.

Tanzu RabbitMQ has a Warm Standby Replication feature that replicates both schema and messages (for QQs, streams, CQs) to a remote warm standby cluster or multiple clusters. This does help with certain fat finger errors but not all of them, and only if a reasonably large schema synchronization interval is used.

Since virtual host deletion is one of the most destructive operations (up there with node reset), it could use some extra protection, similarly to how opt-in (experimental) feature flag enablement was made much more difficult to do accidentally in https://github.com/rabbitmq/rabbitmq-server/pull/11998, https://github.com/rabbitmq/rabbitmq-server/pull/12643.

In addition, specific clusters, systems, and RabbitMQ-based distributions such as Tanzu RabbitMQ can have system virtual host. "System" here means .

Proposed Solution

Virtual hosts can be marked with additional metadata. This metadata today stores the DQT (default queue type), a description, a and set of tags.

The same metadata map can be used to enable protection from deletion. If set, rabbitmqctl delete_vhost and DELETE /api/vhosts/{name} will fail with a reasonably specific error asking the user to delete the metadata key from the target virtual host first, then retry.

Deletion of the key will act as a confirmation for CLI tools, and will require the same permissions as DELETE /api/vhosts/{name} in the case of HTTP API.

Dedicated CLI commands and HTTP API endpoints will be provided for locking and unlocking deletion.

Currently Available Options

DELETE /api/vhosts/{name} requires the user to be tagged as administrator.

For CLI tools, such "roles" do not exist, so an explicit confirmation is particularly important to have.

sateeshkvk7 commented 1 day ago

Thank you, @michaelklishin, for developing/enhancing the extra-protection feature! Additionally, is it possible to add another feature option to duplicate virtual hosts? This would enable users to create exact copies of virtual hosts, including their configurations, messages, and resources. Such a feature would be incredibly useful for testing, backup, recovery, and migration, as it allows for modifications without affecting the original setup or provides a quick restore option.

Zerpet commented 1 day ago

I like this idea. I'm wondering about a "force" parameter both both rabbitmqctl delete_vhost and DELETE /api/vhosts/{name}. Similar how rm has rm -f flag. The user would be explicitly requesting JustDoIt ❗ By default, the protection, if present, will stop the deletion.

A "force" option could improve UX, because I only have to remember to add --force; whereas to remove the metadata, I have to remember 2 additional pieces of information: 1) remove vhost metadata command + 2) the metadata protection tag name. I hear the argument that a "force" option would partly defeat the purpose of this protection. I think it would be a matter of preference how robust/strict we want to make this protection.

michaelklishin commented 19 hours ago

@sateeshkvk7 a virtual host can have many GiBs of data, so "duplicating" it is not something we plan on doing. WSR in Tanzu RabbitmQ "continuously duplicates" a cluster and it took several person-years to get there.

This issue is for OSS RabbitMQ and only has to do with a deletion protection mechanism.

michaelklishin commented 19 hours ago

@Zerpet that's the point, make you remember and have to run an additional command. You have opted in for deletion protection after all. There isn't really a -f equivalent for the HTTP API.

For CLI tools, we can also add a -y or --confirm flag but that seems a bit pointless to me because it will become a part of all scripts, shell histories, muscle memory and in the end won't really have any effect.

Explicitly opting in for deletion protection and the requirement to explicitly lift it won't be used "as automatically".

Zerpet commented 18 hours ago

@Zerpet that's the point, make you remember and have to run an additional command. You have opted in for deletion protection after all. There isn't really a -f equivalent for the HTTP API.

For CLI tools, we can also add a -y or --confirm flag but that seems a bit pointless to me because it will become a part of all scripts, shell histories, muscle memory and in the end won't really have any effect.

I was thinking about an additional query parameter to support "force" delete via the HTTP API. I see your point, and I think it's reasonable. If we go down this route, it would be useful to have dedicated commands to add or remove the metadata protection tag. I agree with having to remember one command (to update vhost tags), but I think remembering the specific protection tag (in addition to the command) would lead to bad UX; specially if we expect this command to be used infrequently i.e. how often do you create/delete vhosts?

michaelklishin commented 17 hours ago

@Zerpet yes, dedicated CLI commands and HTTP API endpoints will be provided.

rabbitmq / rabbitmq-server