PKI Secret Engine auto-tidy

ser6iy commented 1 year ago

The wrong concept was used from the beginning; it needs to be redone.

The PKI Secret Engine documentation for auto-tidy (https://developer.hashicorp.com/vault/api-docs/secret/pki#configure-automatic-tidy) describes a parameter, interval_duration (https://developer.hashicorp.com/vault/api-docs/secret/pki#interval_duration). The documentation should explicitly state that its default value is 12 hours.

[interval_duration](https://developer.hashicorp.com/vault/api-docs/secret/pki#interval_duration) (string: "") - Specifies the duration between automatic tidy operations; note that this is from the end of one operation to the start of the next so the time of the operation itself does not need to be considered.

Since the next cleaning starts one interval_duration after the end of the previous one, its start time drifts over time and will eventually fall during business hours, when Vault is already under load from users.
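To make the drift concrete, here is a minimal simulation of the scheduling rule quoted above (interval measured from the end of one run to the start of the next). The 12-hour interval comes from the default mentioned above; the 90-minute run time and the starting date are made-up numbers for illustration only, and `simulate_tidy_starts` is a hypothetical helper, not part of Vault.

```python
from datetime import datetime, timedelta

def simulate_tidy_starts(first_start, interval, run_durations):
    """Return the start time of each tidy when the interval is counted
    from the END of one run to the START of the next."""
    starts = []
    start = first_start
    for duration in run_durations:
        starts.append(start)
        # The next run begins one interval after this run finishes.
        start = start + duration + interval
    return starts

# Assumed values: first run on a Friday at 23:00, every tidy takes 90 minutes.
starts = simulate_tidy_starts(
    first_start=datetime(2023, 6, 2, 23, 0),
    interval=timedelta(hours=12),
    run_durations=[timedelta(minutes=90)] * 6,
)

for s in starts:
    print(s.strftime("%a %H:%M"))
```

With these assumed numbers, each start lands 13.5 hours after the previous one, so the time of day keeps moving and a run that began late on Friday ends up starting in the middle of a weekday.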

With a significant number of PKI certificates stored, Vault puts a heavy read load on the storage backend during tidy and uses far more RAM than usual (around 10x) to index or check them.

Therefore, a cron-like approach is better suited to these tasks. It should be possible to specify the day and time at which cleaning starts, for example Friday at 11pm, so that cleaning runs over the weekend without interfering with regular work and finishes faster because there is little or no user load. And add a check that if the cleaning is already in progress (there can be a lot of certificates, they missed the previous cleaning or something else), then do not start a new one.
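A rough sketch of what is being asked for, assuming a weekly "day of week + time" schedule. Both functions are hypothetical and only illustrate the idea; they are not an existing Vault option.

```python
from datetime import datetime, timedelta

def next_weekly_start(now, weekday, hour, minute=0):
    """Next occurrence of (weekday, hour:minute) strictly after `now`.
    `weekday` follows datetime.weekday(): Monday=0 ... Sunday=6."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed; move to next week.
        candidate += timedelta(days=7)
    return candidate

def should_start_tidy(now, scheduled, tidy_in_progress):
    """Start only when the scheduled time has arrived and no tidy is running."""
    return now >= scheduled and not tidy_in_progress

FRIDAY = 4
monday_morning = datetime(2023, 6, 5, 10, 0)
scheduled = next_weekly_start(monday_morning, FRIDAY, 23)
print(scheduled)
```

Unlike an end-to-start interval, the start time here is always recomputed from the wall clock, so it does not drift no matter how long the previous run took.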

cipherboy commented 1 year ago

@ser6iy This is indeed an issue, and a bit of a hard problem...

Presently, PKI tidies are expensive for a couple of reasons:

(RAM) There can be many certificates stored within a mount. Paginated lists, a desired future enhancement to Vault, could help here (as we wouldn't store all certificates in memory), but so could things like using no_store=true, avoiding the need to tidy (most) certificates. Obviously this doesn't jibe with ACME (which requires no_store=false on roles), and storing certificates is unavoidable in other operational scenarios.
(Storage/Compute) They read+parse each certificate/ACME object/revocation entry and check if it has (presently) expired. Things like pause_duration help here, by decreasing the number of certs processed in a given interval. Using a different data structure could limit our search to more-likely-to-be-tidied certificates (e.g., maintaining another mapping by expiration) -- and perhaps help with the storage-related cases in the best case (many different expiration times).
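As an illustration of the "another mapping by expiration" idea, here is a small in-memory sketch using a min-heap keyed on expiry time. This is only one possible shape for such an index, not how Vault stores certificates; `ExpirationIndex` and its methods are hypothetical names, and the `safety_buffer` argument is just loosely modeled on the tidy option of the same name.

```python
import heapq

class ExpirationIndex:
    """Keeps (not_after, serial) pairs ordered by expiry so a tidy only
    touches entries that are actually due, instead of scanning everything."""

    def __init__(self):
        self._heap = []

    def add(self, serial, not_after):
        heapq.heappush(self._heap, (not_after, serial))

    def pop_expired(self, now, safety_buffer=0):
        """Remove and return serials whose expiry plus buffer is in the past."""
        expired = []
        while self._heap and self._heap[0][0] + safety_buffer <= now:
            _, serial = heapq.heappop(self._heap)
            expired.append(serial)
        return expired

    def __len__(self):
        return len(self._heap)

index = ExpirationIndex()
index.add("aa:01", not_after=100)   # timestamps are arbitrary example values
index.add("aa:02", not_after=500)
index.add("aa:03", not_after=120)

due = index.pop_expired(now=150)
print(due, len(index))
```

The loop stops at the first entry that is not yet due, so the cost of a tidy pass depends on how many certificates have expired rather than on how many are stored.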

We also have some customers who run a huge number (3k+) of PKI mounts, with lots of stored certificates per mount, making PKI tidies expensive not only in the local sense (of a mount), but in the global, cluster-resource sense as well.

While cron-based scheduling is desirable for some, with this large number of mounts, if all of these PKI mounts tidied at exactly the same time, it would definitely overwhelm the cluster. By using interval_duration, one gets some natural variability (based on when the last tidy concluded, which is influenced by how long the tidy took), and so there's a chance for all 3k+ tidies not to be running simultaneously. The design was also influenced in part by the data we had available and by a desire for simplicity: if the existing tidy has finished very recently, you're unlikely to want to schedule one right away.
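A toy model of that variability, under simplifying assumptions: four mounts whose tidies take 1, 2, 3 and 4 hours (made-up numbers), compared under a shared fixed start time versus an end-to-start interval of 12 hours. All names here are hypothetical.

```python
def max_concurrent(runs):
    """Peak number of overlapping (start, end) runs. A run that ends at
    time t is not counted as overlapping one that starts at t."""
    events = []
    for start, end in runs:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal times, -1 (end) sorts before +1 (start)
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

def nth_interval_run(n, duration, interval):
    """(start, end) of the n-th run when each run starts `interval`
    hours after the previous one ended, with the first run at t=0."""
    start = n * (duration + interval)
    return (start, start + duration)

durations = [1, 2, 3, 4]  # hours per tidy, one entry per mount (assumed)

# Fixed time-of-day schedule: every mount starts at the same instant.
fixed_time = [(0, d) for d in durations]

# Interval schedule: by the fourth run (n=3) the starts have spread out.
drifted = [nth_interval_run(3, d, 12) for d in durations]

print(max_concurrent(fixed_time), max_concurrent(drifted))
```

In this toy setup the mounts still start together on the very first run; the spreading only appears after a few cycles, and real mounts with similar run times would spread far less cleanly.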

If there are not a huge number of certificates in the mount (and thus, not starving the Vault cluster for memory), then (IMO) pause_duration should be sufficient to box the resource consumption into a sufficiently small window that regular issuance and revocation succeed just fine. Running such a resource-bound tidy automatically (regardless of when) has far more benefits than having precise scheduling. But obviously this is not true globally, and there's definitely value in cron-like scheduling as well.
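For reference, a sketch of an auto-tidy configuration that leans on pause_duration. The field names are taken from the PKI auto-tidy API docs linked at the top of this issue; the values, the `pki` mount name, and the `build_auto_tidy_config` helper are assumptions for illustration and should be checked against the docs for your Vault version.

```python
import json

def build_auto_tidy_config(interval_duration, pause_duration, safety_buffer):
    """Build a request body for the PKI auto-tidy config endpoint.
    All durations are passed as duration strings."""
    return {
        "enabled": True,
        "interval_duration": interval_duration,
        "tidy_cert_store": True,
        "tidy_revoked_certs": True,
        "safety_buffer": safety_buffer,
        # Pause between individual certificates to bound the resource
        # usage of the tidy, at the cost of a longer overall run.
        "pause_duration": pause_duration,
    }

config = build_auto_tidy_config(
    interval_duration="12h",
    pause_duration="500ms",   # example value, tune for your cluster
    safety_buffer="72h",
)

# Assumed mount path "pki"; the body would be POSTed to this endpoint.
endpoint = "/v1/pki/config/auto-tidy"
body = json.dumps(config, indent=2)
print(endpoint)
print(body)
```

A longer pause_duration stretches the tidy out over more wall-clock time, which in turn pushes the next interval-based start further out.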

Additionally, each mount type today has its own tidy operation with its own semantics. PKI's is one of the more complicated, and hence the introduction of an automated, cancel-able tidy, but Transform has one, AppRole has another, &c.

Here, IMO, we really need a Vault-wide tidy interface (\o hence the Core tag) that standardizes both ways of scheduling a tidy (interval-based and time-of-day-based), that allows mutual exclusion and perhaps time-boxed execution (to prevent a tidy from continuing past some duration), and that offers a standard UX to configure & enable tidies, regardless of the underlying mount type.
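Purely as a thought experiment, the mutual-exclusion and time-boxing parts of such an interface might look something like the following. Nothing here exists in Vault; `TidyRunner` and its return values are hypothetical, and Python is used only to keep the sketch short.

```python
import threading
import time

class TidyRunner:
    """Runs at most one tidy at a time and stops once a time box elapses."""

    def __init__(self, max_runtime_seconds):
        self._lock = threading.Lock()
        self._max_runtime = max_runtime_seconds

    def run(self, work_items, process, clock=time.monotonic):
        """Process items until finished or out of time.
        Returns (status, number_of_items_processed)."""
        # Mutual exclusion: refuse to start if a tidy is already running.
        if not self._lock.acquire(blocking=False):
            return ("skipped", 0)
        try:
            deadline = clock() + self._max_runtime
            done = 0
            for item in work_items:
                # Time box: check the deadline before each item.
                if clock() >= deadline:
                    return ("timed_out", done)
                process(item)
                done += 1
            return ("completed", done)
        finally:
            self._lock.release()

runner = TidyRunner(max_runtime_seconds=10)
tidied = []
result = runner.run(["cert-1", "cert-2", "cert-3"], tidied.append)
print(result, tidied)
```

The scheduler (interval-based or time-of-day-based) would then only decide when to call `run`; each mount type would supply its own `work_items` and `process`.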

Note that this statement:

> And add a check that if the cleaning is already in progress (there can be a lot of certificates, they missed the previous cleaning or something else), then do not start a new one.

already occurs under the existing design.

ser6iy commented 1 year ago

@cipherboy Yes, you are right: both schedulers for cleaning have their advantages. It would be ideal to have both types available and to be able to select between them in the config. I hope this will be implemented. Thanks for your detailed answer.

hashicorp / vault

PKI Secret Engine auto-tidy #21041