PKI mounts may end up in a broken state due to intermittent storage errors

Describe the bug Looks like intermittent storage failures during Vault startup/unseal can lead to PKI mounts ending up in a broken state. Vault will log the following lines:

[ERROR] secrets.pki.pki_2e0e9d63: Failed loading PKI migration status, staying in legacy mode: <..storage error..> [ERROR] secrets.pki.pki_2e0e9d63: Error during migration of PKI mount: <..storage error..> [ERROR] core: failed to initialize mount entry: path=pki/ error=<..storage error..>

and then the PKI mount is unusable indefinitely, even after the storage becomes healthy. When trying to e.g. generate a certificate, the following can be seen from Vault logs:

[INFO] secrets.pki.pki_0c477ce6: Using legacy CA bundle as PKI migration has not completed.

and on the caller side:

Code: 500. Errors: 1 error occurred:

could not fetch the CA certificate (was one set?): no legacy cert bundle exists

This can be reproduced on latest version (v1.12.2). Looks like the fault can be reproduced on v1.11.x track as well, after the introduction of initialize() function in PKI. I believe the function is not re-executed if it fails, which leads to this problem.

To Reproduce Steps to reproduce the behavior:

# 1. Start Vault
vault server -config vault.hcl

# 2. Initialize Vault
vault operator init -key-shares=$shares -key-threshold=$shares

# 3. Enable PKI backends
create_root_ca  # see below
create_sub_ca  # see below

# 4. Stop Vault
pkill vault

# 5. Remove read permissions in PKI backend
chmod 200 tmp/vault/vault-data/logical/bc2f917a-8c42-1fe6-808f-36e8d9cf6456/config/_legacyMigrationBundleLog

# 6. Start Vault & unseal
vault server -config vault.hcl
vault operator unseal $key

# 7. Observe errors, try to generate cert which fails
vault write -format=json sub/issue/example-dot-com common_name="test.example.com" ttl="24h"

# 8. Restore permissions in backend
chmod 600 tmp/vault/vault-data/logical/bc2f917a-8c42-1fe6-808f-36e8d9cf6456/config/_legacyMigrationBundleLog

# 9. PKI mount unusable, generate cert still fails
vault write -format=json sub/issue/example-dot-com common_name="test.example.com" ttl="24h"

Expected behavior Intermittent storage errors should be handled. Perhaps the code should try to migrate the PKI backend again at a later stage if the first try was unsuccessful?

Environment:

Vault Server Version (retrieve with vault status): v1.11.0+
Vault CLI Version (retrieve with vault version): v1.11.0+
Server Operating System/Architecture: any

Vault server configuration file(s):

$ cat vault.hcl 

storage "file" {
    path = "/tmp/vault/vault-data"
}

listener "tcp" {
  address = "127.0.0.1:8200"
  tls_disable = true
}

log_level = "trace"
api_addr = "http://127.0.0.1:8200"

PKI configuration:

cert_dir=/tmp/vault/vault-certs

create_root_ca() {
    vault secrets enable pki
    vault secrets tune -max-lease-ttl=87600h pki

    mkdir -p $cert_dir
    vault write -field=certificate pki/root/generate/internal \
        common_name="example.com" \
        issuer_name="root" \
        ttl=87600h > $cert_dir/root.pem

    vault write pki/roles/root-role allow_any_name=true

    vault write pki/config/urls \
        issuing_certificates="$VAULT_ADDR/v1/pki/ca" \
        crl_distribution_points="$VAULT_ADDR/v1/pki/crl"
}

create_sub_ca() {
    vault secrets enable -path=sub pki
    vault secrets tune -max-lease-ttl=43800h sub

    vault write sub/config/urls \
        issuing_certificates="$VAULT_ADDR/v1/sub/ca" \
        crl_distribution_points="$VAULT_ADDR/v1/sub/crl"

    vault write -format=json sub/intermediate/generate/internal \
        common_name="example.com Intermediate Authority" \
        issuer_name="example-dot-com-intermediate" \
        | jq -r '.data.csr' > $cert_dir/sub.csr

    vault write -format=json pki/root/sign-intermediate \
        issuer_ref="root" \
        csr=@$cert_dir/sub.csr \
        format=pem_bundle ttl="43800h" \
        | jq -r '.data.certificate' > $cert_dir/sub.pem

    vault write sub/intermediate/set-signed certificate=@$cert_dir/sub.pem

    vault write sub/roles/example-dot-com \
        issuer_ref="$(vault read -field=default sub/config/issuers)" \
        allowed_domains="example.com" \
        allow_subdomains=true \
        max_ttl="720h"
}

Additional context We have also seen this issue right after the mount point has been enabled, then the error seen is this one. Basically, if migrateStorage() fails, then pkiStorageVersion is never set to 1, so useLegacyBundleCaStorage always returns 0/nil. So it looks like the same issue as above, but a bit more difficult to reproduce due to the timing aspect.

I have a feeling that this is a generic problem with the Initialize()-function logic. The function is executed as a one-shot thing, which if it fails is not retried. So theoretically any backend using Initialize() may exhibit problems if it fails.

Code snippets:

startBackend sets the backend to loaded before running Initialize() here
But startBackend is only run if loaded is false, here

I do understand that we cannot run the whole startBackend several times, but should the load/initialize logic be separated into it's own routine that is run until success? Just some thoughts.

To answer this, I think we don't really expect initialize() to run multiple times (e.g., until success) in the plugin. This would be a breaking change if implemented globally at the Vault level I believe. Certainly it gets run multiple times over the course of a Vault process, but note that the backend itself gets destroyed and recreated when e.g., cluster leadership changes.

Within PKI, I think our expectation was that if something failed during the initialization, we'd let the operator deal with it and optionally reload the plugin.

As your reproducer demonstrates, I don't think the PKI backend itself can necessarily resolve the issue, and indeed, Vault itself might not be able to either, so even if we were to re-attempt it, we'd need an upper bound.

Fundamentally, I think within the PKI plugin, implementing this ourselves would require hooking every single entry point into the plugin (routes, invalidate, periodic, ...) and checking if we weren't initialized, which seems like a very invasive change for something we might not be able to solve.

The fallback behavior is a best-effort to allow some operations to continue, but as you noticed, is by no means complete or sufficient.

I wonder if we could limit the fallout a bit @stevendpclark if we don't see a legacyCABundle entry? E.g., if we can't read the log due to error, but we do know there is no legacy bundle, could we log the error and move into non-legacy mode anyways? I think we'd have to revert the behavior if we ever introduce a third migration, but given the second migration depends on legacyCABundle as the trigger... Thoughts?

I tend to agree with @cipherboy's description of the issue.

Singling out just the migration log entry I doubt is a realistic use-case of general storage errors, the most likely use case would be the entire path or something else has gone wrong. That leads me to believe the ability to recover just from that bad read would probably be of little value overall and just complicate the code base for little gain imho.

I think deferring to the operator in the use-case would be best overall position we should take.

Yeah honestly I don't know how it would be possible to fix the initialize() logic, it will be breaking change as you say. Every backend that implements it would need to have an idempotent implementation in order to not cause side effects.

The use case we have is that there could be hundreds of PKI backends in Vault at the same time. During startup there is momentarily a storage hiccup that causes one or more of these PKI backends to error out in the initialize() phase. So it's not only one key that is inaccessible, but the whole storage. The next moment storage is back to normal.

What could the operator do in this case? Seal/unseal in order to re-init the whole thing (but storage could experience problems again..)? Or can the operator "revive" a single PKI backend back to healthy state? We have tried to automate the operation of Vault as much as possible, so we would like to avoid having a person step in.

What do you think of a solution where migration status is checked as part of periodicFunc? If still in legacy mode then migration is made. As far as I know, this function is run as part of the rollback manager that executes once per minute, so in the worst case the PKI backend would be out of operation for 1 minute before being fully operational.

@nsimons It depends now on your Vault version. On Vault 1.11 and before, all mount initialization occurred in parallel. As of a soon-to-be-released Vault 1.11.7 I believe, you can set an envvar to control how many execute in parallel. In 1.12+, due to plugin restructuring, only a single mount initialization occurred at a time, but with a subsequent PR for 1.12.3+, you should be able to control parallelism there too. The variable is VAULT_POSTUNSEAL_FUNC_CONCURRENCY if you want to build a pre-release Vault version and test that.

To revive a particular backend, note that you can run a plugin reload with a scope=global, to cause it to be reloaded on all nodes, which'll reset the backend and reinitialize it.

As noted in the 1.11 PR above, Rollback manager continues to execute in full parallel (after the initial run which sets up mounts on 1.11) -- even in 1.12+. Say you don't set the envvar, and reach say, 50% with storage errors. On second initialization run, you'll now "fix" some amount, say, leaving you with 25% left. Then maybe 12.5%... &c until eventually you get below the threshold to fix this. Which seems weird, as now your consumers are stuck waiting an unknown amount of time, since they don't know which is truly available (part of the 1.12 changes is Vault isn't considered up until all mounts have finished, versus in 1.11 where the mount initialization was done on a first-come basis, other than periodic func triggering them all).

Or, you could just set the parallelism limiter to a value which, while slower, would allow you to avoid storage errors. :-)

FWIW, storage errors do to concurrency/load/... would make me really cautious to use that backend, as given a high enough load on the cluster normally, I'd expect you to still hit that anyways. Have you benchmarked the cluster and seen if you can reproduce these errors during steady-state operations?

Looking at #18667 it kinda looks like you have unintentionally -- and that even post-startup, during PKI CA initialization, you still get transitive storage errors as you're just creating so much load. I think perhaps I'd suggest evaluating your storage backend, as many things in Vault would not work ideally under load. E.g., even further, once these are setup, if you use no_store=false and generate tons of certs (perhaps EC based CA+leaves for speed), you'll likely hit the same storage transitive errors that might result in certs not being stored (but all the generation having been done, causing high load and forcing clients to retry), or similarly, issues with K/V usage or anything which reads/writes multiple entries. I.e., for any operation, you're going to end up chasing a lot of things trying to get Vault usable on flaky storage backend, which probably won't work out well without transactions and either way will cause lots of retries from clients.

Out of curiosity, what storage backend are you using?

I've probably been poorly explaining the background, sorry about that! It's not a load issue. We have Vault + ETCD backend running in Kubernetes and are looking to make our deployment more robust by pin-pointing error scenarios and their consequences.

We have been striving to have Vault automatically survive sudden and intermittent infrastructure issues, whether it is network delay, disk delay, sudden pod crashes, DNS outage, or generally any of the ETCD failures listed here (e.g. leader election leading to backend service outage). This is very important to us.

In our case we have a Kubernetes operator (another microservice) that automatically generates PKI mounts on the request of clients, so the Vault operator/admin is not part of the equation and cannot do the reload. Now it seems that a sudden infrastructure problem during generation of a PKI mount can cause the PKI to never be migrated, yielding it unusable.

Vault generally handles infrastructure issues very well, but these two new issues have popped up in Vault 1.11. There was another issue reported in Kubernetes auth method plugin, basically due to the same one-time-execute nature of Initialize(), but fortunately it was fixed.

We noticed that several PKI functions (e.g. root CA generation) check whether migration has happened or not, and then errors out. Is there some technical reason why this check couldn't kick off a migration? Just thinking of possible solutions.

Edit: Another scenario could be when auto-unseal is in use and Vault goes from sealed to unsealed state during operation, completely unattended. The admin is not present to detect PKI migration failures and resolve them.

@cipherboy, I would like to propose a PR where we check whether migration was done inside periodicFunc, if not we would try to run the migration. Would such PR be appreciated?

\o hey @nsimons!

In discussing this more with the team, we've decided we won't be addressing this ticket. Thank you for the offer of a PR, but we would not merge it.

Our concerns are mostly around burden on the Vault cluster and from a position of conservatism. We've had a few incidences where the large number of PKI mounts needing migration caused outages for customers. This manifested mostly as leadership thrashing, causing nodes to quit migrations partway through. Notably, this occurred post-Rollback Manager startup, as the quantity of mounts caused setup to past Rollback Manager starting. In some cases, this was exacerbated by the usage of Seal Wrapping, which we're unable to detect within the plugin itself.

While we've added some mitigating controls, using the Rollback Manager (and thus, PKI's PeriodicFunc) in this way, especially without more controls around runtime concurrency limits, would not address the issue. We believe it would likely contribute to workload during PeriodicFunc, putting more pressure on Vault and potentially a Seal Wrap mechanism, leading to even more leadership election problems.

While this leaves it up to operators to have monitoring in place to detect and remediate, given the concerns a Vault-initiated automated process could cause, we feel they're best situated to re-attempt (via reloading of specific mounts) migration and thus address this problem.

Thanks @cipherboy for getting back to me!

I wonder if I understand the Vault code correctly. Vault needs to check the migration status after each unseal, regardless if an actual migration needs to be done or not (pkiStorageVersion starts out with value 0 each time). An actual migration is not necessarily happening after unseal, Vault simply asks the storage backend for the migrationInfo-related keys and then sets pkiStorageVersion=1 if migration is already done. Surely this use case is not putting a burden on Vault? It's just standard configuration reads.

The implications of leaving this to the user are quite drastic. In the case where auto-unseal is used, it would mean that the user would need to develop their own monitoring system that

1) detects that Vault has been unsealed 2) for each PKI backend, create a loop that sends a request to see if the backend has migrated successfully, if not send a reload command, then continue the loop to re-check if migration went through and so on.

If this is really the way forward then I would hope to have this documented somewhere in the PKI section.

Right, but go a step deeper: what if a migration does need to occur? We recently bumped pkiStorageVersion=2 in 1.12. Why would we/operators expect a retry to load the mount post-migration, but not in the event of migration? :-) In the event of a migration needing to occur, the admin still needs that same type of monitoring. Relying on it only in the event of a migration needing to occur would mean it is a much less well tested system than otherwise.

Plus, as mentioned before, this is not unique to the PKI engine at all. Transform and AWS Auth also have the same issue.

Also, I feel there is a distinction between PKI/Transform/AWS Auth and CertAuth/K8S Auth is that the former are strictly doing migrations on startup and the latter are periodically refreshing configs from other nodes (reloading CRLs and TLS config respectively). The former set of plugins involves storage write operations, the latter appear to only require reads.

All in all, shoring up the reliability and availability of the underlying storage system should be considered, IMO.

Why would we/operators expect a retry to load the mount post-migration, but not in the event of migration? :-)

The fundamental problem is that we are trying to handle two use cases with the same code logic.

We have version upgrades which may cause migrations, so we do not retry due to load concerns.
and we have Vault unseal operation which will not cause a migration.

The frequency and pre-conditions of these two use cases are vastly different.

The first case happen (currently) at most two times, in two specific upgrade scenarios. The operator is knowingly executing the upgrade and is most likely present and can act on the various error scenarios. The operator might even have read the release notes and knows that a PKI migration is about to happen.

The second case can happen any time during normal operation, for example in the case of Kubernetes, pod terminations are common place. Sure, the operator might be present to unseal Vault, but might not have the necessary knowledge to recognize PKI initialization failures, nor act on them. Then as I mentioned, we have the unmanaged case with auto-unsealers.

I would argue that the second case is much more frequent than the first.

In the extreme case, pkiStorageVersion=2 is the last possible migration for years to come, but Vault will still need to be unsealed in the future.

So I am of the strong opinion that the code should cater for the second use case, more so than the first. If load is an issue with the first use case, then I think we ought to make the trade-off that Vault works "asymmetrically" when handling potential retries. If the operator already expects that PKI migrations will need manual recovery, I don't see why they couldn't have the same expectation even in the future. We could even add a warning printed to the log periodically, to remind the operator that migration is not done and needs manual recovery. What I mean is, something like:

// retry logic

if migrationDone() || migrationNotNeeded() {
  // happy path, nothing to do
} else {
  // do not retry migration due to load concern, instead put a warning in the log.
}

Luckily, checking migrationDone() can be done through the atomic integer, so it will not cause any reads to the storage backend.

Alright, this is my final attempt to convince you, I will stop bugging you about this now, promise ;)

hashicorp / vault

PKI mounts may end up in a broken state due to intermittent storage errors #18583