hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/
Other
31.02k stars 4.2k forks source link

Root CA getting generated even if error is returned #18667

Open nsimons opened 1 year ago

nsimons commented 1 year ago

Describe the bug We came across a scenario where generating root CA failed (error was returned) but the CA was actually generated successfully.

Upon inspecting the code, the culprit is here. The CA is generated and persisted but there are other actions that follow. If any of those actions fail, an error is returned to the caller, which suggests that the CA generation failed. But actually it succeeded. For example, in our case the function errors out when trying to build the CRL: b.crlBuilder.rebuild().

This can lead to some nasty side effects

To Reproduce Easiest way is to modify the code to always return an error from b.crlBuilder.rebuild(). Then simply try to generate a root CA by following the tutorial: Generate Root CA. Even if an error is returned the CA is visible e.g. through the /issuers endpoint.

Expected behavior An error should not be returned if CA generation is actually successful. Alternatively, CA should not be persisted if an error is being returned.

Additional context Happened on 1.11.3, but also reproducible on current main branch.

cipherboy commented 1 year ago

@nsimons I think this is desired behavior, within the bounds of Vault today.

A warning is meant to indicate there is more to do, or something might not have been done quite as intended.

An error indicates something went wrong, and IMO, may not necessarily be fatal.

In our case, Vault today lacks storage transactions, making it nearly impossible to fully and cleanly revert a partial operation.

Here, an error occurred during CRL building. This isn't fatal -- the issuer was still generated and is discoverable under the APIs -- but indicates that more-up-to-date CRLs will be unavailable until this is addressed and rectified. This could lead to operational issues if not addressed, as if clients attempt to fetch more up-to-date CRLs, none may be available.

nsimons commented 1 year ago

Right, yeah it's a classical problem.

Would it then make sense that the client always tries to delete the issuer and start over if it encounters an error? Because the client cannot really know whether an error is something that it can ignore (e.g. CRL will be rebuilt later) or something that it should act on (CA generated, but e.g. serial number was not stored (so revoke not possible?)). We can't really inspect error messages since those can change between Vault versions.

cipherboy commented 1 year ago

I don't think that's sufficient in all cases. To take your other issue (#18583 - sorry :-), if you were to chmod revocation entries, you could end up with a state wherein 1. all previous issuer creations would've succeeded, and 2. any subsequent issuer creation would fail since CRL building will now fail. Here, the client cleaning up and retrying wouldn't necessarily help: the issue isn't with the new issuer, but some other underlying bug with storage.

In general, as much as we can, we do attempt to keep key substrings of error messages stable, but I do agree in general you can't rely on it...

What about a usage-based testing approach? You can list issuers+keys before/after, and attempt to issue a new (short-lived) cert under it, attempt to revoke it, rotate the CRL (and optionally, run a tidy with a short safety buffer once the temporary leaf has expired).

If you don't see the issuer, you know it really is a fatal error somewhere (perhaps bad parameters?), if you're able to use it but just not able to revoke (and so could flag the mount to a human operator), or if it all works fine.

(The assumption here is that you need automated creation of Root CAs, which seems a little interesting as a use case but could be an interesting problem in general :-)

nsimons commented 1 year ago

No worries, the background info are the same across the two issues ;), basically we have hundreds of PKI backends and sometimes the storage has a hiccup. So for us re-creating the issuer and trying again could help I believe (we only have one issuer per PKI mount).

Usage-based testing is a good idea and that's what we are now trying to do. We check the issuers endpoint in order to determine whether the CA was created or not. So far it works, but I feel it's not that future proof.

If CRL (re)building is not fatal, would it make sense to handle the error case differently?

    // Build a fresh CRL
    err = b.crlBuilder.rebuild(sc, true)
    if err != nil {
        return nil, err
    }

Issue a warning and continue (resp.AddWarning), return resp, err, or kick off the CRL rebuild as a go-routine and continue?