hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/
Other
31.05k stars 4.2k forks source link

PKI/tokens tidy operation doesn't work if Consul KV size limit exceeded #7464

Open Nmishin opened 5 years ago

Nmishin commented 5 years ago

Describe the bug High-level issue - Vault restart to the sealed mode after 10-15 minutes that it was unsealed. In the Vault logs I see the errors: Sep 11 14:47:23 vault-0-4dvt vault[25580]: 2019-09-11T14:47:23.050-0500 [ERROR] expiration: failed to revoke lease: lease_id=pki/certs/issue/srv/pc85c22lUTQb1OYZpF4XYHZ2 error="failed to revoke entry: resp: (*logical.Response)(nil) err: error encountered during CRL building: error storing CRL: Unexpected response code: 413 (Value exceeds 524288 byte limit)" Sep 11 14:47:24 vault-0-4dvt vault[25580]: 2019-09-11T14:47:24.911-0500 [ERROR] expiration: failed to revoke lease: lease_id=pki/certs/issue/srv/I8I4keMA12V6ucnvzIkuzKQE error="failed to revoke entry: resp: (*logical.Response)(nil) err: error encountered during CRL building: error storing CRL: Unexpected response code: 413 (Value exceeds 524288 byte limit)"

To Reproduce This reproducible time-to-time in our low-level environments.

Expected behavior As I understand the tidy operation must be running automatically and delete the revoked certificates and tokens.

cat payload.json { "safety_buffer": "48h", "tidy_revoked_certs": true, "tidy_cert_store": true }

curl --header "X-Vault-Token: $VAULT_TOKEN" --request POST --data @payload.json http://127.0.0.1:8200/v1/pki/certs/tidy

Environment: vault status Seal Type shamir Initialized true Sealed false Total Shares 5 Threshold 3 Version 1.1.3 Cluster Name vault-cluster-eae9f171 Cluster ID 6934933a-62d9-06b2-3530-2c16f6f9c506 HA Enabled true HA Cluster https://x.x.x.x:8201 HA Mode active

Consul backend: consul -v Consul v1.2.2 Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Additional context Seems this fixed by manually deleting revoked certs from the path: vault/logical/7a614a7c-94ad-df80-58e9-a2d7848eb6da/revoked But I'm not sure if this is the right action from my side.

Yes, I found a similar issue with recommendations: https://github.com/hashicorp/vault/issues/2844

But I try to figure out how it can be fixed if I already have failed cluster.

cipherboy commented 2 years ago

Just a note about this if someone stumbles across it: the actual tidy will work just fine, its just that the CRL build after tidy has exceeded Consul's limits. This means the status will be Error, but most of the work will have been done (and other CRL-rebuilding operations will fail).

We're waiting on proper large-entry support from Vault Core before we can address this in the PKI engine.

CRLs can be disabled and OCSP support can be used instead, which avoids the need for large storage entries.