hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

transit mount, storage gcs, error: rollback: error rolling back - context deadline exceeded #23566

Open craftey opened 1 year ago

craftey commented 1 year ago

We face this issue in our dev and production environments. To reproduce it with a fresh Vault, I tested several versions locally on a MacBook with an empty GCS storage backend. Versions 1.9.4 and 1.9.10 do not show the issue. With 1.10.0, and e.g. 1.11.1 or 1.14.4, the error can be reproduced. So I believe this bug was introduced in 1.10.0 and has not been fixed in later versions.

Bug description: 3-5 minutes after server startup, and then every hour, we see the following in the log:

[ERROR] rollback: error rolling back: path=customer-keys/
  error=
  | 75013 errors occurred:
  | \t* failed to read value for "logical/<uuid>/policy/0001": Get "https://storage.googleapis.com/<bucket-name>/logical/<uuid>/policy/0001": context deadline exceeded
...

To Reproduce

  1. Install Vault locally, e.g. brew install vault
  2. Create a config file vault-config.hcl with the following contents, choosing your own unique bucket name:
    disable_mlock = true
    listener "tcp" {
      address     = "127.0.0.1:8200"
      tls_disable = "true"
    }
    storage "gcs" {
      bucket     = "mycompany-myproject-vault-gcs-test"
      ha_enabled = "true"
    }
    api_addr = "http://127.0.0.1:8200"
  3. Create the GCS bucket:
    gsutil mb -p my-gcp-project -l europe-west3 -c standard gs://mycompany-myproject-vault-gcs-test
  4. Start server: vault server -config vault-config.hcl
  5. In 2nd terminal, init vault and create new transit endpoint:
    export VAULT_ADDR='http://127.0.0.1:8200'
    vault operator init -key-shares=1 -key-threshold=1 > init_response
    cat init_response | grep 'Unseal Key 1:' | sed 's/Unseal Key 1: //' > unseal_key
    cat init_response | grep 'Initial Root Token:' | sed 's/Initial Root Token: //' > root_token
    vault operator unseal $(cat unseal_key)
    vault login $(cat root_token)
    vault secrets enable -path=customer-keys -force-no-cache=true transit
  6. Create 5000 keys with a shell script add_keys.sh:
    
    #!/usr/bin/env bash

    total=50
    for c in $(seq -w 0 $((total-1))); do
      # start 100 writes in parallel, then wait for the whole batch
      for i in $(seq -w 0 99); do
        vault write -f customer-keys/keys/$c$i >/dev/null &
      done >/dev/null 2>&1
      wait
      # progress, e.g. 100/5000
      echo $((1$c+1-100))00/${total}00
    done

    chmod +x add_keys.sh
    ./add_keys.sh

    # count written keys, 4000 keys or more is enough to reproduce the issue
    vault list -format=yaml customer-keys/keys | wc -l

7. Stop the server in the 1st terminal with CTRL+C, then start it again in the same terminal:
   `vault server -config vault-config.hcl 2>&1 | grep -v "failed to read value"`
   The piped grep removes the trace lines from the error message, so that its first line is easier to see.
8. In 2nd terminal: `vault operator unseal $(cat unseal_key)` (**Important step! Do not forget this unseal step!**)
9. Wait 3-5 minutes
10. See error in server log in 1st terminal

[ERROR] rollback: error rolling back: path=customer-keys/

11. Cleanup:
- Stop server in 1st terminal with CTRL+C
- Delete GCS bucket `gsutil -m rm -r gs://mycompany-myproject-vault-gcs-test 2>/dev/null`
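
Steps 9 and 10 could also be scripted instead of watching the terminal. The following is a minimal sketch, assuming the server output from step 7 is additionally written to a file. The function name `wait_for_rollback_error`, the log file name, and the timeout defaults are hypothetical and not part of Vault.

```shell
#!/usr/bin/env bash

# Poll a log file until the rollback error appears or the timeout expires.
# Usage: wait_for_rollback_error <logfile> [timeout_seconds] [poll_interval_seconds]
# Returns 0 if the error line was found, 1 on timeout.
wait_for_rollback_error() {
  local logfile="$1"
  local timeout="${2:-600}"   # assumed default: 10 minutes
  local interval="${3:-5}"    # assumed default: check every 5 seconds
  local waited=0

  while (( waited < timeout )); do
    # -q: no output, -s: stay silent while the log file does not exist yet
    if grep -qs 'rollback: error rolling back' "$logfile"; then
      return 0
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  return 1
}
```

For example, the server in step 7 could be started with `vault server -config vault-config.hcl 2>&1 | tee server.log`, and after the unseal in step 8 the 2nd terminal could run `wait_for_rollback_error server.log 600 && echo "rollback error reproduced"`.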

**Environment:**
* Vault Server Version 1.10.0 to 1.14.4
* Vault CLI Version 1.10.0 to 1.14.4
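
To check quickly whether an installed binary falls into the range reported here, a small helper like the following might be useful. It is only a sketch: the function name `is_affected_version` is hypothetical, and it encodes just the lower bound from this report (1.10.0). Whether releases after 1.14.4 are affected is not known from this thread.

```shell
#!/usr/bin/env bash

# Return 0 if the given Vault version is 1.10.0 or newer, i.e. at or above
# the first version in which this issue was reproduced. Returns 1 otherwise.
is_affected_version() {
  local version="${1#v}"   # accept "v1.10.0" as well as "1.10.0"
  local major minor rest
  # Split "major.minor.patch" on dots; the patch part ends up in $rest
  IFS=. read -r major minor rest <<< "$version"
  (( major > 1 || (major == 1 && minor >= 10) ))
}
```

Assuming `vault version` prints the version as its second field (for example `Vault v1.14.4 ...`), it could be called as `is_affected_version "$(vault version | awk '{print $2}')" && echo "in reported range"`.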

**Additional context**

This error does not happen when the key count is low, e.g. 1000.
This error does not happen with versions older than 1.10.0.
Raft storage does not produce the error.
When creating the transit backend, the option `-force-no-cache=true` can be omitted; the error is also reproducible without it.
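
Since the error depends on the key count (not seen with 1000 keys, seen with 4000 or more), the key-creation script could be parameterized to narrow down the threshold. This is a sketch only: `key_names` and `create_keys` are hypothetical helper names, the batch size of 100 mirrors add_keys.sh, and a running, unsealed Vault with a valid token is assumed.

```shell
#!/usr/bin/env bash

# Print zero-padded key names 0000, 0001, ... for the requested number of keys.
key_names() {
  local count="$1"
  local n
  for (( n = 0; n < count; n++ )); do
    printf '%04d\n' "$n"
  done
}

# Create <count> transit keys under <mount>, with at most <batch> writes
# running in parallel. Keys that already exist are left unchanged by Vault.
create_keys() {
  local mount="$1" count="$2" batch="${3:-100}"
  local started=0 name
  while read -r name; do
    vault write -f "$mount/keys/$name" </dev/null >/dev/null 2>&1 &
    started=$(( started + 1 ))
    # Wait for the current batch to finish before starting the next one
    if (( started % batch == 0 )); then
      wait
    fi
  done < <(key_names "$count")
  wait
}
```

For example, `create_keys customer-keys 2000` followed by the restart in steps 7 to 10 would show whether 2000 keys are already enough to trigger the error.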

**Questions**

What does the rollback manager do at startup and every hour?
Is this error critical? What are the consequences of this error?
Can/should we roll back our production environment from 1.12.x to 1.9.x?
Can someone reproduce and fix the error?

Thanks in advance,
Craftey

**Note:**
To install older vault versions with brew I did:

curl https://raw.githubusercontent.com/Homebrew/homebrew-core/a0ce0e6ce3c921a26db90dfe8c38b4df9f227669/Formula/vault.rb > /tmp/vault.rb # version 1.10.0

brew reinstall --formula /tmp/vault.rb


Hashes of other versions can be found here: [vault.rb history](https://github.com/Homebrew/homebrew-core/commits/master?path%5B%5D=Formula&path%5B%5D=vault.rb)

craftey commented 1 year ago

Hi @jefferai. I am mentioning you here because you helped some time ago with the Vault GCS storage backend in https://github.com/hashicorp/vault/issues/5746. I would like to ask whether you have any advice for us regarding the above issue. A shorter summary of the issue can also be read here: https://discuss.hashicorp.com/t/transit-mount-storage-gcs-error-rollback-error-rolling-back-context-deadline-exceeded/58930. We want to update Vault to the latest version, but with it we see rollback manager errors in the log in conjunction with the GCS storage backend. Thanks in advance.

toannguyen-invisible-klara commented 6 months ago

Dear all, is there any feedback on this issue? Thanks.

craftey commented 6 months ago

Hi @hsimon-hashicorp, can you get the right people to look at this? My original description contains a minimal example that should be easily reproducible. I also pinned down the version in which the issue started happening. And I had some questions at the end of my post; unfortunately, no one has had time to answer any of them yet. Thanks in advance.

vnazar commented 6 months ago

I am experiencing the same error using the Transit engine with a Postgres backend. In my case, I have around 500,000 keys created. According to the code comments in vault/rollback.go, this rollback is caused by partial errors in operations, but it's not clear to me whether this means that certain key creation, encryption, or decryption operations are not being performed. Is this critical? Is it possible to see in detail which errors are occurring?

Error:

[ERROR] rollback: error rolling back: path=transit/
error=
  | 121794 errors occurred:
  | \t* context deadline exceeded
  | \t* context deadline exceeded
  | \t* context deadline exceeded
...
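
To see which kinds of errors are hidden behind the large count, the trace lines could be grouped and counted. This is a rough sketch, assuming the server log was captured to a file and that each sub-error is on a line of the form `| \t* message` as shown above. The function name `summarize_rollback_errors` is hypothetical.

```shell
#!/usr/bin/env bash

# Read Vault server log lines on stdin and print each distinct rollback
# sub-error with its number of occurrences, most frequent first.
summarize_rollback_errors() {
  # Keep only the bulleted trace lines ("| \t* message" or "| <tab>* message")
  grep -E '^[[:space:]]*\|[[:space:]]*(\\t)?\* ' |
    # Drop the "| \t* " prefix so only the message remains
    sed -E 's/^[[:space:]]*\|[[:space:]]*(\\t)?\* //' |
    # Mask quoted storage paths and URLs so identical failure types group together
    sed -E 's/"[^"]*"/"..."/g' |
    sort | uniq -c | sort -rn
}
```

It could be used as `summarize_rollback_errors < server.log`, where `server.log` is whatever file the server output was written to.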

Environment:

Thanks in advance,

heatherezell commented 6 months ago

Thanks for the ping! I've re-surfaced this issue with our engineering teams. Hopefully we can collectively get to the bottom of this!

vnazar commented 5 months ago

Hi @hsimon-hashicorp! Is there any news about this?

Thanks