hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Mount table corrupted after GCS rate limiting #7455

Open bharanin opened 5 years ago

bharanin commented 5 years ago

Summary

Under heavy load, Vault encounters GCS rate limiting on the mount table object (core/mounts) and occasionally corrupts data. It appears to insert duplicate entries in the table.

Log Snippet

vault: [ERROR] core: failed to persist mount table: error="1 error occurred:"  

vault: * error closing connection: googleapi: Error 429: The total number of changes to the object bucket/core/mounts exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded"

[ERROR] core: failed to persist mount table: error="1 error occurred:

[ERROR] core: failed to remove entry from mounts table: error="1 error occurred:

The cluster continues to operate normally after these errors until a leader election needs to take place. At that time, we see the following:

[ERROR] core: failed to mount entry: path=org/octopus/transit/ error="cannot mount under existing mount "org/octopus/transit/""

Because of this, no instance can become active and the cluster is unavailable. We’re not aware of any way to recover the storage after this error occurs and have resorted to restoring from backups (luckily this has been in our lab/test environment).
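For context, the backend in play is the stock GCS storage backend; a minimal sketch of the stanza involved (the bucket name is a placeholder, and max_parallel is shown only because it is the backend's documented concurrency cap, not because it is a confirmed mitigation for the per-object rate limit):

```hcl
# Sketch of the GCS storage stanza in play; "example-vault-bucket" is a placeholder.
storage "gcs" {
  bucket     = "example-vault-bucket"
  ha_enabled = "true"

  # Documented option for capping concurrent requests to GCS (default "128").
  # Lowering it reduces write concurrency against the bucket, but whether that
  # helps with the per-object mutation limit on core/mounts is untested here.
  max_parallel = "32"
}
```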

Other Details

heatherezell commented 6 months ago

Hi folks! Is this still an issue in newer versions of Vault? Please let me know so I can bubble it up accordingly. Thanks!

fcrespofastly commented 4 months ago

@hsimon-hashicorp 👋🏻 I'm seeing similar issues in 1.15.4, also related to:

https://github.com/hashicorp/vault/issues/23635

We've been rate limited on the core/seal-config object:

log:   | \t* error closing connection: googleapi: Error 429: The object $BUCKET/core/seal-config exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429., rateLimitExceeded

(I think) this led to lease corruption because GCS started returning 503s. In the Vault cluster we run with 3 replicas it hit all of them at the same time, the cluster got completely sealed, and since we use auto-unseal we had to fix it by pointing Vault at a backup GCS backend. By the way, advice on how to recover from that more cleanly would be appreciated (perhaps deleting the leases?):

log: 2024-04-25T11:35:45.074Z [ERROR] expiration: error restoring leases:
 error=
 | failed to read lease entry auth/kubernetes/login/LEASE_ID: 1 error occurred:
 | \t* error closing connection: googleapi: got HTTP response code 503 with body: Service Unavailable

Other than that, I could spot several other 503s on other kinds of operations, plus context deadline exceeded / context canceled errors.

We're in a pretty dangerous situation at the moment, so any help is appreciated!
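For anyone in the same spot, the "point it at a backup GCS backend" workaround described above is just a storage stanza change plus a restart; a rough sketch with placeholder bucket names (the backup bucket has to hold a consistent copy of the original data, restored before switching over):

```hcl
# Sketch: repoint Vault at a bucket restored from backup. Bucket names are placeholders.

# Original stanza, pointing at the bucket that was returning 429s/503s:
# storage "gcs" {
#   bucket = "vault-primary-bucket"
# }

# Replacement stanza, pointing at the restored backup copy:
storage "gcs" {
  bucket = "vault-primary-bucket-restored"
}
```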

fcrespofastly commented 4 months ago

Somewhat related / potential improvement:

https://github.com/hashicorp/vault/issues/26673

itspngu commented 2 months ago

> Hi folks! Is this still an issue in newer versions of Vault? Please let me know so I can bubble it up accordingly. Thanks!

We're also seeing the rate-limit problem on the 1.15, 1.16 and 1.17 branches of Vault, though thankfully no data corruption as far as we can tell. As is the case for @fcrespofastly, the specific path in the bucket being rate-limited is /core/seal-config. Even running a single instance with ha_enabled = false in the storage "gcs" stanza triggers this behaviour.
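For completeness, that single-instance configuration looks roughly like this (bucket name is a placeholder):

```hcl
# Sketch of the single-instance setup described above; bucket name is a placeholder.
# Even with HA coordination via GCS disabled, core/seal-config still hits the
# object mutation rate limit.
storage "gcs" {
  bucket     = "vault-single-node-bucket"
  ha_enabled = "false"
}
```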

We have 3 defined listeners: one used exclusively for intra-cluster traffic and the autohealing health check of the GCP MIG for the Vault cluster, one bound to localhost for metrics scraping plus emergency access, and one bound to the instance's internal/VPC network interface/address, all targeted by separate health checks. On startup, Vault successfully unseals, initializes mounts and starts loading leases, until eventually the GCS rate limit errors occur and Vault re-seals itself (presumably because it can't access the seal configuration; I'm not familiar with the codebase, so I'm wondering why it needs to interact with GCS for that at all after having auto-unsealed successfully). At roughly the same time, a couple dozen "finished HTTP requests" to /sys/health appear in the logs, most of them reporting code 503 and a request duration of 2-3 minutes (!).
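To make that listener layout concrete, a trimmed-down sketch (addresses and ports are placeholders, TLS details omitted):

```hcl
# Sketch of the three listeners described above; addresses and ports are placeholders,
# TLS configuration omitted for brevity.

# 1) Intra-cluster traffic and the MIG autohealing health check.
listener "tcp" {
  address         = "10.0.0.5:8200"
  cluster_address = "10.0.0.5:8201"
  tls_disable     = true
}

# 2) Localhost-only listener for metrics scraping and emergency access.
listener "tcp" {
  address     = "127.0.0.1:8210"
  tls_disable = true
}

# 3) Listener on the instance's internal/VPC address, targeted by the LB health checks.
listener "tcp" {
  address     = "10.128.0.5:8220"
  tls_disable = true
}
```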

If anyone has ideas on what to try next (I doubt a separate ha_storage stanza using internal/raft will help, as the code reads as if the "normal" storage backend is used to store the seal config, but I'll try it anyway), I'm all ears. For now I'll try reducing the number of listeners to 2 and pointing the load balancer health checks at /sys/metrics instead of /sys/health, to see if it's indeed the health endpoint doing funny things that don't quite work with GCS as the backend.
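For reference, the ha_storage experiment mentioned above would look roughly like the sketch below (paths, addresses and node IDs are placeholders). Note it only moves HA coordination, so if the seal config really is read from the regular storage backend, GCS would presumably still be in the hot path:

```hcl
# Sketch: keep data in GCS but move HA coordination to integrated raft storage.
# All values are placeholders; this is the experiment described above, not a known fix.
storage "gcs" {
  bucket = "vault-data-bucket"
}

ha_storage "raft" {
  path    = "/opt/vault/raft"
  node_id = "vault-0"
}

# Explicit addresses are needed when raft handles HA coordination.
api_addr     = "https://10.128.0.5:8200"
cluster_addr = "https://10.128.0.5:8201"
```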