
Bug: failed to persist packed storage entry: ValidationException: Item size has exceeded the maximum allowed size #23793

grzechukol commented 11 months ago

Describe the bug

We are using Vault configured with the DynamoDB storage backend. We have the Kubernetes auth method configured for an EKS cluster. When we try to log in to Vault from the Kubernetes cluster in order to fetch secrets, we encounter the following issue:

$ curl \
    --request POST \
    --data '{"jwt": "'$TOKEN'", "role": "REDACTED"}' \
    https://REDACTED:8200/v1/auth/kubernetes-REDACTED/login
{"errors":["failed to persist packed storage entry: ValidationException: Item size has exceeded the maximum allowed size\n\tstatus code: 400, request id: REDACTED"]}

From the Vault logs, we can see the following error:

Oct 23 13:09:46 REDACTED vault[443655]: 2023-10-23T13:09:46.420Z [DEBUG] identity: creating a new entity: alias="id:\"ec06a6cf-d6a1-a0a3-cda2-dd5a855abcea\" canonical_id:\"8a122839-7344-df49-6d33-91c2af318efe\" mount_type:\"kubernetes\" mount_accessor:\"auth_kubernetes_5d2c3df5\" mount_path:\"auth/kubernetes-REDACTED/\" metadata:{key:\"service_account_name\" value:\"REDACTED\"} metadata:{key:\"service_account_namespace\" value:\"testbgteut\"} metadata:{key:\"service_account_secret_name\" value:\"\"} metadata:{key:\"service_account_uid\" value:\"03ed7c22-9108-444d-8a45-e796f157942f\"} name:\"03ed7c22-9108-444d-8a45-e796f157942f\" creation_time:{seconds:1698067062 nanos:7869878} last_update_time:{seconds:1698067062 nanos:7869878} namespace_id:\"root\" local_bucket_key:\"packer/local-aliases/buckets/6\""

Currently, we have about 529,938 identity entities (as reported by the vault_identity_num_entities metric).
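
For reference, the entity count can be read from Vault's telemetry endpoint. A minimal sketch, assuming telemetry is enabled and VAULT_ADDR/VAULT_TOKEN are set:

$ # Query Vault's Prometheus-format metrics and filter for the entity count.
$ curl --silent \
    --header "X-Vault-Token: $VAULT_TOKEN" \
    "$VAULT_ADDR/v1/sys/metrics?format=prometheus" | grep vault_identity_num_entities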

To Reproduce Steps to reproduce the behavior:

  1. Configure Vault with DynamoDB as the storage backend. Add Kubernetes as an auth method.
  2. Run curl --request POST --data '{"jwt": "'$TOKEN'", "role": "REDACTED"}' https://REDACTED:8200/v1/auth/kubernetes-REDACTED/login
  3. See error.

This error is non-deterministic; it occurs from time to time for reasons we have not been able to identify.

Expected behavior

Successful login.

Environment:

Vault server configuration file(s):

N/A

Additional context

N/A

grzechukol commented 11 months ago

This issue is somewhat similar to https://github.com/hashicorp/vault/issues/8761.

npurdy-tyro commented 4 months ago

We recently experienced this error. In our case it was not limited to the auth/k8s method but extended to our auth/aws backend as well.

Just like you, it didn't happen consistently; it occurred sporadically, but it appeared to be happening more and more frequently over time.

Looking at the error, we identified that when an entity was being written to DynamoDB it was somehow exceeding DynamoDB's 400 KB item size limit. This seemed impossible to us because the entity struct is quite small.

After digging into the Vault code we made some discoveries:

Vault does not store each entity in its own storage (DynamoDB) item. Instead it maintains a set of 256 hashed buckets that map one-to-one to storage items. This is done through the StoragePacker: it places each entity into one of the 256 buckets based on the first byte of the MD5 sum of the entity ID. The entity ID is simply a UUID generated when the entity is created.
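
As a rough illustration of that bucket selection (an approximation of the Go logic in shell; assumes md5sum is available, and the UUID is the example entity ID from the log above):

# Bucket index = decimal value of the first byte of the MD5 digest (0..255),
# which matches the local_bucket_key:"packer/local-aliases/buckets/N" form in the log.
ENTITY_ID="8a122839-7344-df49-6d33-91c2af318efe"
first_byte_hex=$(printf '%s' "$ENTITY_ID" | md5sum | cut -c1-2)
echo $((16#$first_byte_hex))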

Therefore the problem affects all auth backends in Vault.

Vault also does not delete old entities, and in some cases, like ours, it just keeps creating them, so there were hundreds of thousands of them.

So what begins to happen is that these buckets fill up. The contents are compressed, so it takes a while, maybe ~600,000 entities. Once a bucket is full it's no longer usable, and any entity whose hashed ID lands in it will fail to persist to DynamoDB, producing this error. This is why the error appears randomly and without a pattern.

To 'fix' the problem you must delete entities that are no longer in use. If you enforce a max TTL and use the entities' last-accessed date, then I believe it's reasonably safe.
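
A minimal cleanup sketch, assuming the vault CLI and jq are available, the token can list, read, and delete identity entities, and last_update_time is an acceptable proxy for last access (the cutoff below is a hypothetical placeholder):

# Delete entities whose last_update_time is older than the cutoff.
# Dry-run first (echo instead of delete) before running this at scale.
CUTOFF="2023-01-01T00:00:00Z"   # hypothetical; pick a date beyond your max TTL
for id in $(vault list -format=json identity/entity/id | jq -r '.[]'); do
  last=$(vault read -format=json "identity/entity/id/$id" | jq -r '.data.last_update_time')
  if [[ "$last" < "$CUTOFF" ]]; then
    vault delete "identity/entity/id/$id"
  fi
done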

Be warned, however: if you do not fix the problem, more of the buckets will fill up until they are all full. Once they are all full you will be unable to log in to Vault at all. And if you cannot log in to Vault, you may not be able to clean up the entities, and you may need to delete items from DynamoDB directly to regain access.

Also be aware that a hostile client could leverage this knowledge to cause a denial of service against your Vault, so be sure to set appropriate rate limits to cap the speed at which they can generate entities.
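
Vault's rate limit quotas can be scoped to a path. A minimal sketch (the quota name, mount path, and rate are placeholders):

# Cap requests to the Kubernetes auth mount at 10 per second.
vault write sys/quotas/rate-limit/k8s-login \
    path="auth/kubernetes-REDACTED/" \
    rate=10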

Moving forwards, I wonder if it would make sense to implement an identity/entity/tidy endpoint (like the PKI one) that takes a time input and deletes entities that haven't been accessed in longer than that time.
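
Purely hypothetical, since no such endpoint exists today, but modeled on PKI's tidy parameters it might look something like:

# Hypothetical endpoint -- does not exist in Vault; shown only to illustrate the proposal.
vault write identity/entity/tidy safety_buffer=720h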

In your case, @grzechukol, for Kubernetes auth entities it could be worth researching whether alias_name_source could help curb the number of entities created.
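
A sketch of what that might look like on the role (the mount and role names are the redacted ones from above; a role write replaces the role's settings, so include your existing role parameters in the same write):

# Key entity aliases by service account name instead of UID, so recreated
# service accounts reuse an existing alias instead of minting a new entity.
vault write auth/kubernetes-REDACTED/role/REDACTED \
    alias_name_source=serviceaccount_name \
    bound_service_account_names=... \
    bound_service_account_namespaces=...

Note that switching alias_name_source changes how aliases are keyed, so review the security implications before changing it on an existing mount.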