Observing flakiness with oidc-token using identity.entity.aliases.<mount accessor>.metadata.<metadata key>

sganapathy1312 commented 3 years ago

Describe the bug

Sometimes custom claims populated via identity.entity.aliases.<app_role_mount accessor>.metadata.<metadata key> , where metadata is stamped on the secret-id of the aprole, don't get populated. There is no pattern on when this happens. Sometimes it happens on a successive curl request to the token endpoint exposed by the oidc-role using the same vault token that returned an oidc token with all claims correctly set. Sometimes, it happens on a brand new token obtained by logging in using approle creds. Once it happens, successive iterations do not populate those claims at all unless a new vault token is obtained.

To Reproduce

Cannot describe a deterministic way to repro as this problem doesn't appear to occur in a deterministic manner. Instead will try our best to describe our setup

Overview We have a custom oidc-token template to identify our agent pollers looking for work from a central server. All pollers use the same approle but each poller has its own secret-id. This secret-id is stamped with a meta data for uniquely identifying the poller. For ex. Poller-A could have poller-a.json as its secret's meta data

{
  "metadata": "{ \"resource_id\": \"foobar\" }"
}

and its secret created and propagated via curl --header "X-Vault-Token: ..." --request POST --data @poller-a.json https://my-vault-endpoint/my-approle-role-url/secret-id. Similarly, we have a poller-b, whose metadata could look like

{
  "metadata": "{ \"resource_id\": \"barbaz\" }"
}

and its secret-id could have been generated by curl --header "X-Vault-Token: ..." --request POST --data @poller-b.json https://my-vault-endpoint/my-approle-role-url/secret-id

Details

Login to vault using one of the poller's approles vault write <my-app-role-path> role_id=... secret_id=...
Verify that the vault-token retrieved from above has token_meta_resource_id <my-resource-id>

Send a curl request to the oidc-role's token endpoint stamped with the vault token from above curl --header "X-Vault-Token: ..." https://<my-vault-endpoint>/v1/identity/oidc/token/<my-role>

<My role> is defined in tf such that there is one per approle mountpoint and we have one approle mountpoint per env stages. In broad strokes,

resource "vault_identity_oidc_role" "my-svc" {  
for_each = local.my-svc-envs
name     = "my-svc-${each.key}"
key      = vault_identity_oidc_key.key[each.key].name
ttl        = my-ttl  
template = XXX;EOF
{  
<other claims>
"entity_data": { 
<other claims>   
"resource_id": {{identity.entity.aliases.${vault_auth_backend.app[each.key].accessor}.metadata.resource_id}}  }}  
EOF}

Now decrypt the received oidc-token via jwt decode <my-token>

Repeat the above test in various flavors

Round robin across poller-A/poller-B vault approles.
Repeat the curl for fetching the oidc-token and the jwt decode for the same token which had previously returned a valid resource_id Additionally, performed this test as a golang program which runs all steps above using jose library for decoding the oidc and creating vault clients from scratch every 30 seconds.

The manual tests showed quite a few instances of resource_id not getting populated and there was no pattern to when it occured. Sometimes it would occur for a token for which resource_id previously got populated. Sometimes it occurred for a brand new token

golang tests are run for just one of the pollers in tight loop and every 30 seconds. When run on a tight loop, we saw 3 failures out of 4000 odd. 30 second interval test is ongoing. No failures so far after 36 iterations.

Expected behavior Expect the resource_id to be set every time

Environment:

Vault Server Version (retrieve with vault status):

<redacted>@<redacted> ~ % vault status
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           5
Threshold              3
Version                1.6.2
Storage Type           postgresql
Cluster Name           vault-cluster-140e09ca
Cluster ID             62dba0b3-dedb-1844-1a16-e873f9cf4847
HA Enabled             true
HA Cluster             https://vault.vault:8201
HA Mode                standby
Active Node Address    https://vault.vault:8200

Vault CLI Version (retrieve with vault version):

<redacted>@<redacted> ~ % vault version
Vault v1.6.0 ('7ce0bd9691998e0443bc77e98b1e2a4ab1e965d4+CHANGES')

sganapathy1312 commented 3 years ago

Poller-A and Poller-B have the same role-id. They only differ in their secrets with each secret-id having its own metadata

sganapathy1312 commented 3 years ago

Reading the code, one thing that struck me was given that all our pollers use the same approle { and hence the same entity }, could it be possible that the entity that is retrieved via txn.First from the entities table in the line linked above somehow return an entity whose secret-id didn't end up setting the resourceId field.

This is just me speculating based on reading through the code. Looking forward to your responses.

sganapathy1312 commented 3 years ago

The golang based test further lends credence to the above theory.

I basically ran 2 go routines one running poller-A and the other running Poller-B. The test is super simple, run 2 go routines that implement the above steps

Login to vault using the respective approles.
Fetch the oidc-token
Decrypt it using jose

Instead of seeing 2 distinct resource_ids we just see 1 most of the time in our stdout logs.

At this point, I'm 70% sure this is the bug. Could you please confirm the same? We thought we get by with using 1 approle and 1 secret-id per poller. But looks like we need 1 approle per poller.

peterverraedt commented 3 years ago

I'm encountering the same bug in a different scenario, where a single approle role_id is used to create secret_ids for different hosts, to run a configuration management tool on each host reading secrets from vault using a token with its own secret_id. To restrict reading of certain paths in vault, we stamp metadata identifying the host on each secret_id, and we assign the role_id to a policy where this metadata is used. When running our configuration management tool on multiple servers at the same time, it seems that these policies are not correctly applied.

The following steps reproduce this error:

create approle role_id with policy X
create secret_id A with metadata {"key": "A"}
create secret_id B with metadata {"key": "B"}
create policy X to allow read permissions on secret/{{identity.entity.aliases.APPROLE ACCESSOR.metadata.key}}/*
create token A with secret_id A
read secret/A/test successfully (no 403 error) with token A
create token B with secret_id B
read secret/B/test successfully with token B
read secret/A/test with token A results in a 403 error
read secret/B/test with token A successfully

sganapathy1312 commented 3 years ago

Checking in to see about this and https://github.com/hashicorp/vault/issues/11803

pmmukh commented 3 years ago

Hi all! Thanks for submitting this issue, and figuring out the root cause for it! For the use case of multiple secretIDs on the same role with access to different paths, that is actually not really supported, in the sense that one RoleID is expected to correspond to one entity and one entity alias. The use cases mentioned here, we do think is a bit of an anti-pattern, as different secret paths should necessitate different role-ids. That being said, we do support secretIDs having metadata that is applied to the entity alias level, and is something that our users ( y'all ) are using. And given this is a possible way to use Approles, this is also a valid bug resulting from that flow. As such, we do want to figure out a way forward from here, so that the alias metadata templating behaves in a more deterministic way. The linked PR is not really an approach we want to take, as that creates multiple entity aliases per approle, which breaks from our current mapping model, and so will be closing that PR out. I'lll be marking this issue as internal since I think we want to figure out and implement this fix internally.

Also as a question for the people affected by this problem, is there any specific reason to not use multiple approles in this scenario, instead of 1 approle and multiple secrets ? I understand management overhead is one reason to not, but interested to hear if there are any others.

nvx commented 3 years ago

If https://github.com/hashicorp/vault/pull/10682 is merged, then the secret-id metadata could simply be added to the token metadata and then used in identity templating via that mechanism.

Single approle multiple secret ids makes sense logically as a easy way to group N servers in X role. This is especially useful for clustered services where you may have a lot of instances of what's effectively the same server. Allowing metadata to be defined per secret id allows defining the hostname of the box as metadata, which then can be used in eg the PKI secrets engine to allow that server to get a certificate for its specific hostname.

A single approle for this sort of use-case also means provisioning of policies a role has access to (which changes relatively infrequently and is a very privileged operation) and where it can be used (I use CIDRs on the approle to limit eg production approle tokens to only be able to be used by things on the production network range) can be separated from provisioning secret-ids which by necessity happens a lot more often (ie, whenever spinning up a new instance of a server and is done by more automated tools like Terraform/Atlantis/Jenkins/etc). The capability to use these sorts of safeguards as well as separation of privilege (between BAU scaling up and down of services, vs changing privileges that a system has access to) are very important in most organisations.

I do like the idea of not having to have an individual entity for each server (which would be the case having multiple approles or what the PR mentioned earlier would have caused) since this would cause a lot of entities to get created that are mostly meaningless and are not automatically cleaned up if eg the approle or secretid are deleted (eg when destroying the specific server).

For more static infrastructure it's less big of a deal, but when you're running immutable infrastructure by necessity you're spinning up and down VMs a lot, so management overhead and safeguards matter a lot.

pmmukh commented 3 years ago

Hi @nvx ! Thanks for the detailed explanation of use cases, its super valuable as we figure out how to solve for this! For the usage of Approles, I believe we are aligned in how they should be distributed, as in 1 application should have 1 role-id, with each server instance having a separate secret-id. And that the role-id i.e the application should control the priveleges of all the servers of the application. One use case mentioned here that I didn't particularly understand, was the accessing of different secret paths in the repro steps here https://github.com/hashicorp/vault/issues/11798#issuecomment-876295754, cause my understanding is that different instances of an app should access the same paths in Vault, so while those repro steps do show a bug I'm not sure why that access pattern would happen. The other use cases I think we agree is suboptimal right now, where the instances need custom claims on their jwt tokens, or custom certs from the pki engine, and want to do templating for that. That's something that I believe will currently require 1 role per instance. As far as the suggested solution goes, that approach does look interesting. I did find that in the past we have been opposed to this path https://github.com/hashicorp/vault/issues/5916, but I'm going to follow up with the team and see where we stand on it today / if anything has changed regarding that. This is all great insight into the issue, so thanks a ton!

nvx commented 3 years ago

One use case mentioned here that I didn't particularly understand, was the accessing of different secret paths in the repro steps here #11798 (comment), cause my understanding is that different instances of an app should access the same paths in Vault, so while those repro steps do show a bug I'm not sure why that access pattern would happen.

Identity templating can be used in other places than request URLs, but the case that stands out for me where you would want it in a request URL would be when storing per-server secrets in Vault (often clustered servers have the same sets of secrets across all servers, but some things for eg sidecar services that run on the host for other management purposes/etc than the actual clustered application itself) - I personally use this pattern to store secrets in the KVv2 secrets engine with the hostname of the server in the path go figure.

pmmukh commented 3 years ago

Aha sidecar services is an interesting pattern, and one I didn't think about, I can see how secretid metadata templating would be useful to solve problems with that, will be sure to keep this in mind as we think about this!

joeyberkovitz commented 3 years ago

@pmmukh - any chance there have been any updates related to this? Are there any plans to support the templating use case - most specifically in the case of the PKI engine? I think it would be fairly inconvenient to have to create one AppRole per host if every host will require exactly the same Vault access with the only variation being what hostname they're allowed to use in their requests.

If this won't be supported, please adjust the documentation to indicate that either an AppRole role should only be used by consumers that require exactly the same permissions or that the secret-id must always be generated with the same metadata for a given role. If this will be the case, it may just make more sense to move the metadata to the role to avoid any further confusion.

UXabre commented 3 years ago

I continued my work and effort on merging PR https://github.com/hashicorp/vault/pull/10682 but my use-case is mainly with OIDC/JWT backends but I guess that anything that supports mapping metadata on the token could be used (I'm just not too familiar yet with other auth backends, sorry)

heatherezell commented 7 months ago

Hi folks, is this still an issue with newer versions of Vault? Please let me know and I can bubble it up accordingly. Thanks!

tsipinakis commented 2 months ago

@heatherezell I ran into this issue as well - it is not clear from the documentation that the metadata field in the secret-ids are not meant to be used for path templating. The fact that they are supported explicitly in hcl encourages it. From reading the docs I had expected the behaviour that one secret-id/metadata combination creates a separate entity and it was even more misleading in the fact that it worked until it didn't when something conflicted.

In my opinion there are 2 ways to take this issue:

~~Make it clear this is not supported by removing support for using this metadata in hcl~~ Move the metadata from the secret to the role
Embrace it and add an option to use the accessor as the alias name

ssams commented 1 week ago

I also just stumbled into this, via https://github.com/hashicorp/vault/issues/12797 - basically got the same scenario, and assumed I could create one approle, then add multiple secrets with different metadata and use templated policies to restrict access for each secret (which represents different machines/deployments in my case). Only after running into errors I noticed via token lookups that all of them share the same entity ID and alias, completely breaking the expectation of being able to use the secret's metadata.

Not sure how to workaround this without creating one approle per deployment instead? which would lead to a lot of approles, which doesn't seem like it's the intended way either? I'd expect to simply be able to use secret/token metadata in the policies if one can assign it.

hashicorp / vault

Observing flakiness with oidc-token using identity.entity.aliases.<mount accessor>.metadata.<metadata key> #11798