hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

documentation for workload identity with Vault and federation #20097

Open benvanstaveren opened 7 months ago

benvanstaveren commented 7 months ago

Proposal

The documentation for workload identity needs a lot of work; at the moment a bunch of key items (at least, items that I consider key) are missing, which makes it very hard to determine what, if any, impact a future upgrade will have. Most notably lacking is an explanation of how to set up workload identity across multiple federated clusters. Given that the Vault JWT auth endpoint requires you to point it back at the JWKS URL in Nomad, it seems to imply each cluster needs its own endpoint. This would make life incredibly complicated if cluster A goes tits up and you reschedule everything on cluster B, because you'll then need all the roles required on cluster A defined on the auth endpoint for cluster B as well.

As someone with 4 clusters federated together this is kind of a deal breaker at the moment.

Also lacking are decent pointers and/or info on how to migrate from the current Vault integration to the new workload identity thing; there is some info available, but it's all sort of scattered.

Use-cases

Making my life easier and letting me decide whether we're going to pin ourselves on Nomad 1.8 or not.

Lord-Y commented 7 months ago

@benvanstaveren We are on 2 regions and 2 Nomad clusters. The 2nd is declared as a secondary cluster. We created:

Our setup is working, but there are still issues with Nomad and workload identity: thousands of 403 Vault errors at around 30 minutes, as my TTL is set to 1 hour. I've made 2 rollbacks this week (4 in total). Hopefully I'll open new issues next week. You'd better wait.

benvanstaveren commented 7 months ago

@Lord-Y um, yeah, that's my point. I don't want to create 4 identical roles on 4 different auth endpoints (we have 1 vault cluster, 4 nomad clusters) - that's asking for something to be forgotten or otherwise overlooked and then we get to play the happy fun debug time to figure out why things aren't working.

I'm not concerned about 403 errors on Vault right now, I'm more concerned about clarification in the documentation that will decide whether or not we keep Nomad at all, or fork it at 1.8 and keep our own version, or (god forbid) switch to k8s. I mean, with the level of indirection workload identity seems to be requiring we may as well...

tgross commented 7 months ago

@Lord-Y let's keep any bug reports in a separate issue, please.

Hi @benvanstaveren! I've re-titled this issue to focus on the area that seems most directly in contention here.

Most notably lacking is an explanation of how to set up workload identity across multiple federated clusters. Given that the Vault JWT auth endpoint requires you to point it back at the JWKS URL in Nomad, it seems to imply each cluster needs its own endpoint. This would make life incredibly complicated if cluster A goes tits up and you reschedule everything on cluster B, because you'll then need all the roles required on cluster A defined on the auth endpoint for cluster B as well.

Agreed that we're definitely lacking in guidance here. We'll make sure we get that resolved.

In the meantime, the Nomad keyring is replicated only within a region, so Workload Identities only apply within a single Nomad region. The way to allow multiple Nomad regions to use a single Vault cluster would be to configure the public keys of every region in the Vault JWT Auth Method via jwt_validation_pubkeys.

You'll note that unfortunately this currently doesn't have a way of automatically keeping up-to-date the way the JWKS endpoint does. We're looking to resolve that in https://github.com/hashicorp/nomad/issues/19669 (cc @schmichael), and that will likely be a blocker to our deprecating the old Vault token-based workflow so that folks like you with federated clusters have an ergonomic way to operate it.
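A minimal sketch of the jwt_validation_pubkeys approach described above, assuming you have already exported each region's signing public keys as PEM strings. The function name, the region names, and the payload file are illustrative placeholders, not anything Nomad or Vault ships; only the jwt_validation_pubkeys field comes from the comment above.

```python
import json


def build_jwt_auth_config(pems_by_region):
    """Merge PEM public keys from several Nomad regions into one payload.

    pems_by_region maps a region name (hypothetical, e.g. "eu") to a list
    of PEM-encoded public keys exported from that region's keyring.
    """
    seen = set()
    merged = []
    # Sort regions so the resulting key order is stable between runs.
    for region in sorted(pems_by_region):
        for pem in pems_by_region[region]:
            normalized = pem.strip()
            # Skip blanks and keys already contributed by another region.
            if normalized and normalized not in seen:
                seen.add(normalized)
                merged.append(normalized)
    if not merged:
        raise ValueError("no public keys supplied")
    # Vault's JWT auth config takes the keys as a list of PEM strings;
    # this setting is used instead of jwks_url, not alongside it.
    return {"jwt_validation_pubkeys": merged}


if __name__ == "__main__":
    # Placeholder values; real entries would be full PEM blocks.
    payload = build_jwt_auth_config({
        "eu": ["<PEM for eu key>"],
        "us": ["<PEM for us key>"],
    })
    print(json.dumps(payload, indent=2))
```

The printed JSON could be saved to a file and written to the auth method's config path (for example with `vault write auth/<mount>/config @payload.json`, where the mount name is whatever you chose). All regions then share one mount, so roles only need to be defined once, but the payload has to be regenerated whenever any region rotates its keys.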

benvanstaveren commented 7 months ago

@tgross wouldn't it be easier to replicate the keyring? I'm not entirely up on the internals of it all, but I have a vague recollection that clusters do replicate things to each other on the ACL end (hence the authoritative_region setting); wouldn't it be possible to piggyback on that mechanism? At least that way the behaviour would be similar to other ACL-related things. At least that's what it feels like to me :) Let me know, and I can maybe make that another issue/feature request or something?

tgross commented 7 months ago

The key metadata is in Raft and so could easily use that same mechanism, but the cryptographic material is intentionally not because we shipped the initial implementation without https://github.com/hashicorp/nomad/issues/14852. Once that's done, it'd at least be a possibility.

benvanstaveren commented 7 months ago

The key metadata is in Raft and so could easily use that same mechanism, but the cryptographic material is intentionally not because we shipped the initial implementation without #14852. Once that's done, it'd at least be a possibility.

Maybe I'll create an issue for it, referring back to that issue and this one to keep it visible, because in reference to #19669 it seems to me that it would still require either Vault to have the signing keys pushed to it automatically, or an external tool to sync up the new keys when they're made available - the former being nicer than the latter, because ideally there is no external tooling or reliance on someone remembering "oh we need to update X because..." :)
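For illustration, the detection half of such an external sync tool could look roughly like the sketch below. It assumes each region's JWKS document follows the standard `{"keys": [{"kid": ...}, ...]}` layout and that every key carries a `kid`; the function names and the idea of persisting the previously seen key IDs are hypothetical, not an existing Nomad or Vault feature.

```python
import json
import urllib.request


def fetch_jwks(url, timeout=10):
    """Fetch and decode a JWKS document from the given URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def diff_key_ids(known_kids, jwks):
    """Compare previously seen key IDs with the ones in a JWKS document.

    Returns (added, removed) as sorted lists. A non-empty result means
    the keys configured in Vault are out of date for this region.
    """
    current = {key["kid"] for key in jwks.get("keys", []) if "kid" in key}
    known = set(known_kids)
    return sorted(current - known), sorted(known - current)
```

A periodic job could call `fetch_jwks` against each region's JWKS URL, run `diff_key_ids` against the IDs it stored last time, and only rewrite the Vault configuration when either list is non-empty. This is exactly the kind of external tooling the comment above would prefer to avoid; it is shown only to make the trade-off concrete.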

benvanstaveren commented 7 months ago

Okay, so I created a new issue (#20123) as a proposal for the keyring replication. I'll leave it up to the powers that be to perhaps rename/organise/de-duplicate some of this stuff :)