oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Want API endpoints for updating IdP config #3125

Open askfongjojo opened 1 year ago

askfongjojo commented 1 year ago

This will be useful for two use cases:

  1. at onboarding time: fix any misconfiguration without having to delete the silo
  2. after onboarding: there is a need to change to another provider (though the user must make sure that the users and groups already on the rack do not conflict with the new IdP)

davepacheco commented 1 year ago

Yeah. There are a couple of different things here:

One is changing some of the connection parameters (e.g., DNS name, IP address, or signing certificate), while still talking to the same (logical) IdP. I think that's not controversial and is just a question of priority.

The other is migrating to a different IdP, by which I mean one whose ids are not guaranteed to match the ones already used. This one has a bunch of tricky failure modes that are both hard to detect ahead of time and hard to resolve. RFD 234 discusses this a bit in the "Alternatives considered" section. To give an example: suppose a person's id doesn't match between the IdPs. When they log in after the switch, we'll create a second account for them. There is no way for us to detect this problem because it looks to us like a new person. Any resources they had access to in their old account may be orphaned (i.e., there may be nobody with access any more, though hopefully there's a Silo Admin who can grant privileges on them). If they're resources tied to an account (like an ssh key, though if that's the only such resource then this probably isn't a big deal), a Silo Admin may not be able to help. They might also create more of these resources in the new account before figuring out what's happened and then want them to be merged (which would be a new feature we'd have to build). This is what I mean by "hard to detect ahead of time and hard to resolve".

A much worse problem might be that the new IdP uses the same identifiers to mean different things! e.g., "admin" in the new IdP now refers to, say, the administrative staff rather than "the superuser group". Or "bob" refers to a different person.
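
To make the two failure modes above concrete, here is a minimal sketch, assuming accounts are JIT-provisioned by looking up whatever identifier the IdP asserts. `SiloUsers`, `login`, and the numeric user ids are hypothetical names for illustration, not Omicron's actual data model.

```rust
use std::collections::HashMap;

/// Hypothetical silo user table, keyed by the identifier the IdP asserts.
#[derive(Default)]
struct SiloUsers {
    by_external_id: HashMap<String, u32>,
    next_user_id: u32,
}

impl SiloUsers {
    /// JIT provisioning: return the account already known for this
    /// identifier, or create a new one. Nothing else about the person is
    /// visible here, so a changed identifier looks like a new person.
    fn login(&mut self, external_id: &str) -> u32 {
        if let Some(&id) = self.by_external_id.get(external_id) {
            return id;
        }
        let id = self.next_user_id;
        self.next_user_id += 1;
        self.by_external_id.insert(external_id.to_string(), id);
        id
    }
}
```

If the same person is asserted as "alice@example.com" by the old IdP and as "u-1001" by the new one, `login` returns two different accounts. If the new IdP uses "bob" for a different person, `login` hands that person the old Bob's account.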

> after onboarding: there is a need to change to another provider

Out of curiosity, did this one come from customer conversations? The reason I ask is that when we discussed this, we concluded that usually when big companies migrate IdPs, they preserve compatibility one way or another at the IdP. That's just my (potentially wrong) recollection of experiences reported by folks at Oxide who've worked in this area. If a customer's said they want this, it would be helpful to understand more about what they're doing so that we can figure out how best to solve the above problems.

askfongjojo commented 1 year ago

This came out of the product eng call this morning as we talked about the onboarding workflow. I have not filed this request previously because I was (vaguely) aware of some of the complications you mentioned above. So, yes, this is probably not an MVP or even MVP+1 issue. Sorry for rehashing something that has been discussed previously.

davepacheco commented 1 year ago

Not at all -- thank you for filing this! It's good to have a record and a place to discuss it. And we can certainly revisit any past decisions.

askfongjojo commented 6 months ago

The inability to fix IdP configuration issues has made first-time customer installs a bit rocky during new silo setup.

We can perhaps allow IdP modifications in a more limited way without opening the door to supporting account/provider migration:

  1. Allow IdP settings to be deleted or modified prior to user accounts existing
  2. Post users existing, allow mutating request signing certificates only
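
The two rules above could be expressed as a single check. This is only a sketch under assumptions: `IdpChange`, `Decision`, and `check_idp_change` are hypothetical names, and the user count is assumed to be something the control plane can look up per silo.

```rust
/// Hypothetical description of what a request wants to change on an IdP.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IdpChange {
    /// Modify arbitrary settings, or delete the IdP outright.
    AnyFieldOrDelete,
    /// Replace only the request signing certificates.
    SigningCertsOnly,
}

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Allow,
    Deny(&'static str),
}

/// Rule 1: before any user accounts exist, every change is allowed.
/// Rule 2: once users exist, only signing certificate changes are allowed.
fn check_idp_change(silo_user_count: u64, change: IdpChange) -> Decision {
    match (silo_user_count, change) {
        (0, _) => Decision::Allow,
        (_, IdpChange::SigningCertsOnly) => Decision::Allow,
        (_, IdpChange::AnyFieldOrDelete) => {
            Decision::Deny("user accounts already exist in this silo")
        }
    }
}
```

Keeping the decision in one function means the "users exist" condition is evaluated in exactly one place, whichever endpoint ends up calling it.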

augustuswm commented 5 months ago

Towards point 1 here, I wrote this up in a separate ticket, but I think it is worth documenting in the omicron repo as a suggestion to make the "Configuration" step an explicit state of the IdP resource:

It is common when configuring a new SAML identity provider resource that not all of the user input fields will be correct on a first try. While we have some improvements we can make on our side for defaults (oxidecomputer/omicron#3049) and guides, it would be helpful to have a way to test a SAML connection. We have seen the pattern of "create silo + idp" -> "fail to sign in" -> "delete silo" -> "repeat" both during customer installs and internal testing.

We require silo deletion currently as removing or mutating an IdP resource can result in coherency issues as we no longer can be confident that the users that come in post mutation are the same as the previously seen users. Testing SAML is also difficult as it effectively requires a user to actually sign in to their IdP.

Instead I'd like to propose a state or mode field for IdP resources. @askfongjojo suggested this in our debrief today, but prior to users being JIT provisioned from an IdP we don't have any issues with changing the configuration of the IdP. I'd like to take this a step further to suggest that an IdP resource has a state that is one of:

  • Configuration
  • Active

When in the Configuration state, an IdP resource does not perform user or group JIT provisioning. Instead it completes the SAML request + response flow and returns the identity information that would be JIT provisioned. While in the Configuration state, all fields of the IdP resource are mutable. IdP resources in the Configuration state would be able to accept a request to "activate" them, transitioning their state to Active.

An IdP resource in the Active state would behave the same as an IdP resource does today: it is immutable, cannot be removed, and performs JIT provisioning. An IdP in the Active state cannot be transitioned back to the Configuration state.
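
As a rough illustration of the proposed two-state model, here is a minimal sketch. All names (`SamlIdp`, `IdpState`, `LoginOutcome`, `set_sso_url`, and the fields) are hypothetical and not taken from the Omicron code; the only behavior assumed is what is described above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IdpState {
    Configuration,
    Active,
}

#[derive(Debug, PartialEq, Eq)]
enum IdpError {
    /// The IdP is Active, so its fields can no longer change.
    Immutable,
    /// activate() was called on an IdP that is already Active.
    AlreadyActive,
}

/// What a completed SAML login yields in each state.
#[derive(Debug, PartialEq, Eq)]
enum LoginOutcome {
    /// Configuration: report what would be provisioned, create nothing.
    DryRun { external_id: String, groups: Vec<String> },
    /// Active: the user and groups are JIT provisioned as they are today.
    Provisioned { external_id: String, groups: Vec<String> },
}

#[derive(Debug)]
struct SamlIdp {
    state: IdpState,
    sso_url: String, // stand-in for all mutable connection settings
}

impl SamlIdp {
    /// New IdP resources always start in the Configuration state.
    fn new(sso_url: &str) -> Self {
        SamlIdp { state: IdpState::Configuration, sso_url: sso_url.to_string() }
    }

    /// Fields are mutable only while in Configuration.
    fn set_sso_url(&mut self, url: &str) -> Result<(), IdpError> {
        match self.state {
            IdpState::Configuration => {
                self.sso_url = url.to_string();
                Ok(())
            }
            IdpState::Active => Err(IdpError::Immutable),
        }
    }

    /// One-way transition. There is deliberately no method going back.
    fn activate(&mut self) -> Result<(), IdpError> {
        match self.state {
            IdpState::Configuration => {
                self.state = IdpState::Active;
                Ok(())
            }
            IdpState::Active => Err(IdpError::AlreadyActive),
        }
    }

    /// Called after the SAML response has been validated.
    fn handle_login(&self, external_id: &str, groups: &[&str]) -> LoginOutcome {
        let external_id = external_id.to_string();
        let groups = groups.iter().map(|g| g.to_string()).collect();
        match self.state {
            IdpState::Configuration => LoginOutcome::DryRun { external_id, groups },
            IdpState::Active => LoginOutcome::Provisioned { external_id, groups },
        }
    }
}
```

In this sketch the "cannot go back to Configuration" rule is enforced simply by not offering such a transition, and a test sign-in in the Configuration state returns the asserted identity without touching any user records.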

augustuswm commented 2 weeks ago

We are coming up on a SAML certificate rotation (November 18) for our Google-connected internal silos on both dogfood and colo. I reread RFD 234 and it suggests that we considered endpoints for certificate rotation. I presume that is still the thinking here? (@inickles).

I am sure we can find a way to work around this internally, but I wanted to bump this up because it can come up quickly for a customer if they are using a SaaS IdP where they do not control the certificates.