Possible Azure AD "outage" with specific usage of azuread_service_principal

DMoenks commented 1 year ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritise this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritise the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureAD Provider) Version

Terraform v1.4.1
on windows_amd64
\+ provider registry.terraform.io/hashicorp/azuread v2.39.0
\+ provider registry.terraform.io/hashicorp/azurerm v3.64.0

Affected Resource(s)

data azuread_service_principal
resource azuread_service_principal

Terraform Configuration Files

data "azuread_service_principal" "service_principal_exchange_online" {
  # Provider: Office 365 Exchange Online
  application_id = "00000002-0000-0ff1-ce00-000000000000"
}

resource "azuread_service_principal" "service_principal_exchange_online" {
  # Provider: Office 365 Exchange Online
  application_id = "00000002-0000-0ff1-ce00-000000000000"
  use_existing   = true
}

Debug Output

Panic Output

Expected Behavior

The documentation better shouldn't suggest using the use_existing argument for resource azuread_service_principal with first-party applications by using phrases like [...]behaviour is useful[...] and [...]deletion fails (as it often the case for first-party Microsoft applications), but rather should include a warning to carefully handle first-party applications and suggest using data azuread_service_principal instead.
An information box titled Caveats doesn't really seem to emphasize the possible outcome either.

Actual Behavior

Using resource azuread_service_principal with use_existing = true as an argument will actually succeed in deleting some first-party Microsoft applications from Azure AD when running terraform destroy, potentially breaking other first-party and third-party applications in its wake, e.g. a Microsoft Exchange hybrid configuration between Exchange Online and Exchange server.

Steps to Reproduce

terraform apply
terraform destroy

Important Factoids

References

manicminer commented 1 year ago

Hi @DMoenks, thanks for raising this. I agree that this implementation is not ideal and we could probably explain this better in the docs. When you adopt a resource with Terraform, you should be prepared to manage its lifecycle with Terraform, including its eventual destruction, but in this case the resource is acting a little bit like a data source and doesn't force you to import, so I can definitely understand the ambiguity.

Though I would ask why you say this gotcha also applies to the data source?

DMoenks commented 1 year ago

Though I would ask why you say this gotcha also applies to the data source?

It actually doesn't apply to the data source. I just mentioned both of them to have the complete picture in the issue, for anyone taking a look at just the issue, although I maybe should have provided a more elaborate description...

I also would like to clarify that I think the way this is implemented might very well have its use cases here and there, so a more noticeable or explicit warning in the documentation might already do the trick

manicminer commented 1 year ago

Thanks for clarifying. For context, the current behavior was implemented to work around the issue of mixed behavior with principals for first party applications - some are preinstalled in the tenant, depending on the age of the tenant, some can be deleted, others not etc. Longer term I think we may want some form of provider-wide "registration" for these first party apps, similarly to how we register resource providers in the AzureRM provider.

jcetina commented 11 months ago

FWIW, we just got bit by this issue by accidentally deleting the service principal for an important first party app we had imported. This is a pretty significant footgun waiting to happen I think. It took us a while to figure out what had happened and how to restore it (re-registering the affected provider).

One of the things that can happen (which we experienced) was when we rolled forward and then reverted the azure ad terraform provider version from 2.31 to 2.40 and back. The older version uses a different environments library to provide the first party apps list than the environments library for 2.40. That means if you deploy then roll back, you'll add some new first party apps since the list in the go-azure-sdk library is longer, and then remove them because the old list is shorter.

manicminer commented 11 months ago

@jcetina Great feedback, thanks for reporting your experience. Whilst downgrading providers is not something we technically support, I can appreciate this would cause some consternation when it comes to principals for first-party apps.

We will look at improving the UX substantially here, both to make it easier to register first-party apps and also to safeguard against their removal.

jcetina commented 11 months ago

@jcetina Great feedback, thanks for reporting your experience. Whilst downgrading providers is not something we technically support, I can appreciate this would cause some consternation when it comes to principals for first-party apps.

We will look at improving the UX substantially here, both to make it easier to register first-party apps and also to safeguard against their removal.

Yeah. I knew that version downgrades wouldn't generate the best response, but these things happen in the wild. Glad the feedback helped. For completeness, we deleted the service principal for the "Azure Traffic Manager and DNS" first party app, which is basically needed by the Network Resource Provider. The fix was re-registering that provider. It would be good to know what apps correspond with what providers, etc, and that the procedures to identity recover from an accidental service principal delete/rotation are discoverable. Not sure that's your problem, but just wanted to say it out loud.

hashicorp / terraform-provider-azuread