Inconsistent behaviour when specifying `subscription_id` attribute in provider block for new Subscriptions

hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager

https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs

Mozilla Public License 2.0

4.47k stars 4.55k forks source link

Inconsistent behaviour when specifying `subscription_id` attribute in provider block for new Subscriptions #17979

Closed krowlandson closed 2 days ago

krowlandson commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Community Note

Please vote on this issue by adding a :thumbsup: reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

1.2.7

AzureRM Provider Version

3.18.0

Affected Resource(s)/Data Source(s)

azurerm_subscription

Terraform Configuration Files

# Base provider, used to read the billing account and create the new Subscription
provider "azurerm" {
  features {}
}

# Data resource to get the billing_scope value
data "azurerm_billing_enrollment_account_scope" "example" {
  billing_account_name    = local.billing_account_name
  enrollment_account_name = local.enrollment_account_name
}

# Create the new Subscription (uses the Microsoft.Subscription/Aliases provider)
resource "azurerm_subscription" "example" {
  subscription_name = "My Test Subscription"
  alias             = "new-sub-test-001"
  billing_scope_id  = data.azurerm_billing_enrollment_account_scope.example.id
}

# Define a secondary provider with an alias, configured to target the newly created Subscription
provider "azurerm" {
  alias           = "new_sub"
  subscription_id = azurerm_subscription.example.subscription_id
  features {}
}

# Create a Resource Group in the newly created Subscription
resource "azurerm_resource_group" "example" {
  provider = azurerm.new_sub
  name     = "new-sub-test"
  location = "West Europe"
}

Debug Output/Panic Output

# see below

Expected Behaviour

When trying to create a new Azure Subscription and then cross-referencing the Subscription ID of the new Subscription into a provider block we expect to see consistent behaviour regardless of the authentication method used.

Actual Behaviour

When authenticating with a Service Principal the initial terraform plan throws the following error:

╷
│ Error: building AzureRM Client: 1 error occurred:
│       * A Subscription ID must be configured when authenticating as a Service Principal using a Client Secret.
│
│
│
│   with provider["registry.terraform.io/hashicorp/azurerm"].new_sub,
│   on main.tf line 20, in provider "azurerm":
│   20: provider "azurerm" {
│
╵

We believe this to be correct, as at this point the new Subscription ID is unknown and therefore cannot be correctly set on the provider.

When we authenticate using the Azure CLI the initial terraform plan is successful, but we get the following error during terraform apply:

╷
│ Error: building AzureRM Client: obtain subscription(00e960fd-76a5-410b-9391-877383a5974f) from Azure CLI: parsing json result from the Azure CLI: waiting for the Azure CLI: exit status 1: ERROR: Subscription '00e960fd-76a5-410b-9391-877383a5974f' not found. Check the spelling and casing and try again.
│ 
│   with provider["registry.terraform.io/hashicorp/azurerm"].new_sub,
│   on main.tf line 20, in provider "azurerm":
│   20: provider "azurerm" {
│
╵

We believe in this case that the provider is incorrectly accepting a null or empty string input on the subscription_id attribute of the provider block and instead defaulting to the current default Subscription associated with the Azure CLI authenticated user.

This is incorrectly giving customers the illusion that they can create a new Subscription and then deploy resources straight into it within a single apply of a single root module.

Steps to Reproduce

Create a configuration matching the example, providing valid values for access to the billing account for Subscription create
Authenticate using a Service Principal with suitable permissions
terraform init (should work)
terraform plan (should throw error)
Authenticate using Azure CLI with suitable permissions
terraform init (should work)
terraform plan (should work)
terraform apply (should create new Subscription and then throw error)
terraform apply (should throw the following error)
az account list --refresh (to refresh the list of available Subscriptions)
May need to run az account set to select a subscription
terraform apply (should work the second time as the Subscription now exists!)

Important Factoids

n/a

References

I believe this issue identifies the underlying reason why customers are trying to create a new Subscription and then deploy resources to it, only to face issues which surface as issues such as:

14863
17541

We have successfully worked around this situation with our lz-vending module, but this was only made possible as the new AzAPI provider has an option to set the parent_id for any resource being created,

We believe the addition of a new optional parent_id argument for all resources in the azurerm provider would allow customers to deploy resources across multiple Subscriptions without hitting this issue (or the need to provide multiple aliased providers).

krowlandson commented 1 year ago

For additional clarity, what we are initially asking is for the same checks to be put in place so the provider behaves consistently when authenticating with either the Azure CLI (i.e. user scenario) or a Service Principal / Managed Identity.

Currently a user can configure a provider block with subscription_id = "" # i.e. empty string without Terraform raising an error when authenticating with Azure CLI. This is because with the Azure CLI, the provider appears to default to using whichever Subscription is currently set within the Azure CLI user session.

We believe this to be incorrect, or at least a poor user experience which could also theoretically result in resources being created in the wrong Subscription in some scenarios.

magodo commented 1 year ago

@krowlandson Thank you for raising this together with the clear description!

Let me first talk about the Azure CLI auth method. As is documented, users are expected to specify the default subscription prior to auth via CLI in TF. The document clearly states that if the subscription_id is not set, the default subscription selected by the CLI will be used. Otherwise, users shall set the subscription_id explicitly. In the latter case, if the specified subscription id is not found, TF will raise error: https://github.com/hashicorp/terraform-provider-azurerm/blob/0357a5709aa5a18afbcb37d2c263acbeb2216a36/vendor/github.com/hashicorp/go-azure-helpers/authentication/auth_method_azure_cli_token.go#L288-L303

Personally, I think this is a short hand for CLI auth method, which saves the users to specify the subscription_id. Whilst, I also agree it makes sense to always explictly set the subscription_id to avoid any surprise.

Regarding the error you got when apply via the CLI auth, it seems due to the missing refresh on the account list. For this I'll create a PR to fix.

krowlandson commented 1 year ago

Thank you @magodo… I can see how the PR you raised will help remove the need to manually refresh the account list, and I assume this will help a user running the above code when using CLI auth.

However, will this further worsen the scenario where a customer writes some code which “works on my workstation” and then cause frustration when the same code doesn’t then work in a pipeline using SPN auth?

Also, I completely agree with the subscription ID selection process for when a customer hasn’t explicitly set a value for subscription_id, but I think we should also catch situations where subscription_id is specified in the provider block but without a valid value. This would then give consistent behaviour then between CLI auth and other auth methods.

magodo commented 1 year ago

@krowlandson I agree we should differ between whether subscription_id is absent versus subscription_id is set to empty string. @tombuildsstuff What's your thought on this?

magodo commented 1 year ago

As mentioned above, given an empty subscription_id the provider will hit error when authenticated using SP during the plan stage, but not when using the CLI auth (as it will use the default subscription in this case). Ideally, either case should work in @krowlandson's config, as the second provider is referencing a known-after-apply attribute, which Terraform is able to handle it correctly:

provider "azurerm" {
  alias           = "new_sub"
  subscription_id = azurerm_subscription.example.subscription_id
  features {}
}

The behavior about how provider behaves when referencing such attribute is explained here (though this is for the plugin framework, but it should also apply to the plugin sdk v2). As is stated in the document:

You can choose how your provider handles this. If some resources or data sources can be used without knowing that value, it may be worthwhile to emit a warning and check whether the value is set in resources and data sources before attempting to use it. If resources and data sources can't provide any functionality without knowing that value, it's often better to return an error, which will halt the apply.

For this provider, IMO it should behave correctly without a subscription id during the plan stage: For existing resources, the subscription id can already be deduced by the resource id; For new resources, there is nothing to be done during plan stage.

Unfortunately, it is a non-trivial work to make the unknown value work for the azurerm provider at this moment. That is because:

The provider is currently heavily based on the Azure SDK track1. In this SDK, each client has its subscription defined (example). The provider will initialize all the clients (each one corresponds to a certain resource type) at the configuration phase, then each TF resource will reference the corresponding client in their CRUD functions. So each resource type client plays as a singleton in the provider wide, this brings a constraint that these singleton client should be stable regarding the embeded subscription id.
The provider configuration phase requires to make use of some certain clients. E.g. it will use the resource client to register the RPs. This requires the subscription id has to be known at plan stage.

For the point 1, I think after we totally migrate to the hashicorp/go-azure-sdk, it should be resolved. For the point 2, I think we can further balance whether we want the dynamic provider config more than those additional features during provider configuration phase.

manicminer commented 1 year ago

Thanks for raising this @krowlandson and thanks both for the nuanced conversation. I think there's merit in the idea of requiring a subscription_id to always be set regardless of the authentication method. If this were to be considered, it would naturally be considered a breaking change and could only reasonably be enforced in the next major version of the provider.

That aside, I'm not sure we can reliably compare between an unset subscription ID and a blank subscription ID - mainly because we also source this from an environment variable and so the unset value is nearly always an empty string AFAICT.

manicminer commented 1 year ago

A further thought on this - what is the value of having the subscription_id be unknown until apply time? Wouldn't that cause the plan to be inaccurate/unusable?

tombuildsstuff commented 1 year ago

@manicminer yeah agreed, I think we should probably look to make subscription_id required in 4.0 to alleviate the confusion.

There's been a number of issues where folks have been confused about this behaviour (by having resources provisioned in the wrong subscription or otherwise) - and whilst this behaviour made sense within CloudShell (for learning purposes) - it doesn't really make sense outside of that environment, so I think it's worth making this field required going forward.

halvoroppen commented 1 year ago

I am experiencing the same issue as originally described above. In my case it seems like the error is related to service principal authentication in combination with remote backend for state management.

I’m running with terraform version v1.3.9 and azurerm provider 3.46.0

To prove this, I have tested tree scenarios:

Interactive az cli authentication with remote backend configured: I’m able to successfully plan and apply with remote backend configured. Subscription is created, and resource group is added into the newly created subscription.

Service principal authentication with remote backend configured: Terraform plan throws an error: │ Error: building account: unable to configure ResourceManagerAccount: subscription ID could not be determined and was not specified │ │ with provider["registry.terraform.io/hashicorp/azurerm"].newsub, │ on test.tf line 24, in provider "azurerm": │ 24: provider "azurerm" {

Service principal authentication with local backend (no backend configured) I’m then able to both run plan and apply successfully. Result, subscription is created, and resource group added into newly created subscription.

Based on this testing, it seems there is an issue with the combination of remote state and service principal authentication. Can this please be investigated?

manicminer commented 1 year ago

@halvoroppen We don't recommend creating new a subscription and adding resources to that subscription in the same module or apply operation. The reason it works with Azure CLI authentication is that the subscription ID is sourced at plan time - which means the plan is working against the wrong subscription - and because the correct subscription is yet to be created, the plan happens to be somewhat correct. But this is just coincidence. When authenticating as a service principal without specifying an existent subscription ID, the plan fails because the subscription ID cannot be determined. This is the expected behaviour.

The coincidental behaviour of planning with the wrong subscription ID when using CLI authentication, is likely to change in version 4.0 of the provider, and as such as don't recommend relying on this undocumented behaviour.

halvoroppen commented 1 year ago

@manicminer, Thank you for your response. This might not be recommended, and yes, we can split the deployment into separate deployments even if that would complicate our use case. Having a single deployment is really something we would like to be supported, a single deployment make sense, and will ease deployments and our use case! Hopefully this is functionality that will have greater support in coming versions, and not less.

As stated above, the deployment runs successfully even with service principal authentication, as long as you run without a remote backend. If I run with local backend, the plan and apply stage runs successfully. If I specify a remote backend, it fails.

I have done some additional testing, it seems that the plan is not checking against the current/default subscription, there is no signs for this in the output (or in the trace log):

terraform plan
data.azurerm_billing_mca_account_scope.this: Reading...
data.azurerm_billing_mca_account_scope.this: Read complete after 0s [id=/providers/Microsoft.Billing/billingAccounts/bxxx/billingProfiles/xxx/invoiceSections/xxx]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # azurerm_resource_group.rg will be created
  + resource "azurerm_resource_group" "rg" {
      + id       = (known after apply)
      + location = "norwayeast"
      + name     = "test"
    }

  # azurerm_subscription.this will be created
  + resource "azurerm_subscription" "this" {
      + alias             = "tf-test-1"
      + billing_scope_id  = "/providers/Microsoft.Billing/billingAccounts/xxx/billingProfiles/xxx/invoiceSections/xxx"
      + id                = (known after apply)
      + subscription_id   = (known after apply)
      + subscription_name = "tf-test-1"
      + tenant_id         = (known after apply)
    }

  # null_resource.refresh_subs will be created
  + resource "null_resource" "refresh_subs" {
      + id = (known after apply)
    }

  # time_sleep.sub_create will be created
  + resource "time_sleep" "sub_create" {
      + create_duration = "10s"
      + id              = (known after apply)
    }

Plan: 4 to add, 0 to change, 0 to destroy.

Just as an additional verification, I have tried to add the resource group (test) into the default subscription just to see if that changed this behaviour, it did not – this is as expected since I can’t see any trace towards the default subscription that terraform is checking if that resource exists.

So, based on this testing, it seems like the verification of the resource group is not happening towards the default subscription. The issue only occurs if I specify a remote backend in combination with service principal authentication, so I assume there must be an issue related to this combination.

For your reference, I’m using this code to illustrate the issue:


terraform {
  required_version = ">=1.3.0"
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=3.46.0"
    }
    time = {
      source  = "hashicorp/time"
      version = "0.9.1"
    }
  }
  # If I add a backend config (unncomment the below), and specify credentials for the backend, the plan and apply will fail
  # backend config:
  # backend "azurerm" {
  #   resource_group_name  = ""  # Fill in required info
  #   storage_account_name = ""  # Fill in required info
  #   container_name       = ""  # Fill in required info
  #   key                  = ""  # Fill in required info
  #   subscription_id      = ""  # Fill in required info
  #   tenant_id            = ""  # Fill in required info
  # }

  #Set these environment variables to enable authentication to Azure backend using service principal 
  # export ARM_SUBSCRIPTION_ID=""
  # export ARM_CLIENT_SECRET=""
  # export ARM_TENANT_ID=""
  # export ARM_CLIENT_ID=""
}
provider "azurerm" {
  alias           = "newsub"
  subscription_id = azurerm_subscription.this.subscription_id
  features {
  }
}

provider "azurerm" {
  features {
  }
}

variable "billing_account_name" {
  description = "Billing Account Name for MCA account (guid)"
  type        = string
}
variable "billing_profile_name" {
  description = "Billing Profile ID"
  type        = string
}
variable "invoice_section_name" {
  description = "Invoce Section ID"
  type        = string
}
variable "subscription_name" {
  description = "Name of the new subscriptoin to create"
  type        = string
}

data "azurerm_billing_mca_account_scope" "this" {
  billing_account_name = var.billing_account_name
  billing_profile_name = var.billing_profile_name
  invoice_section_name = var.invoice_section_name
}

resource "azurerm_subscription" "this" {
  subscription_name = var.subscription_name
  billing_scope_id  = data.azurerm_billing_mca_account_scope.this.id
  alias             = var.subscription_name
}

resource "azurerm_resource_group" "rg" {
  depends_on = [
    null_resource.refresh_subs
  ]
  provider = azurerm.newsub
  name     = "test"
  location = "norwayeast"
}

resource "time_sleep" "sub_create" {
  depends_on = [
    azurerm_subscription.this
  ]
  create_duration = "10s"
}

resource "null_resource" "refresh_subs" {
  depends_on = [
    time_sleep.sub_create
  ]
  provisioner "local-exec" {
    command = "az account list --refresh"
  }
}

rbtcollins commented 10 months ago

Forgive my commenting without enough context, but I don't understand what prevents the provider treating the unknown value correctly.

That is:

check for Value.IsFullyKnown() / Value.IsKnown() on the provider parameters - I quote A common pattern is to use the Value.IsKnown() method to confirm that a value is known, then to use the Value.As() method to retrieve the underlying data for use..
when provider parameters are unknown, propogate that to all data sources in ModifyPlan and similar verbs by returning UnknownValue
when the attribute is populated, start answering requests

That might still have some edge cases - but it should meet the following behaviours that seem to be expressed as desirable:

a null, empty, or missing subscription_id can still fall back to the azcli or other lookup methods
interpolating a value into the provider configuration should cause all the resources and data sources defined on the provider to be unusable until the interpolated value is available (that is, after CreateResource on the subscription itself has called Set during its CreateRequest.

manicminer commented 10 months ago

@rbtcollins Thanks for commenting. Unfortunately the issue is not so much that we have difficulty determining the subscription ID, it is that the subscription ID is fundamentally unknown at plan time (in the original configuration provided for this issue). Because the subscription does not yet exist, we therefore cannot proceed with a plan.

The only reason that it does currently work when using az-cli authentication, is because the provider falls back to the default subscription provided by az-cli. In this instance, we actually don't want to fall back to the subscription ID provided by az-cli, because we're planning with the wrong subscription which is contrary to expectations and a breach of the configuration. We'll most likely prevent this situation from arising, by making subscription_id a required provider property, in v4.x of the provider - since, correctly or not, this will be a breaking behavioral change. It will still be perfectly possible to interpolate a subscription ID (as indeed it is currently), as long as it is known at plan time.

rbtcollins commented 9 months ago

https://github.com/hashicorp/terraform/issues/30937 has some relevant notes here that can explain some of the cognitive dissonance (where e.g. I and others make comments that don't align with the expectations of the devs).

In particular from that bug:

there was an old provider SDK that didn't expose Unknown values semantically, leading to an interpolated string (like we're dealing with in the subscription_id use case) being evaluated as an unset (or NULL) value. If the azurerm is written using that old SDK, then it won't be able to process the difference between unset and unknown, which would lead to the observed behaviour here.
The newer 'Provider Framework' corrects this, but still doesn't explicitly describe how to handle Unknown's, leading to e.g. the awscc provider erroring on Unknowns.

That expresses that there are some sharp edges in Terraform as a framework today.

However, it also seems that using the newer provider framework, it is possible - for instance the kubernetes provider handles this situation very elegantly: one can have a single terraform root module that describes:

The provisioning of an EKS/AKS/GKE kubernetes cluster via the relevant vendor-specific provider
Resources in the resulting kubernetes cluster, using a kubernetes provider where the API endpoint, server certificate and so on are all interpolated - something like

provider "kubernetes" {
  host                   = module.aks_cluster.cluster_endpoint
  cluster_ca_certificate = base64decode(module.aks_cluster.cluster_certificate_authority_data)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "kubelogin"
    args = [
      "get-token", 
      "--environment",
       "AzurePublicCloud",
       "--server-id",
       module.aks_cluster.server_id,
       "--client-id",
       module.aks_cluster.client_id,
       "--tenant-id",
       module.aks_cluster.tenant_id,
       "--login"
       "devicecode"
      ]
  }
}

So what I'm concretely getting at is, can we aim at the kubernetes provider behaviour for azurerm?

-> Terraform's core isn't perfect, but has enough capability to support this for well structured and modern providers

-> there might be a lot of work required to get there (e.g. migrate to the new Provider Framework) or do some incompatible-changes which require careful sequencing.

manicminer commented 9 months ago

@rbtcollins Thanks for the clarification, but unfortunately this is not the issue. Apologies if I haven't made it clear - this can become quite complex due to the variety of configurations and means of configuration that we support. It is unfortunately not an issue of whether or not the provider is technically able to proceed with unknown values in its configuration - the issue is fundamentally that the subscription you want to plan against doesn't physically (electrically?) exist, nor is it possible to determine whether it exists, until after the first apply, so it's impossible to offer a trustworthy plan in such circumstances.

It's also not practical to make the assumption that every resource needs creating and proceed anyway, because that might not be true in the event the subscription does exist but isn't in state. If that's the case for you, then the subscription is known to the practitioner and should be imported, or the subscription ID should be set in the configuration (making the problem irrelevant).

Whilst I agree there are some commonalities with the Kubernetes provider being pointed at a to-be-created cluster, it's not quite the same given the variety of practitioner configurations in the wild for Azure, our inability to distinguish a practitioner's intent when environment variables are offered as a means of setting provider config values, whilst supporting out-of-band authentication methods like Azure CLI, and the myriad of permissions related concerns that arise during a plan/apply operation. For these largely safety related reasons, we are unlikely to be able to support proceeding with a plan against an unknown or nonexistent subscription.

Our current recommendation, as per earlier comments, is that your configuration is split into two modules where the first handles subscription creation and the second [set of] module[s] manages resources inside those subscriptions, and that those modules are applied in distinct operations.

rbtcollins commented 9 months ago

@manicminer I think you've made it clear, I will happily admit I don't understand all the interactions you describe - they sound like technical debt, and perhaps you could consider this something to address in some future major version bump. It would be fantastic to bring the azure provider up to parity with e.g. the google provider, or the kubernetes provider, in this regard.

Re this bit though:

It's also not practical to make the assumption that every resource needs creating and proceed anyway, because that might not be true in the event the subscription does exist but isn't in state. If that's the case for you, then the subscription is known to the practitioner and should be imported, or the subscription ID should be set in the configuration (making the problem irrelevant).

This is an XY problem I think: out of state subscriptions should be imported; there are existing facilities (now much better since 1.5) for doing declarative imports, and indeed an imported subscription would not show as having an Unknown ID, so should plan and operate entirely fine today. It is only the wholly new case that any putative change would need to solve.

manicminer commented 9 months ago

@manicminer I think you've made it clear, I will happily admit I don't understand all the interactions you describe - they sound like technical debt, and perhaps you could consider this something to address in some future major version bump. It would be fantastic to bring the azure provider up to parity with e.g. the google provider, or the kubernetes provider, in this regard.

Tech debt only accounts for a minority of the problem - specifically that it's not possible to distinguish between a missing or a blank configuration value. This will likely improve when we are able to migrate from terraform-plugin-sdk to terraform-plugin-framework although this is going to be some time.

This is an XY problem I think: out of state subscriptions should be imported; there are existing facilities (now much better since 1.5) for doing declarative imports, and indeed an imported subscription would not show as having an Unknown ID, so should plan and operate entirely fine today. It is only the wholly new case that any putative change would need to solve.

That's true of course for 1.5+ (when a practitioner makes use of import blocks), but unfortunately isn't as straightforward for earlier Terraform versions, especially a usage of legacy versions e.g. pre-1.x which we still support. There are a wide variety of users and configurations out there.

Going back to the earlier points in this thread, this will continue to conflict with supporting Azure CLI for authentication. We need to look at this more closely in 4.0, which might involve making subscription_id a required property, or changing the way we lean on Azure CLI. But we'll investigate more deeply prior to making that decision, with a view to try and support dynamic configurations - but as above in the meantime we're not in a position to be able to improve usability here without breaking some users.

MoussaBangre commented 1 week ago

Any solution about that ? Also having the same issue. I am using the vending module to create subscription and would like to add more services onto them.

hashicorp / terraform-provider-azurerm

Inconsistent behaviour when specifying `subscription_id` attribute in provider block for new Subscriptions #17979

Is there an existing issue for this?

Community Note

Terraform Version

AzureRM Provider Version

Affected Resource(s)/Data Source(s)

Terraform Configuration Files

Debug Output/Panic Output

Expected Behaviour

Actual Behaviour

Steps to Reproduce

Important Factoids

References

14863

17541