Closed krowlandson closed 2 days ago
For additional clarity, what we are initially asking is for the same checks to be put in place so the provider behaves consistently when authenticating with either the Azure CLI (i.e. user scenario) or a Service Principal / Managed Identity.
Currently a user can configure a provider block with subscription_id = "" # i.e. empty string
without Terraform raising an error when authenticating with Azure CLI. This is because with the Azure CLI, the provider appears to default to using whichever Subscription is currently set within the Azure CLI user session.
We believe this to be incorrect, or at least a poor user experience which could also theoretically result in resources being created in the wrong Subscription in some scenarios.
@krowlandson Thank you for raising this together with the clear description!
Let me first talk about the Azure CLI auth method. As is documented, users are expected to specify the default subscription prior to auth via CLI in TF. The document clearly states that if the subscription_id
is not set, the default subscription selected by the CLI will be used. Otherwise, users shall set the subscription_id
explicitly. In the latter case, if the specified subscription id is not found, TF will raise error: https://github.com/hashicorp/terraform-provider-azurerm/blob/0357a5709aa5a18afbcb37d2c263acbeb2216a36/vendor/github.com/hashicorp/go-azure-helpers/authentication/auth_method_azure_cli_token.go#L288-L303
Personally, I think this is a short hand for CLI auth method, which saves the users to specify the subscription_id
. Whilst, I also agree it makes sense to always explictly set the subscription_id
to avoid any surprise.
Regarding the error you got when apply
via the CLI auth, it seems due to the missing refresh on the account list. For this I'll create a PR to fix.
Thank you @magodo… I can see how the PR you raised will help remove the need to manually refresh the account list, and I assume this will help a user running the above code when using CLI auth.
However, will this further worsen the scenario where a customer writes some code which “works on my workstation” and then cause frustration when the same code doesn’t then work in a pipeline using SPN auth?
Also, I completely agree with the subscription ID selection process for when a customer hasn’t explicitly set a value for subscription_id
, but I think we should also catch situations where subscription_id
is specified in the provider block but without a valid value. This would then give consistent behaviour then between CLI auth and other auth methods.
@krowlandson I agree we should differ between whether subscription_id
is absent versus subscription_id
is set to empty string. @tombuildsstuff What's your thought on this?
As mentioned above, given an empty subscription_id
the provider will hit error when authenticated using SP during the plan stage, but not when using the CLI auth (as it will use the default subscription in this case). Ideally, either case should work in @krowlandson's config, as the second provider is referencing a known-after-apply attribute, which Terraform is able to handle it correctly:
provider "azurerm" {
alias = "new_sub"
subscription_id = azurerm_subscription.example.subscription_id
features {}
}
The behavior about how provider behaves when referencing such attribute is explained here (though this is for the plugin framework, but it should also apply to the plugin sdk v2). As is stated in the document:
You can choose how your provider handles this. If some resources or data sources can be used without knowing that value, it may be worthwhile to emit a warning and check whether the value is set in resources and data sources before attempting to use it. If resources and data sources can't provide any functionality without knowing that value, it's often better to return an error, which will halt the apply.
For this provider, IMO it should behave correctly without a subscription id during the plan stage: For existing resources, the subscription id can already be deduced by the resource id; For new resources, there is nothing to be done during plan stage.
Unfortunately, it is a non-trivial work to make the unknown value work for the azurerm provider at this moment. That is because:
For the point 1, I think after we totally migrate to the hashicorp/go-azure-sdk, it should be resolved. For the point 2, I think we can further balance whether we want the dynamic provider config more than those additional features during provider configuration phase.
Thanks for raising this @krowlandson and thanks both for the nuanced conversation. I think there's merit in the idea of requiring a subscription_id
to always be set regardless of the authentication method. If this were to be considered, it would naturally be considered a breaking change and could only reasonably be enforced in the next major version of the provider.
That aside, I'm not sure we can reliably compare between an unset subscription ID and a blank subscription ID - mainly because we also source this from an environment variable and so the unset value is nearly always an empty string AFAICT.
A further thought on this - what is the value of having the subscription_id be unknown until apply time? Wouldn't that cause the plan to be inaccurate/unusable?
@manicminer yeah agreed, I think we should probably look to make subscription_id
required in 4.0 to alleviate the confusion.
There's been a number of issues where folks have been confused about this behaviour (by having resources provisioned in the wrong subscription or otherwise) - and whilst this behaviour made sense within CloudShell (for learning purposes) - it doesn't really make sense outside of that environment, so I think it's worth making this field required going forward.
I am experiencing the same issue as originally described above. In my case it seems like the error is related to service principal authentication in combination with remote backend for state management.
I’m running with terraform version v1.3.9 and azurerm provider 3.46.0
To prove this, I have tested tree scenarios:
Interactive az cli authentication with remote backend configured: I’m able to successfully plan and apply with remote backend configured. Subscription is created, and resource group is added into the newly created subscription.
Service principal authentication with remote backend configured: Terraform plan throws an error: │ Error: building account: unable to configure ResourceManagerAccount: subscription ID could not be determined and was not specified │ │ with provider["registry.terraform.io/hashicorp/azurerm"].newsub, │ on test.tf line 24, in provider "azurerm": │ 24: provider "azurerm" {
Service principal authentication with local backend (no backend configured) I’m then able to both run plan and apply successfully. Result, subscription is created, and resource group added into newly created subscription.
Based on this testing, it seems there is an issue with the combination of remote state and service principal authentication. Can this please be investigated?
@halvoroppen We don't recommend creating new a subscription and adding resources to that subscription in the same module or apply operation. The reason it works with Azure CLI authentication is that the subscription ID is sourced at plan time - which means the plan is working against the wrong subscription - and because the correct subscription is yet to be created, the plan happens to be somewhat correct. But this is just coincidence. When authenticating as a service principal without specifying an existent subscription ID, the plan fails because the subscription ID cannot be determined. This is the expected behaviour.
The coincidental behaviour of planning with the wrong subscription ID when using CLI authentication, is likely to change in version 4.0 of the provider, and as such as don't recommend relying on this undocumented behaviour.
@manicminer, Thank you for your response. This might not be recommended, and yes, we can split the deployment into separate deployments even if that would complicate our use case. Having a single deployment is really something we would like to be supported, a single deployment make sense, and will ease deployments and our use case! Hopefully this is functionality that will have greater support in coming versions, and not less.
As stated above, the deployment runs successfully even with service principal authentication, as long as you run without a remote backend. If I run with local backend, the plan and apply stage runs successfully. If I specify a remote backend, it fails.
I have done some additional testing, it seems that the plan is not checking against the current/default subscription, there is no signs for this in the output (or in the trace log):
terraform plan
data.azurerm_billing_mca_account_scope.this: Reading...
data.azurerm_billing_mca_account_scope.this: Read complete after 0s [id=/providers/Microsoft.Billing/billingAccounts/bxxx/billingProfiles/xxx/invoiceSections/xxx]
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# azurerm_resource_group.rg will be created
+ resource "azurerm_resource_group" "rg" {
+ id = (known after apply)
+ location = "norwayeast"
+ name = "test"
}
# azurerm_subscription.this will be created
+ resource "azurerm_subscription" "this" {
+ alias = "tf-test-1"
+ billing_scope_id = "/providers/Microsoft.Billing/billingAccounts/xxx/billingProfiles/xxx/invoiceSections/xxx"
+ id = (known after apply)
+ subscription_id = (known after apply)
+ subscription_name = "tf-test-1"
+ tenant_id = (known after apply)
}
# null_resource.refresh_subs will be created
+ resource "null_resource" "refresh_subs" {
+ id = (known after apply)
}
# time_sleep.sub_create will be created
+ resource "time_sleep" "sub_create" {
+ create_duration = "10s"
+ id = (known after apply)
}
Plan: 4 to add, 0 to change, 0 to destroy.
Just as an additional verification, I have tried to add the resource group (test) into the default subscription just to see if that changed this behaviour, it did not – this is as expected since I can’t see any trace towards the default subscription that terraform is checking if that resource exists.
So, based on this testing, it seems like the verification of the resource group is not happening towards the default subscription. The issue only occurs if I specify a remote backend in combination with service principal authentication, so I assume there must be an issue related to this combination.
For your reference, I’m using this code to illustrate the issue:
terraform {
required_version = ">=1.3.0"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "=3.46.0"
}
time = {
source = "hashicorp/time"
version = "0.9.1"
}
}
# If I add a backend config (unncomment the below), and specify credentials for the backend, the plan and apply will fail
# backend config:
# backend "azurerm" {
# resource_group_name = "" # Fill in required info
# storage_account_name = "" # Fill in required info
# container_name = "" # Fill in required info
# key = "" # Fill in required info
# subscription_id = "" # Fill in required info
# tenant_id = "" # Fill in required info
# }
#Set these environment variables to enable authentication to Azure backend using service principal
# export ARM_SUBSCRIPTION_ID=""
# export ARM_CLIENT_SECRET=""
# export ARM_TENANT_ID=""
# export ARM_CLIENT_ID=""
}
provider "azurerm" {
alias = "newsub"
subscription_id = azurerm_subscription.this.subscription_id
features {
}
}
provider "azurerm" {
features {
}
}
variable "billing_account_name" {
description = "Billing Account Name for MCA account (guid)"
type = string
}
variable "billing_profile_name" {
description = "Billing Profile ID"
type = string
}
variable "invoice_section_name" {
description = "Invoce Section ID"
type = string
}
variable "subscription_name" {
description = "Name of the new subscriptoin to create"
type = string
}
data "azurerm_billing_mca_account_scope" "this" {
billing_account_name = var.billing_account_name
billing_profile_name = var.billing_profile_name
invoice_section_name = var.invoice_section_name
}
resource "azurerm_subscription" "this" {
subscription_name = var.subscription_name
billing_scope_id = data.azurerm_billing_mca_account_scope.this.id
alias = var.subscription_name
}
resource "azurerm_resource_group" "rg" {
depends_on = [
null_resource.refresh_subs
]
provider = azurerm.newsub
name = "test"
location = "norwayeast"
}
resource "time_sleep" "sub_create" {
depends_on = [
azurerm_subscription.this
]
create_duration = "10s"
}
resource "null_resource" "refresh_subs" {
depends_on = [
time_sleep.sub_create
]
provisioner "local-exec" {
command = "az account list --refresh"
}
}
Forgive my commenting without enough context, but I don't understand what prevents the provider treating the unknown value correctly.
That is:
A common pattern is to use the Value.IsKnown() method to confirm that a value is known, then to use the Value.As() method to retrieve the underlying data for use.
.That might still have some edge cases - but it should meet the following behaviours that seem to be expressed as desirable:
subscription_id
can still fall back to the azcli
or other lookup methods@rbtcollins Thanks for commenting. Unfortunately the issue is not so much that we have difficulty determining the subscription ID, it is that the subscription ID is fundamentally unknown at plan time (in the original configuration provided for this issue). Because the subscription does not yet exist, we therefore cannot proceed with a plan.
The only reason that it does currently work when using az-cli authentication, is because the provider falls back to the default subscription provided by az-cli. In this instance, we actually don't want to fall back to the subscription ID provided by az-cli, because we're planning with the wrong subscription which is contrary to expectations and a breach of the configuration. We'll most likely prevent this situation from arising, by making subscription_id
a required provider property, in v4.x of the provider - since, correctly or not, this will be a breaking behavioral change. It will still be perfectly possible to interpolate a subscription ID (as indeed it is currently), as long as it is known at plan time.
https://github.com/hashicorp/terraform/issues/30937 has some relevant notes here that can explain some of the cognitive dissonance (where e.g. I and others make comments that don't align with the expectations of the devs).
In particular from that bug:
there was an old provider SDK that didn't expose Unknown values semantically, leading to an interpolated string (like we're dealing with in the subscription_id use case) being evaluated as an unset (or NULL) value. If the azurerm is written using that old SDK, then it won't be able to process the difference between unset and unknown, which would lead to the observed behaviour here.
The newer 'Provider Framework' corrects this, but still doesn't explicitly describe how to handle Unknown's, leading to e.g. the awscc
provider erroring on Unknowns.
That expresses that there are some sharp edges in Terraform as a framework today.
However, it also seems that using the newer provider framework, it is possible - for instance the kubernetes
provider handles this situation very elegantly: one can have a single terraform root module that describes:
kubernetes
provider where the API endpoint, server certificate and so on are all interpolated - something likeprovider "kubernetes" {
host = module.aks_cluster.cluster_endpoint
cluster_ca_certificate = base64decode(module.aks_cluster.cluster_certificate_authority_data)
exec {
api_version = "client.authentication.k8s.io/v1beta1"
command = "kubelogin"
args = [
"get-token",
"--environment",
"AzurePublicCloud",
"--server-id",
module.aks_cluster.server_id,
"--client-id",
module.aks_cluster.client_id,
"--tenant-id",
module.aks_cluster.tenant_id,
"--login"
"devicecode"
]
}
}
So what I'm concretely getting at is, can we aim at the kubernetes provider behaviour for azurerm?
-> Terraform's core isn't perfect, but has enough capability to support this for well structured and modern providers
-> there might be a lot of work required to get there (e.g. migrate to the new Provider Framework) or do some incompatible-changes which require careful sequencing.
@rbtcollins Thanks for the clarification, but unfortunately this is not the issue. Apologies if I haven't made it clear - this can become quite complex due to the variety of configurations and means of configuration that we support. It is unfortunately not an issue of whether or not the provider is technically able to proceed with unknown values in its configuration - the issue is fundamentally that the subscription you want to plan against doesn't physically (electrically?) exist, nor is it possible to determine whether it exists, until after the first apply, so it's impossible to offer a trustworthy plan in such circumstances.
It's also not practical to make the assumption that every resource needs creating and proceed anyway, because that might not be true in the event the subscription does exist but isn't in state. If that's the case for you, then the subscription is known to the practitioner and should be imported, or the subscription ID should be set in the configuration (making the problem irrelevant).
Whilst I agree there are some commonalities with the Kubernetes provider being pointed at a to-be-created cluster, it's not quite the same given the variety of practitioner configurations in the wild for Azure, our inability to distinguish a practitioner's intent when environment variables are offered as a means of setting provider config values, whilst supporting out-of-band authentication methods like Azure CLI, and the myriad of permissions related concerns that arise during a plan/apply operation. For these largely safety related reasons, we are unlikely to be able to support proceeding with a plan against an unknown or nonexistent subscription.
Our current recommendation, as per earlier comments, is that your configuration is split into two modules where the first handles subscription creation and the second [set of] module[s] manages resources inside those subscriptions, and that those modules are applied in distinct operations.
@manicminer I think you've made it clear, I will happily admit I don't understand all the interactions you describe - they sound like technical debt, and perhaps you could consider this something to address in some future major version bump. It would be fantastic to bring the azure provider up to parity with e.g. the google provider, or the kubernetes provider, in this regard.
Re this bit though:
It's also not practical to make the assumption that every resource needs creating and proceed anyway, because that might not be true in the event the subscription does exist but isn't in state. If that's the case for you, then the subscription is known to the practitioner and should be imported, or the subscription ID should be set in the configuration (making the problem irrelevant).
This is an XY problem I think: out of state subscriptions should be imported; there are existing facilities (now much better since 1.5) for doing declarative imports, and indeed an imported subscription would not show as having an Unknown ID, so should plan and operate entirely fine today. It is only the wholly new case that any putative change would need to solve.
@manicminer I think you've made it clear, I will happily admit I don't understand all the interactions you describe - they sound like technical debt, and perhaps you could consider this something to address in some future major version bump. It would be fantastic to bring the azure provider up to parity with e.g. the google provider, or the kubernetes provider, in this regard.
Tech debt only accounts for a minority of the problem - specifically that it's not possible to distinguish between a missing or a blank configuration value. This will likely improve when we are able to migrate from terraform-plugin-sdk to terraform-plugin-framework although this is going to be some time.
This is an XY problem I think: out of state subscriptions should be imported; there are existing facilities (now much better since 1.5) for doing declarative imports, and indeed an imported subscription would not show as having an Unknown ID, so should plan and operate entirely fine today. It is only the wholly new case that any putative change would need to solve.
That's true of course for 1.5+ (when a practitioner makes use of import
blocks), but unfortunately isn't as straightforward for earlier Terraform versions, especially a usage of legacy versions e.g. pre-1.x which we still support. There are a wide variety of users and configurations out there.
Going back to the earlier points in this thread, this will continue to conflict with supporting Azure CLI for authentication. We need to look at this more closely in 4.0, which might involve making subscription_id
a required property, or changing the way we lean on Azure CLI. But we'll investigate more deeply prior to making that decision, with a view to try and support dynamic configurations - but as above in the meantime we're not in a position to be able to improve usability here without breaking some users.
Any solution about that ? Also having the same issue. I am using the vending module to create subscription and would like to add more services onto them.
Is there an existing issue for this?
Community Note
Terraform Version
1.2.7
AzureRM Provider Version
3.18.0
Affected Resource(s)/Data Source(s)
azurerm_subscription
Terraform Configuration Files
Debug Output/Panic Output
Expected Behaviour
When trying to create a new Azure Subscription and then cross-referencing the Subscription ID of the new Subscription into a
provider
block we expect to see consistent behaviour regardless of the authentication method used.Actual Behaviour
When authenticating with a Service Principal the initial
terraform plan
throws the following error:We believe this to be correct, as at this point the new Subscription ID is unknown and therefore cannot be correctly set on the provider.
When we authenticate using the Azure CLI the initial
terraform plan
is successful, but we get the following error duringterraform apply
:We believe in this case that the provider is incorrectly accepting a null or empty string input on the
subscription_id
attribute of the provider block and instead defaulting to the currentdefault
Subscription associated with the Azure CLI authenticated user.This is incorrectly giving customers the illusion that they can create a new Subscription and then deploy resources straight into it within a single apply of a single root module.
Steps to Reproduce
terraform init
(should work)terraform plan
(should throw error)terraform init
(should work)terraform plan
(should work)terraform apply
(should create new Subscription and then throw error)terraform apply
(should throw the following error)az account list --refresh
(to refresh the list of available Subscriptions)az account set
to select a subscriptionterraform apply
(should work the second time as the Subscription now exists!)Important Factoids
n/a
References
I believe this issue identifies the underlying reason why customers are trying to create a new Subscription and then deploy resources to it, only to face issues which surface as issues such as:
14863
17541
We have successfully worked around this situation with our lz-vending module, but this was only made possible as the new AzAPI provider has an option to set the
parent_id
for any resource being created,We believe the addition of a new optional
parent_id
argument for all resources in the azurerm provider would allow customers to deploy resources across multiple Subscriptions without hitting this issue (or the need to provide multiple aliased providers).