Hi Team,
I have removed the workspace and retried again. A pattern I have noticed is that when I set `managed_disk_cmk_rotation_to_latest_version_enabled = false`, the CMK creation works fine on an existing Databricks workspace. But when this is set to `true`, I get the below error: `polling after create or update: internal-error: a polling status of failed should be surfaced as a polling failed error`
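For reference, the flag in question, shown in a minimal sketch with hypothetical names for the other arguments:

```hcl
resource "azurerm_databricks_workspace" "this" {
  name                = "example-workspace" # hypothetical
  resource_group_name = "example-rg"        # hypothetical
  location            = "eastus"
  sku                 = "premium"

  managed_disk_cmk_key_vault_key_id = azurerm_key_vault_key.example_disk_key.id # hypothetical key

  # apply succeeds when this is false; fails with the polling error above when true
  managed_disk_cmk_rotation_to_latest_version_enabled = true
}
```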
@aparna-reji, after reading the CLI issue it appears that the issue is with the RP and not the provider. I will reach out to the service team and see if I can get any more information about the issue which caused this error. 🚀
@aparna-reji: Can we have the ARM resource ID of the Databricks workspace? It will be in this format: `/subscriptions/XXXXX/resourceGroups/XXXXX/providers/Microsoft.Databricks/workspaces/XXXXXX`
@aparna-reji:
In trying to track down and debug this error, I found where the error is being returned from (see the code below). It appears this is being caused by the update call that propagates the tags to all of the connected resources. In your example above, did you by any chance happen to update the tags too prior to running the `apply`?
```go
// Only call Update (e.g., PATCH) if it is not a new resource and the Tags have changed
// this will cause the updated tags to be propagated to all of the connected
// workspace resources.
// TODO: can be removed once https://github.com/Azure/azure-sdk-for-go/issues/14571 is fixed
if !d.IsNewResource() && d.HasChange("tags") {
	workspaceUpdate := workspaces.WorkspaceUpdate{
		Tags: expandedTags,
	}

	err := client.UpdateThenPoll(ctx, id, workspaceUpdate)
	if err != nil {
		return fmt.Errorf("updating %s Tags: %+v", id, err)
	}
}
```
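To make the trigger concrete, any edit to the `tags` map on the workspace resource is enough to route through that PATCH path on the next apply; a sketch with hypothetical tag values:

```hcl
resource "azurerm_databricks_workspace" "example" {
  # ... other arguments unchanged ...

  tags = {
    tag1 = "val1"
    tag2 = "val2-edited" # changing any tag makes d.HasChange("tags") true,
                         # so the provider issues the PATCH shown above
  }
}
```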
@msaranga
I think this is related to issue 14571. I checked the issue, and it is still open, so I am assuming we still need to implement the work-around as described in the above Terraform provider code.
@WodansSon Did you mean whether the tags were updated manually before the apply? I don't think that has happened. But in my terraform plan for the run where I set `managed_disk_cmk_rotation_to_latest_version_enabled = false`, the tags are shown as updated in place and the actual tags are shown as being removed. Please see the relevant part of the terraform plan below:
```
~ tags = {
    - "tag1" = "val1" -> null
    - "tag2" = "val2" -> null
    - "tag3" = "val3" -> null
  }
```
And when I go and check the redeployed Databricks workspace, I can see that it still has the tags.
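Side note: if these tags are managed outside of Terraform, I assume an `ignore_changes` lifecycle rule would keep Terraform from planning their removal; a minimal sketch:

```hcl
resource "azurerm_databricks_workspace" "this" {
  # ... other arguments unchanged ...

  lifecycle {
    # the tags are applied outside of this configuration, so don't plan their removal
    ignore_changes = [tags]
  }
}
```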
Also, I believe this update call issue is what is happening for the error `performing CreateOrUpdate: unexpected status 400 with error: DiskEncryptionPropertiesRequired: Existing Disk Encryption Properties must be specified in the PUT request.` But as I mentioned, when I removed the workspace and retried again with `managed_disk_cmk_rotation_to_latest_version_enabled = false`, the CMK creation worked fine on an existing Databricks workspace. But when this is set to `true`, I get the below error: `polling after create or update: internal-error: a polling status of failed should be surfaced as a polling failed error`. May I know if this polling status failed error is also due to the update call?
@aparna-reji, Yes, the polling error is coming back from the update call. In the provider we make the update call and then poll the LRO until it is complete. From the above it appears an error happened while we were polling for completion of the update call.
@aparna-reji, I am not able to reproduce your issue locally except for the `performing CreateOrUpdate: unexpected status 400 with error: DiskEncryptionPropertiesRequired: Existing Disk Encryption Properties must be specified in the PUT request` error, which I believe is by design, because once you encrypt the workspace you cannot undo it without destroying the workspace and recreating it without the encryption.
Just for verification, here are the steps I took based on how I understood your repro case above. I created the workspace and all of the supporting resources like this, so this is the state I believe your workspace was in before you added the DBFS and Managed Disk CMK encryption keys:
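A minimal sketch of that initial state (the resource and workspace names are taken from the plan output further down in this comment; everything else here is an assumption):

```hcl
resource "azurerm_databricks_workspace" "repro" {
  name                = "databricks-repro"
  resource_group_name = azurerm_resource_group.repro.name
  location            = azurerm_resource_group.repro.location
  sku                 = "premium"

  # must already be enabled so the DBFS CMK resource can be attached later
  customer_managed_key_enabled = true
}
```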
Once that is provisioned, the encryption blade in your workspace should look like the below:
I then added the DBFS and the Managed Disk encryption configuration like this:
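A sketch of that change, assuming the key vault key names (the workspace attribute names match the plan output later in this thread):

```hcl
resource "azurerm_databricks_workspace" "repro" {
  # ... arguments as in the initial configuration ...

  managed_disk_cmk_key_vault_key_id                   = azurerm_key_vault_key.databricks_encrypt_disk_key.id
  managed_disk_cmk_rotation_to_latest_version_enabled = true
}

resource "azurerm_databricks_workspace_customer_managed_key" "databricks_DBFS" {
  workspace_id     = azurerm_databricks_workspace.repro.id
  key_vault_key_id = azurerm_key_vault_key.databricks_encrypt_dbfs_key.id
}
```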
I then did an `apply` and allowed it to deploy the new infrastructure. Once that completes successfully, your encryption blade in your workspace should look like this:
So if I read your repro steps correctly, we are now in the same state your environment was in before you attempted to revert the CMK keys for Managed Disk and DBFS, is that correct?
I then removed the Managed Disk CMK configuration values from the `azurerm_databricks_workspace` resource, as sketched below:
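The removal, with the same assumed key names as above:

```hcl
resource "azurerm_databricks_workspace" "repro" {
  # ... arguments as in the initial configuration ...

  # removed:
  # managed_disk_cmk_key_vault_key_id                   = azurerm_key_vault_key.databricks_encrypt_disk_key.id
  # managed_disk_cmk_rotation_to_latest_version_enabled = true
}
```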
Which resulted in the error you mentioned in your repro steps above:
I believe this error is by design, but I will check with the service team to see what the expected behavior is supposed to be once you set the managed disk CMK on a workspace and then attempt to remove it after it has already been set.
Now that I had received the `Existing Disk Encryption Properties must be specified in the PUT request.` error, I then attempted to remove the DBFS encryption settings from the workspace as below:
I then ran an `apply`, which successfully removes the DBFS encryption settings, but fails with the same error message as above when it attempts to remove the managed disk CMK settings again. Once the `apply` fails, the encryption blade for my workspace looks like this:
As you can see from my attempted repro steps above, I was not able to reproduce either of the below reported errors:
```
│ Error: updating Workspace (Subscription: 'XXXXXXXXXXXXXXXX'
│ Resource Group Name: 'XXXX-rg'
│ Workspace Name: 'databricks') Tags: performing Update: unexpected status 400 with error: ApplicationUpdateFail: Failed to update application: 'databricks', because patch resource group failure.
```

```
polling after create or update: internal-error: a polling status of failed should be surfaced as a polling failed error
```
> @aparna-reji, Yes, the polling error is coming back from the update call. In the provider we make the update call and then poll the LRO until it is complete. From the above it appears an error happened while we were polling for completion of the update call.

May I please know what an LRO is?
@aparna-reji, Sure, it's short for **L**ong **R**unning **O**peration.
@WodansSon @aparna-reji Currently we don't support disabling CMK for Disk and Managed Services (aka CMK for Notebook). We expect clients to pass the CMK information during every workspace update.
@msaranga / @WodansSon Actually, I don't want to disable CMK for Disk and Managed Services.
@WodansSon Replying to your repro walkthrough above:
The initial Databricks workspace configuration looks correct.
Adding the Managed Disk CMK configuration also looks correct, except for the fact that when I set `managed_disk_cmk_rotation_to_latest_version_enabled = true`, I get an error. That's when I first got:

```
│ Error: updating Workspace (Subscription: 'XXXXXXXXXXXXXXXX'
│ Resource Group Name: 'XXXX-rg'
│ Workspace Name: 'databricks') Tags: performing Update: unexpected status 400 with error: ApplicationUpdateFail: Failed to update application: 'databricks', because patch resource group failure.
```

Later, when I retried by setting `managed_disk_cmk_rotation_to_latest_version_enabled = false`, the terraform apply ran successfully. When I then retried setting `managed_disk_cmk_rotation_to_latest_version_enabled = true` with the managed services, DBFS, and disk CMKs, I got the error: `polling after create or update: internal-error: a polling status of failed should be surfaced as a polling failed error`
And I did not attempt to revert the CMK keys for Managed Disk and DBFS.
The issue is that I cannot successfully add the managed disk CMK to an existing workspace with `managed_disk_cmk_rotation_to_latest_version_enabled = true`. I can only add the managed disk CMK if I set `managed_disk_cmk_rotation_to_latest_version_enabled = false` and keep it like that.
@aparna-reji: Can you please file a support ticket or share the Databricks workspace ID or workspace resource ID with us for troubleshooting?
@msaranga Just to double confirm, will you be contacting Databricks with the workspace ID to get further troubleshooting details?
@aparna-reji: I'm from the Databricks engineering team. Could you provide us the workspace resource ID? It will be in this format:
`/subscriptions/{SubID}/resourceGroups/{RGName}/providers/Microsoft.Databricks/workspaces/{WSName}`
@aparna-reji, thanks for the reply. It appears I am still not understanding your repro case. I just attempted to repro what I believe you are describing and I was not able to get the error that you have reported.
1. I provision a workspace without `managed_disk_cmk_rotation_to_latest_version_enabled` or the `azurerm_databricks_workspace_customer_managed_key` resource defined in the configuration file.
2. I wait for step 1 to finish provisioning and then add `managed_disk_cmk_rotation_to_latest_version_enabled` and the `azurerm_databricks_workspace_customer_managed_key` resource into the configuration file, which generates a `plan` as below:
```
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # azurerm_databricks_workspace.repro will be updated in-place
  ~ resource "azurerm_databricks_workspace" "repro" {
        id                                                  = "/subscriptions/{subscription}/resourceGroups/repro-22394-resources/providers/Microsoft.Databricks/workspaces/databricks-repro"
      + managed_disk_cmk_key_vault_key_id                   = "https://repro1keyvault.vault.azure.net/keys/repro-disk-certificate/9fcac1dde3ce477caa4ae6c67851bff0"
      + managed_disk_cmk_rotation_to_latest_version_enabled = true
        name                                                = "databricks-repro"
        tags                                                = {}
        # (14 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

  # azurerm_databricks_workspace_root_dbfs_customer_managed_key.databricks_DBFS will be created
  + resource "azurerm_databricks_workspace_root_dbfs_customer_managed_key" "databricks_DBFS" {
      + id               = (known after apply)
      + key_vault_key_id = "https://repro1keyvault.vault.azure.net/keys/repro-dbfs-certificate/7797c6c0b478467992a7df431cd5bbe4"
      + workspace_id     = "/subscriptions/{subscription}/resourceGroups/repro-22394-resources/providers/Microsoft.Databricks/workspaces/databricks-repro"
    }

Plan: 1 to add, 1 to change, 0 to destroy.
```
I then `apply` the configuration changes, which results in:
```
azurerm_databricks_workspace.repro: Still modifying... [id=/subscriptions/{subscription}...Databricks/workspaces/databricks-repro, 30s elapsed]
azurerm_databricks_workspace.repro: Modifications complete after 36s [id=/subscriptions/{subscription}/resourceGroups/repro-22394-resources/providers/Microsoft.Databricks/workspaces/databricks-repro]
azurerm_databricks_workspace_root_dbfs_customer_managed_key.databricks_DBFS: Creating...
azurerm_databricks_workspace_root_dbfs_customer_managed_key.databricks_DBFS: Still creating... [50s elapsed]
azurerm_databricks_workspace_root_dbfs_customer_managed_key.databricks_DBFS: Creation complete after 50s [id=/subscriptions/{subscription}/resourceGroups/repro-22394-resources/providers/Microsoft.Databricks/workspaces/databricks-repro]

Apply complete! Resources: 1 added, 1 changed, 0 destroyed.
```
Were my repro steps accurate and a reasonable facsimile of the steps you took in your environment? The one thing that was not clear to me in your issue's configuration file was whether you created all of the `azurerm_key_vault_access_policy` resources needed to successfully enable the CMK scenario. If you look at my configuration files you will see that I have 3 key vault access policies defined (e.g., `notebook`, `terraform`, and `databricks`) that grant Terraform and Databricks permissions to access the Key Vault keys. I have pulled the resource definitions from the above step 2 configuration file to make it easier to point out the DBFS key vault access policy, please see below.
resource "azurerm_databricks_workspace_root_dbfs_customer_managed_key" "databricks_DBFS" {
depends_on = [azurerm_key_vault_access_policy.databricks]
workspace_id = azurerm_databricks_workspace.repro.id
key_vault_key_id = azurerm_key_vault_key.databricks_encrypt_dbfs_key.id
}
resource "azurerm_key_vault_access_policy" "databricks" {
depends_on = [azurerm_databricks_workspace.repro]
key_vault_id = azurerm_key_vault.repro.id
tenant_id = azurerm_databricks_workspace.repro.storage_account_identity.0.tenant_id
object_id = azurerm_databricks_workspace.repro.storage_account_identity.0.principal_id
key_permissions = [
"Get",
"GetRotationPolicy",
"UnwrapKey",
"WrapKey",
"Delete",
]
}
NOTE: The `azurerm_databricks_workspace_root_dbfs_customer_managed_key` resource is just the `azurerm_databricks_workspace_customer_managed_key` resource that has been renamed in the private branch I used to repro this issue.
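For completeness, the other two access policies mentioned above (`terraform` and `notebook`) follow the same pattern. Here is a sketch of the `terraform` one, granting the identity running Terraform access to the vault; the exact permission list in my repro configuration may differ:

```hcl
data "azurerm_client_config" "current" {}

resource "azurerm_key_vault_access_policy" "terraform" {
  key_vault_id = azurerm_key_vault.repro.id
  tenant_id    = data.azurerm_client_config.current.tenant_id
  object_id    = data.azurerm_client_config.current.object_id

  key_permissions = [
    "Create",
    "Delete",
    "Get",
    "GetRotationPolicy",
    "List",
    "Purge",
  ]
}
```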
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Is there an existing issue for this?
Community Note
Terraform Version
1.4.5
AzureRM Provider Version
3.50.0
Affected Resource(s)/Data Source(s)
azurerm_databricks_workspace
Terraform Configuration Files
Debug Output/Panic Output
Expected Behaviour
The workspace should have been modified properly with the additional CMKs.
The resource group mentioned in the error is the resource group where the workspace is deployed, not the managed resource group.
I have found a similar issue here: https://github.com/Azure/azure-cli/issues/22614, but cannot find what fix was applied.
When I checked with Microsoft, they confirmed that, since the issue occurs before the resources are deployed, it does not occur at the resource provisioning step and is therefore not related to either Azure or Databricks.
Can someone please let me know if there are any known fixes I can try from my end?
Actual Behaviour
The same code was working before adding the additional two CMKs for managed disk and DBFS. While trying to modify my Databricks workspace (re-deploying via Terraform) with the additional CMKs (for DBFS and disk), I am hitting the error.
To find which CMK is giving the trouble, I tried removing the extra configuration added for the managed disk and DBFS CMKs separately.
While trying to redeploy the same workspace after removing the additional code added for the managed disk CMK, I get the below error:
`performing CreateOrUpdate: unexpected status 400 with error: DiskEncryptionPropertiesRequired: Existing Disk Encryption Properties must be specified in the PUT request.`
Redeploying the same workspace after removing the additional code added for the DBFS CMK gives the same error from the start.
I referred to this ticket and redeployed the same workspace after setting `managed_disk_cmk_rotation_to_latest_version_enabled = false`; this threw another error saying: `polling after create or update: internal-error: a polling status of failed should be surfaced as a polling failed error`
Steps to Reproduce
`terraform apply`
Important Factoids
No response
References
https://github.com/hashicorp/terraform-provider-azurerm/blob/main/website/docs/r/databricks_workspace.html.markdown
https://github.com/hashicorp/terraform-provider-azurerm/issues/21487