ArcTheMaster opened this issue 3 months ago
@ArcTheMaster Sorry you ran into this. Would you mind providing a minimal Terraform configuration that I can use to reproduce it on my side (I'm not familiar with the Databricks provider, sorry)? E.g. a workspace with a spark version data source would suffice.
Hello @magodo ,
Our use case is a bit complex because the workspace relies on Azure resources created prior to its deployment: a resource group, a vnet, subnets and network security groups. You can broadly follow this example: databricks private endpoint example, which uses mostly the same building blocks we do.
Just for reference, here is our code for the workspace only:
# create the databricks workspace
resource "azurerm_databricks_workspace" "workspace" {
custom_parameters {
no_public_ip = true
private_subnet_name = local.azurerm_databricks_private_subnet_name
public_subnet_name = local.azurerm_databricks_public_subnet_name
private_subnet_network_security_group_association_id = local.azurerm_databricks_private_subnet_network_security_group_association_id
public_subnet_network_security_group_association_id = local.azurerm_databricks_public_subnet_network_security_group_association_id
storage_account_name = local.azurerm_databricks_managed_storage_account_name
storage_account_sku_name = "Standard_ZRS"
virtual_network_id = var.vnet_id
}
infrastructure_encryption_enabled = true
lifecycle {
ignore_changes = [
tags
]
}
location = var.vnet_location
managed_resource_group_name = local.azurerm_databricks_managed_resource_group_name
name = local.azurerm_databricks_workspace_name
network_security_group_rules_required = "NoAzureDatabricksRules"
provider = azurerm.azure
public_network_access_enabled = false
resource_group_name = var.vnet_resource_group_name
sku = "premium"
tags = merge(var.application_tags, {
destructible = "true"
environment = var.application_environment
name = local.azurerm_databricks_workspace_name
owner = var.application_owner
project = var.application_project
resource = "${local.azurerm_databricks_tags_prefix}-workspace"
})
timeouts {
create = "15m"
delete = "15m"
update = "15m"
}
}
For the cluster inside the workspace, I suggest you spin up a simple git proxy cluster like this:
# get databricks node type configurations
data "databricks_node_type" "smallest" {
local_disk = true
provider = databricks.workspace
}
# get databricks spark version configurations
data "databricks_spark_version" "latest_lts" {
latest = true
long_term_support = true
provider = databricks.workspace
}
# create a databricks cluster for git proxy
resource "databricks_cluster" "git_proxy" {
autotermination_minutes = 0
azure_attributes {
availability = "SPOT_WITH_FALLBACK_AZURE"
first_on_demand = 1
spot_bid_max_price = -1
}
cluster_name = local.databricks_cluster_name
custom_tags = {
"ResourceClass" = "SingleNode"
}
node_type_id = data.databricks_node_type.smallest.id
num_workers = 0
provider = databricks.workspace
spark_version = data.databricks_spark_version.latest_lts.id
spark_conf = {
"spark.databricks.cluster.profile" : "singleNode",
"spark.master" : "local[*]",
}
spark_env_vars = {
"GIT_PROXY_ENABLE_SSL_VERIFICATION" : "False"
"GIT_PROXY_HTTP_PROXY" : var.git_http_proxy,
}
timeouts {
create = "30m"
update = "30m"
delete = "30m"
}
}
# customize the spark cluster with git proxy configuration
resource "databricks_workspace_conf" "this" {
custom_config = {
"enableGitProxy" : true
"gitProxyClusterId" : databricks_cluster.git_proxy.cluster_id
}
provider = databricks.workspace
}
One more thing I can add: everything is modularized for better code handling and quality. The databricks.workspace provider is configured like this, but you can get something working from the example I pointed you to above:
# databricks provider workspace specific configuration
provider "databricks" {
alias = "workspace"
azure_client_id = var.azurerm_client_id
azure_client_secret = var.azurerm_client_secret
azure_tenant_id = var.azurerm_tenant_id
azure_workspace_resource_id = module.databricks_workspace[0].databricks_workspace_resource_id
host = module.databricks_workspace[0].databricks_workspace_url
}
I hope this helps you.
Hi @ArcTheMaster, I tried to reproduce the problem by creating a Databricks workspace and reading the databricks_spark_version data source right after, but I have been unsuccessful after 3 attempts.
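Roughly, the repro has the following shape (a simplified, illustrative sketch; names, location and provider authentication are placeholders rather than the actual test config):

# illustrative sketch: create a workspace, then read a Databricks data source right after
resource "azurerm_resource_group" "repro" {
  name     = "databricks-repro-rg" # placeholder
  location = "eastus"              # placeholder
}

resource "azurerm_databricks_workspace" "repro" {
  name                = "databricks-repro-dbw" # placeholder
  resource_group_name = azurerm_resource_group.repro.name
  location            = azurerm_resource_group.repro.location
  sku                 = "premium"
}

provider "databricks" {
  alias = "workspace"
  host  = azurerm_databricks_workspace.repro.workspace_url
}

# read the spark versions as soon as the workspace reports Creation complete
data "databricks_spark_version" "latest_lts" {
  latest            = true
  long_term_support = true
  provider          = databricks.workspace
  depends_on        = [azurerm_databricks_workspace.repro]
}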
Hello @magodo, are there any updates on this?
Hey @ArcTheMaster, @gerrytan is now looking at this. Could you please review Gerry's test setup and log output to confirm whether it is the setup expected to reproduce the issue?
Hello @gerrytan @magodo ,
Sorry for the late response, I had a pretty busy week.
To answer Gerry: unfortunately this is not the expected result. I am not sure whether this is related to the fact that the configuration used does not set up the network stack I build prior to deploying the workspace (the resource group, vnet, subnets and network security groups mentioned earlier).
Note that the metastore is already created and I attach the workspace to it, so we are in a Unity Catalog configuration. In your case I see no metastore attachment, which could potentially have an impact on the behavior. Something else I noticed is the Azure location: you're using australiasoutheast, while in my case I deploy into useast (it may be worth testing).
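For reference, attaching a workspace to an existing metastore is done with something like the following (a minimal sketch; the metastore ID variable is illustrative, not my exact code):

# illustrative sketch: assign the workspace to the pre-existing Unity Catalog metastore
resource "databricks_metastore_assignment" "this" {
  metastore_id = var.metastore_id # assumed: ID of the already-created metastore
  workspace_id = azurerm_databricks_workspace.workspace.workspace_id # numeric workspace ID
}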
Can you reuse the example I shared when the ticket was created?
# create the databricks workspace
resource "azurerm_databricks_workspace" "workspace" {
custom_parameters {
no_public_ip = true
private_subnet_name = local.azurerm_databricks_private_subnet_name
public_subnet_name = local.azurerm_databricks_public_subnet_name
private_subnet_network_security_group_association_id = local.azurerm_databricks_private_subnet_network_security_group_association_id
public_subnet_network_security_group_association_id = local.azurerm_databricks_public_subnet_network_security_group_association_id
storage_account_name = local.azurerm_databricks_managed_storage_account_name
storage_account_sku_name = "Standard_ZRS"
virtual_network_id = var.vnet_id
}
infrastructure_encryption_enabled = true
lifecycle {
ignore_changes = [
tags
]
}
location = var.vnet_location
managed_resource_group_name = local.azurerm_databricks_managed_resource_group_name
name = local.azurerm_databricks_workspace_name
network_security_group_rules_required = "NoAzureDatabricksRules"
provider = azurerm.azure
public_network_access_enabled = false
resource_group_name = var.vnet_resource_group_name
sku = "premium"
tags = merge(var.application_tags, {
destructible = "true"
environment = var.application_environment
name = local.azurerm_databricks_workspace_name
owner = var.application_owner
project = var.application_project
resource = "${local.azurerm_databricks_tags_prefix}-workspace"
})
timeouts {
create = "15m"
delete = "15m"
update = "15m"
}
}
The other API that breaks all the time after deploying the workspace for the first time is this one:
locals {
  system_tables_management_schema = [
    "access",
    "compute",
    "lakeflow",
    "lineage",
    "marketplace",
    "query",
    "storage"
  ]
}

# enable the databricks system tables management
resource "databricks_system_schema" "tables_management" {
  for_each = toset(local.system_tables_management_schema)
  provider = databricks.workspace
  schema   = each.key
}
I can make myself available for a live debugging session if desired; just send me a PM so we can schedule it.
Thanks again.
Hi @ArcTheMaster, I'm still trying to come up with a minimal HCL config to reproduce the problem. The last snippet you posted above contains a lot of references to other objects which are not shown (subnet, virtual network).
Note that, assuming I can reproduce the problem and it is some sort of eventual-consistency problem (i.e. the API claims the workspace is ready but it is not), we will have to submit a bug / feature request to fix / improve the API behaviour. The azurerm Terraform provider is designed to be a thin layer that communicates with these APIs; it should not have additional polling logic added.
Databricks workspace Create / update API reference: https://learn.microsoft.com/en-us/rest/api/databricks/workspaces/create-or-update?view=rest-databricks-2024-05-01&tabs=HTTP
Hi @gerrytan ,
I did not share all the code on purpose, due to the complexity of our use case. I implemented our Databricks deployment with multiple internal modules (dns, vnet, vhub peering...) and it would not have been secure to share everything; it is corporate, private code.
But the example at this link is close to my code - https://github.com/hashicorp/terraform-provider-azurerm/blob/main/examples/private-endpoint/databricks/private-endpoint/main.tf
My offer of a live debugging session still stands! This issue has a big impact on the idempotency and replayability of our code.
Hi @ArcTheMaster, yes a live debugging session would be useful for us, except the timezone difference might not be so kind 😅. Reach out to me at gerry.tan-at-microsoft.com anyway to get this organised.
Meanwhile I've done more work to try to reproduce as you suggested, but have not yet been successful:
Error: cannot read spark version: cannot read data spark version: Unauthorized access to workspace: 1689***********
error at terraform plan; I think this is due to my network config though 😞 (main-vnet.tf, main-vnet-terraform-plan.log).
Hello @rhysmg ,
This is exactly what I have, not with the databricks_storage_credential resource but with many others. You're absolutely right: the behavior isn't consistent and varies from time to time.
Adding a delay, or several delays at different places in the code, does not change the issue. I put a 60-second delay right after the azurerm_databricks_workspace resource (called by a module I built) and another one of 180 seconds just after the module call ends and hands control back to the rest of the Terraform code.
I have the error with two regions, useast and useast2, so this does not seem to be region related.
@gerrytan sorry for the delay, I am heading to production at the moment and this consumes a lot of my time day to day. Let's see how we can set up a call between us; I will send you an email today as a first contact.
Thank you.
Hi folks, I am trying to create 2x databricks_storage_credential resources. I was able to create one, but not the second, which failed with the following error:
│ Error: cannot create storage credential: The service at /api/2.1/unity-catalog/storage-credentials is temporarily unavailable.
- I added a 300s wait time after azurerm_databricks_workspace but this did not help.
- I made one resource dependent on the other (see the sketch below) and it runs smoothly. The same API doesn't like being called twice so quickly?
- I am seeing this issue with australiaeast. I will try to test with a US location soon.

Update: I have reproduced with West US 2.
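The depends_on workaround mentioned above looks roughly like this (an illustrative sketch; the credential names and the access connector reference are placeholders, not the real config):

# first storage credential (illustrative)
resource "databricks_storage_credential" "data" {
  name = "data-credential" # placeholder name
  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.example.id # assumed to exist
  }
}

# second storage credential, forced to wait for the first one
resource "databricks_storage_credential" "config" {
  name = "config-credential" # placeholder name
  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.example.id
  }
  depends_on = [databricks_storage_credential.data] # serialize the two API calls
}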
Hi @rhysmg,
Just wondering, are you configuring your own vnet, subnets, private endpoints and a custom managed storage account name in your code / use case, rather than relying on a fully provisioned default config?
Also, just a thought on my side: do you set the http_timeout_seconds property on the Databricks provider? I tried overriding it but unfortunately it does not help the behavior.
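For reference, overriding that provider option looks like this (a sketch; the value shown is illustrative):

provider "databricks" {
  alias                = "workspace"
  host                 = module.databricks_workspace[0].databricks_workspace_url
  http_timeout_seconds = 600 # illustrative value
  # ...same authentication settings as in the provider block shared earlier
}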
Hi @ArcTheMaster,
Yes, I think so... sorry, I'm not a Databricks guy, just working on the TF integration side of things. But yes, we are provisioning our own vnet and subnets and use them as follows:
resource "azurerm_databricks_workspace" "xfionlei" {
name = "${var.env_type}-xfi-${var.env_name}-${var.az_location_id}-onlei-dbw"
location = data.azurerm_resource_group.xfionlei.location
resource_group_name = data.azurerm_resource_group.xfionlei.name
managed_resource_group_name = "${var.env_type}-xfi-${var.env_name}-${var.az_location_id}-onlei-dbw-rg"
sku = "premium"
custom_parameters {
no_public_ip = true # SCC (Secure cluster commmunication) enabled
virtual_network_id = data.azurerm_virtual_network.xfionlei.id
public_subnet_name = data.azurerm_subnet.xfionlei_dbw_public.name
private_subnet_name = data.azurerm_subnet.xfionlei_dbw_private.name
public_subnet_network_security_group_association_id = azurerm_subnet_network_security_group_association.xfionlei_dbw_public.id
private_subnet_network_security_group_association_id = azurerm_subnet_network_security_group_association.xfionlei_dbw_private.id
}
}
We also create storage accounts for data and config, along with the following resources:
- azurerm_storage_account
- azurerm_storage_container
- databricks_storage_credential
- databricks_external_location

I haven't set any options in the Databricks provider except the host field with the workspace URL.
Update:
Quick update to say that the depends_on workaround I mentioned above is not working consistently.
@gerrytan, any further update from the Databricks team?
Sorry @rhysmg, I haven't heard an update from Databricks; let me follow up with them again.
Description
Azure support related case ID - 2407310040007808
Link to the resource source - azurerm_databricks_workspace resource
Global context
From time to time, several data source calls fail after a workspace deployment, even though the Terraform resource azurerm_databricks_workspace has reported a Creation complete state.
In our use case, the APIs identified as failing when called before the workspace is fully available are:
- /api/2.0/clusters/spark-versions - called by the databricks_spark_version data source
- /api/2.1/unity-catalog/bindings/catalog - called by the databricks_workspace_binding resource
- /api/2.1/unity-catalog/metastore_summary - called by the databricks_schema resource

Terraform error stack trace
Error mitigation
A sleep after the workspace deployment can be used, but this is not production-ready due to the randomness of when the workspace is actually available; the delay needed can vary a lot, from seconds to minutes.
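For example, a sketch of this mitigation using the hashicorp/time provider (the duration is illustrative; no value is guaranteed to be enough, which is exactly the problem):

# wait an arbitrary amount of time after the workspace reports completion
resource "time_sleep" "wait_for_workspace" {
  depends_on      = [azurerm_databricks_workspace.workspace]
  create_duration = "180s" # illustrative value
}

# downstream Databricks calls are gated on the sleep
data "databricks_spark_version" "latest_lts" {
  latest            = true
  long_term_support = true
  provider          = databricks.workspace
  depends_on        = [time_sleep.wait_for_workspace]
}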
Another potential approach, also not recommended, is to use a local-exec provisioner. The official HashiCorp documentation mentions that such code is risky and that the resource should integrate the feature natively - local-exec official documentation
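A sketch of that variant, for completeness (again with an arbitrary, illustrative delay):

resource "azurerm_databricks_workspace" "workspace" {
  # ... workspace configuration as above ...

  # not recommended: block until an arbitrary delay has elapsed after creation
  provisioner "local-exec" {
    command = "sleep 180" # illustrative; requires a shell providing the sleep command
  }
}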
Feature request
The azurerm_databricks_workspace resource should not return a Creation complete state to Terraform while the workspace APIs are not yet callable. A regular API poll should be implemented inside the resource, retrying as long as an HTTP code such as 401 (or similar) is returned.
New or Affected Resource(s)/Data Source(s)
azurerm_databricks_workspace
Potential Terraform Configuration
No response
References
No response