hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

azurerm_databricks_workspace resource: add support for backend API availability state #26935

Open ArcTheMaster opened 1 month ago

ArcTheMaster commented 1 month ago

Description

Azure support related case ID - 2407310040007808

Link to the resource source - azurerm_databricks_workspace resource

Global context

From time to time, several data source calls fail after a workspace deployment, even though the Terraform resource azurerm_databricks_workspace has reported a Creation complete state.

module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Creating...
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [10s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [20s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [30s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [40s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [50s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [1m0s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [1m10s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [1m20s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [1m30s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [1m40s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [1m50s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [2m0s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [2m10s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [2m20s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [2m30s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [2m40s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Still creating... [2m50s elapsed]
module.databricks_workspace[0].azurerm_databricks_workspace.workspace: Creation complete after 2m51s [id=/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/databricks-prod/providers/Microsoft.Databricks/workspaces/ws-databricks-prod]

In our use case, the APIs identified as failing when called too early are the ones that depend on full workspace availability:

Any API may be impacted; the ones listed here are simply those identified in our business context.

Terraform error stack trace

╷
│ Error: cannot read spark version: cannot read data spark version: The service at /api/2.0/clusters/spark-versions is temporarily unavailable. Please try again later. [TraceId: 00-3efd1d02ea7361f91a0986acf72ee525-55e2e68128e83653-01]
│
│   with module.databricks_git_proxy[0].data.databricks_spark_version.latest_lts,
│   on .terraform/modules/databricks_git_proxy/data.tf line 8, in data "databricks_spark_version" "latest_lts":
│    8: data "databricks_spark_version" "latest_lts" {
│
╵
ERRO[0877] terraform invocation failed in /home/pbeauvois/gitlab/databricks/databricks/live/prod/databricks/.terragrunt-cache/ys6c2ZeeKg0yYzt7foZisH6x0Uc/Zn9KPkhUr3GVb-8wB-whrAmAE1A  error=[/home/pbeauvois/gitlab/databricks/databricks/live/prod/databricks/.terragrunt-cache/ys6c2ZeeKg0yYzt7foZisH6x0Uc/Zn9KPkhUr3GVb-8wB-whrAmAE1A] exit status 1 prefix=[/home/pbeauvois/gitlab/databricks/databricks/live/prod/databricks]
ERRO[0877] 1 error occurred:
        * [/home/pbeauvois/gitlab/databricks/databricks/live/prod/databricks/.terragrunt-cache/ys6c2ZeeKg0yYzt7foZisH6x0Uc/Zn9KPkhUr3GVb-8wB-whrAmAE1A] exit status 1

Error mitigation

A sleep after the workspace deployment can be used, but this is not a production-ready approach due to the randomness of the workspace's actual availability: the required wait time can vary a lot, from seconds to minutes.

# create a time sleep resource to wait 600 seconds post workspace creation
resource "time_sleep" "wait_600_seconds_post_ws_creation" {
  create_duration = "600s"

  depends_on = [
    module.databricks_workspace
  ]
}

Another potential approach, also not recommended, is the use of a local-exec provisioner. The official HashiCorp documentation mentions that such code is risky and that the resource should integrate the feature natively - local-exec official documentation
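
For illustration only, here is a rough sketch of what such a poll could look like with a null_resource (purely hypothetical; the workspace URL output, the endpoint and the retry values are assumptions, not our real code):

# hypothetical sketch: poll a workspace API endpoint until the backend answers
resource "null_resource" "wait_for_workspace_api" {
  depends_on = [
    module.databricks_workspace
  ]

  provisioner "local-exec" {
    # retry until curl gets a real HTTP status back (even 401/403 proves the backend
    # is reachable); 000 (no answer) or 503 means the workspace is still warming up
    command = <<-EOT
      for i in $(seq 1 60); do
        code=$(curl -s -o /dev/null -w '%%{http_code}' "https://${module.databricks_workspace[0].databricks_workspace_url}/api/2.0/clusters/spark-versions")
        if [ "$code" != "000" ] && [ "$code" != "503" ]; then exit 0; fi
        sleep 10
      done
      exit 1
    EOT
  }
}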

Feature request

The azurerm_databricks_workspace resource should not return a Creation complete state to Terraform before the workspace APIs are callable. A regular API poll should be implemented inside the resource, retrying until an HTTP code such as 401 (or similar, proving the backend is reachable) is returned.

New or Affected Resource(s)/Data Source(s)

azurerm_databricks_workspace

Potential Terraform Configuration

No response

References

No response

magodo commented 1 month ago

@ArcTheMaster Sorry you're running into this. Would you mind providing a minimal Terraform configuration that I can use to reproduce it on my side (I'm not familiar with the Databricks provider, sorry)? E.g. a workspace with a spark version data source should suffice.

ArcTheMaster commented 1 month ago

Hello @magodo ,

The use case we have is a bit complex, as the workspace relies on Azure resources created prior to its deployment: resource group, vnet, subnets and security groups. You can largely follow this example, databricks private endpoint example, which uses mostly the same building blocks we do.

For what it's worth, here is our code for the workspace only.

# create the databricks workspace
resource "azurerm_databricks_workspace" "workspace" {

  custom_parameters {
    no_public_ip                                         = true
    private_subnet_name                                  = local.azurerm_databricks_private_subnet_name
    public_subnet_name                                   = local.azurerm_databricks_public_subnet_name
    private_subnet_network_security_group_association_id = local.azurerm_databricks_private_subnet_network_security_group_association_id
    public_subnet_network_security_group_association_id  = local.azurerm_databricks_public_subnet_network_security_group_association_id
    storage_account_name                                 = local.azurerm_databricks_managed_storage_account_name
    storage_account_sku_name                             = "Standard_ZRS"
    virtual_network_id                                   = var.vnet_id
  }
  infrastructure_encryption_enabled = true

  lifecycle {
    ignore_changes = [
      tags
    ]
  }

  location = var.vnet_location
  managed_resource_group_name           = local.azurerm_databricks_managed_resource_group_name
  name                                  = local.azurerm_databricks_workspace_name
  network_security_group_rules_required = "NoAzureDatabricksRules"
  provider                              = azurerm.azure
  public_network_access_enabled         = false
  resource_group_name                   = var.vnet_resource_group_name
  sku                                   = "premium"

  tags = merge(var.application_tags, {
    destructible = "true"
    environment  = var.application_environment
    name         = local.azurerm_databricks_workspace_name
    owner        = var.application_owner
    project      = var.application_project
    resource     = "${local.azurerm_databricks_tags_prefix}-workspace"
  })

  timeouts {
    create = "15m"
    delete = "15m"
    update = "15m"
  }
}

For the cluster inside the workspace, I suggest you spin up a simple git proxy instance like this:

# get databricks node type configurations
data "databricks_node_type" "smallest" {
  local_disk = true
  provider   = databricks.workspace
}

# get databricks spark version configurations
data "databricks_spark_version" "latest_lts" {
  latest            = true
  long_term_support = true
  provider          = databricks.workspace
}

# create a databricks cluster for git proxy
resource "databricks_cluster" "git_proxy" {
  autotermination_minutes = 0
  azure_attributes {
    availability       = "SPOT_WITH_FALLBACK_AZURE"
    first_on_demand    = 1
    spot_bid_max_price = -1
  }

  cluster_name = local.databricks_cluster_name
  custom_tags = {
    "ResourceClass" = "SingleNode"
  }

  node_type_id  = data.databricks_node_type.smallest.id
  num_workers   = 0
  provider      = databricks.workspace
  spark_version = data.databricks_spark_version.latest_lts.id

  spark_conf = {
    "spark.databricks.cluster.profile" : "singleNode",
    "spark.master" : "local[*]",
  }

  spark_env_vars = {
    "GIT_PROXY_ENABLE_SSL_VERIFICATION" : "False"
    "GIT_PROXY_HTTP_PROXY" : var.git_http_proxy,
  }

  timeouts {
    create = "30m"
    update = "30m"
    delete = "30m"
  }
}

# customize the spark cluster with git proxy configuration
resource "databricks_workspace_conf" "this" {
  custom_config = {
    "enableGitProxy" : true
    "gitProxyClusterId" : databricks_cluster.git_proxy.cluster_id
  }
  provider = databricks.workspace
}

One thing I can also add is that everything is modularized for better code handling and quality. The databricks.workspace provider is configured like this, but you can get something working based on the example I pointed you to above:

# databricks provider workspace specific configuration
provider "databricks" {
  alias                       = "workspace"
  azure_client_id             = var.azurerm_client_id
  azure_client_secret         = var.azurerm_client_secret
  azure_tenant_id             = var.azurerm_tenant_id
  azure_workspace_resource_id = module.databricks_workspace[0].databricks_workspace_resource_id
  host                        = module.databricks_workspace[0].databricks_workspace_url
}

I hope this helps you.

gerrytan commented 1 month ago

Hi @ArcTheMaster I tried to reproduce the problem by creating a Databricks workspace and reading data "databricks_spark_version" right after, but I have been unsuccessful after 3 attempts:

ArcTheMaster commented 1 month ago

Hello @magodo, are there any updates?

magodo commented 1 month ago

Hey @ArcTheMaster, @gerrytan is now looking at this. Could you please review Gerry's test setup and log output to confirm whether it matches the case that should reproduce the issue?

ArcTheMaster commented 3 weeks ago

Hello @gerrytan @magodo ,

Sorry for the late response, last week was pretty busy for me.

To answer Gerry: unfortunately, this is not the expected result. I am not sure whether this is related to the fact that the configuration used does not set up the network stack I build before deploying the workspace. I create:

Note that the metastore is already created and I attach the workspace to it, so we are in a Unity Catalog configuration. In your case I see no metastore attachment, which can potentially have an impact on the behavior. Something I also notice is the Azure location: you're using australiasoutheast, whereas in my case I deploy in useast (maybe it is worth testing).
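
For reference, the metastore attachment on my side boils down to something like this simplified sketch (the metastore ID variable, the module output name and the account-level provider alias are placeholders from my modules):

# attach the workspace to the pre-existing unity catalog metastore (simplified sketch)
resource "databricks_metastore_assignment" "this" {
  metastore_id = var.databricks_metastore_id                             # placeholder for our existing metastore
  workspace_id = module.databricks_workspace[0].databricks_workspace_id  # placeholder output exposing the numeric workspace_id
  provider     = databricks.account                                      # account-level provider alias
}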

Can you reuse the example I shared when creating the ticket?

# create the databricks workspace
resource "azurerm_databricks_workspace" "workspace" {

  custom_parameters {
    no_public_ip                                         = true
    private_subnet_name                                  = local.azurerm_databricks_private_subnet_name
    public_subnet_name                                   = local.azurerm_databricks_public_subnet_name
    private_subnet_network_security_group_association_id = local.azurerm_databricks_private_subnet_network_security_group_association_id
    public_subnet_network_security_group_association_id  = local.azurerm_databricks_public_subnet_network_security_group_association_id
    storage_account_name                                 = local.azurerm_databricks_managed_storage_account_name
    storage_account_sku_name                             = "Standard_ZRS"
    virtual_network_id                                   = var.vnet_id
  }
  infrastructure_encryption_enabled = true

  lifecycle {
    ignore_changes = [
      tags
    ]
  }

  location = var.vnet_location
  managed_resource_group_name           = local.azurerm_databricks_managed_resource_group_name
  name                                  = local.azurerm_databricks_workspace_name
  network_security_group_rules_required = "NoAzureDatabricksRules"
  provider                              = azurerm.azure
  public_network_access_enabled         = false
  resource_group_name                   = var.vnet_resource_group_name
  sku                                   = "premium"

  tags = merge(var.application_tags, {
    destructible = "true"
    environment  = var.application_environment
    name         = local.azurerm_databricks_workspace_name
    owner        = var.application_owner
    project      = var.application_project
    resource     = "${local.azurerm_databricks_tags_prefix}-workspace"
  })

  timeouts {
    create = "15m"
    delete = "15m"
    update = "15m"
  }
}

The other API that breaks all the time after deploying the workspace for the first time is this one:

locals {
  system_tables_management_schema = [
    "access",
    "compute",
    "lakeflow",
    "lineage",
    "marketplace",
    "query",
    "storage"
  ]
}

# enable the databricks system tables management
resource "databricks_system_schema" "tables_management" {
  for_each = toset(local.system_tables_management_schema)

  provider = databricks.workspace
  schema   = each.key
}

I can make myself available for a live debug if desired, just send me a PM so we can schedule it.

Thanks again.

gerrytan commented 3 weeks ago

Hi @ArcTheMaster I'm still trying to come up with a minimal HCL config to reproduce the problem. The last snippet you posted above contains a lot of references to other objects which are not shown (subnet, virtual network).

Note that, assuming I can reproduce the problem and it is some sort of eventual consistency issue (i.e. the API claims the workspace is ready when it is not), we will have to submit a bug / feature request to fix / improve the API behaviour. The azurerm Terraform provider is designed to be a thin layer that communicates with these APIs; it should not have additional polling logic added.

Databricks workspace Create / update API reference: https://learn.microsoft.com/en-us/rest/api/databricks/workspaces/create-or-update?view=rest-databricks-2024-05-01&tabs=HTTP

ArcTheMaster commented 3 weeks ago

Hi @gerrytan ,

I did not share all the code on purpose, due to the complexity of our use case. I implemented our Databricks deployment with multiple internal modules (dns, vnet, vhub peering...) and it would not have been secure to share everything; it is corporate and private code.

But the example at this link gives you code close to mine - https://github.com/hashicorp/terraform-provider-azurerm/blob/main/examples/private-endpoint/databricks/private-endpoint/main.tf

My offer of a live debug session still stands! This issue has a big impact on the idempotency and replayability of the code.

gerrytan commented 3 weeks ago

Hi @ArcTheMaster yes, a live debug would be useful for us, although the timezone difference might not be so kind 😅. Reach out to me at gerry.tan-at-microsoft.com to get this organised.

Meanwhile I've done more work to try to reproduce as you suggested, but have not yet been successful:

  1. I deployed a more elaborate setup: a publicly accessible workspace, and a cluster right after that, and it worked fine (main-simple.tf, main-simple-terraform-apply.log)
  2. I did the equivalent on a workspace behind vnet (per example), but I get Error: cannot read spark version: cannot read data spark version: Unauthorized access to workspace: 1689*********** error at terraform plan, I think this is due to my network config though 😞 . (main-vnet.tf, main-vnet-terraform-plan.log)
rhysmg commented 2 weeks ago

Hi Folks, I am trying to create 2x databricks_storage_credential. I was able to create one resource, but not the second, which failed with the following error:

│ Error: cannot create storage credential: The service at /api/2.1/unity-catalog/storage-credentials is temporarily unavailable.

Update I have reproduced with West US 2.
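
Roughly what I am doing, simplified (names and access connector IDs are placeholders):

# two storage credentials created in the same apply (simplified reproduction sketch)
resource "databricks_storage_credential" "cred_a" {
  name     = "cred-a"
  provider = databricks.workspace

  azure_managed_identity {
    access_connector_id = var.access_connector_id_a
  }
}

resource "databricks_storage_credential" "cred_b" {
  name     = "cred-b"
  provider = databricks.workspace

  azure_managed_identity {
    access_connector_id = var.access_connector_id_b
  }
}

With no explicit dependency between them, Terraform creates both in parallel.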

ArcTheMaster commented 1 week ago

Hello @rhysmg ,

This is exactly what I have, not with the databricks_storage_credential resource but with many others. You're fully right, the behavior isn't consistent and varies from time to time.

Adding a delay, or several at different places in the code, does not change the issue. I put a 60-second delay after the azurerm_databricks_workspace resource (called by a module I built) and another one of 180 seconds just after the module call ends and hands control back to the remaining Terraform code. I get the error in two regions, useast and useast2, so this does not seem to be region related.
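
Conceptually the delays look like this (simplified; the first one lives inside the module, the second in the root configuration):

# 60 second delay inside the module, right after the workspace resource (simplified)
resource "time_sleep" "post_workspace_creation" {
  create_duration = "60s"

  depends_on = [
    azurerm_databricks_workspace.workspace
  ]
}

# 180 second delay in the root configuration, right after the module call (simplified)
resource "time_sleep" "post_module_call" {
  create_duration = "180s"

  depends_on = [
    module.databricks_workspace
  ]
}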

@gerrytan sorry for the delay, I am heading to production at the moment and it consumes a lot of my time day to day. Let's see how we can set up a call between us; I will send you an email today to make first contact.

Thank you.

ArcTheMaster commented 1 day ago

Hi Folks, I am trying to create 2x databricks_storage_credential. I was able to create one resource, but not the second, which failed with the following error:

│ Error: cannot create storage credential: The service at /api/2.1/unity-catalog/storage-credentials is temporarily unavailable.

  • I added a 300s wait time after azurerm_databricks_workspace but this did not help
  • I made one resource dependent on the other and it runs smoothly. The same API doesn't like being called twice so quickly?
  • I am seeing this issue with australiaeast. I will try to test with a US location soon.

Update I have reproduced with West US 2.

Hi @rhysmg ,

Just wondering, are you configuring your own vnet, subnets, private endpoints and a custom managed storage account name in your code / use case, rather than relying on a fully provisioned default config?

Also, just thinking out loud: do you set the http_timeout_seconds property on the Databricks provider? I tried overriding it, but unfortunately it does not help the behavior.
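
For clarity, the kind of override I mean is simply adding the property to the provider block shared earlier (the timeout value below is illustrative; raising it did not change the behavior in my tests):

# databricks provider workspace configuration with an increased http timeout
provider "databricks" {
  alias                       = "workspace"
  azure_client_id             = var.azurerm_client_id
  azure_client_secret         = var.azurerm_client_secret
  azure_tenant_id             = var.azurerm_tenant_id
  azure_workspace_resource_id = module.databricks_workspace[0].databricks_workspace_resource_id
  host                        = module.databricks_workspace[0].databricks_workspace_url
  http_timeout_seconds        = 300 # illustrative value
}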