hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0
4.6k stars 4.64k forks source link

azurerm_kubernetes_cluster_extension does not work with "Microsoft.AzureML.Kubernetes" #24547

Open AhrazA opened 9 months ago

AhrazA commented 9 months ago

Is there an existing issue for this?

Community Note

Terraform Version

1.3.3

AzureRM Provider Version

3.80.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster_extension

Terraform Configuration Files

resource "azurerm_kubernetes_cluster_extension" "ml_extension" {
  name           = "machine-learning"
  cluster_id     = azurerm_kubernetes_cluster.this.id
  extension_type = "Microsoft.AzureML.Kubernetes"

  configuration_settings = {
    "enableInference"              = "True",
    "inferenceRouterServiceType"   = "LoadBalancer",
    "internalLoadBalancerProvider" = "azure",
    "sslSecret"                    = "secretname",
    "sslCname"                     = "some.example.local"
    "inferenceRouterHA"            = "False",
  }
}

Debug Output/Panic Output

Output from Terraform:

Error: creating Scoped Extension (Scope: "/subscriptions/XXXXXX/resourceGroups/XXXXX/providers/Microsoft.ContainerService/managedClusters/XXXXX"
Extension Name: "machine-learning"): polling after Create: polling failed: the Azure API returned the following error:

Status: "Failed"
Code: "ExtensionOperationFailed"
Message: "The extension operation failed with the following error:  Error: [ InnerError: [Helm installation failed : Unable to create/update Kubernetes resources for the extension : Recommendation Please check that there are no policies blocking the resource creation/update for the extension. Please refer to this URL for more details on this error: aka.ms/bing : InnerError [release machine-learning failed, and has been uninstalled due to atomic being set: failed pre-install: pod healthcheck failed]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: Troubleshooting: https://aka.ms/arcmltsg."
Activity Id: ""

---

API Response:

----[start]----
{"id":"removed","name":"XXXXXXX","status":"Failed","error":{"code":"ExtensionOperationFailed","message":"The extension operation failed with the following error:  Error: [ InnerError: [Helm installation failed : Unable to create/update Kubernetes resources for the extension : Recommendation Please check that there are no policies blocking the resource creation/update for the extension. Please refer to this URL for more details on this error: aka.ms/bing : InnerError [release machine-learning failed, and has been uninstalled due to atomic being set: failed pre-install: pod healthcheck failed]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit: https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: Troubleshooting: https://aka.ms/arcmltsg.","additionalInfo":[]}}
-----[end]-----

  with module.aks.azurerm_kubernetes_cluster_extension.ml_extension,
  on ../ml.tf line 85, in resource "azurerm_kubernetes_cluster_extension" "ml_extension":
  85: resource "azurerm_kubernetes_cluster_extension" "ml_extension" {

creating Scoped Extension (Scope: "/subscriptions/XXXXX/resourceGroups/XXXXX.ContainerService/managedClusters/XXXXX"
Extension Name: "machine-learning"): polling after Create: polling failed: the Azure API returned the following error:

Status: "Failed"
Code: "ExtensionOperationFailed"
Message: "The extension operation failed with the following error:  Error: [ InnerError: [Helm installation failed : Unable to create/update Kubernetes resources for the extension : Recommendation Please check that there are no policies blocking the resource creation/update for the extension. Please refer to this URL for more details on this error: aka.ms/bing : InnerError [release machine-learning failed, and has
been uninstalled due to atomic being set: failed pre-install: pod healthcheck failed]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit:
https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: Troubleshooting: https://aka.ms/arcmltsg."
Activity Id: ""

---

API Response:

----[start]----
{"id":"/subscriptions/XXXXX/resourceGroups/XXXXXX/providers/Microsoft.ContainerService/ManagedClusters/XXXXXX/providers/Microsoft.KubernetesConfiguration/extensions/machine-learning/operations/XXXXXX","name":"XXXXXX","status":"Failed","error":{"code":"ExtensionOperationFailed","message":"The
extension operation failed with the following error:  Error: [ InnerError: [Helm installation failed : Unable to create/update Kubernetes resources for the extension : Recommendation Please check that there
are no policies blocking the resource creation/update for the extension. Please refer to this URL for more details on this error: aka.ms/bing : InnerError [release machine-learning failed, and has been
uninstalled due to atomic being set: failed pre-install: pod healthcheck failed]]] occurred while doing the operation : [Create] on the config, For general troubleshooting visit:
https://aka.ms/k8s-extensions-TSG, For more application specific troubleshooting visit: Troubleshooting: https://aka.ms/arcmltsg.","additionalInfo":[]}}
-----[end]-----

============================================================================

Output of `kubectl describe configmap -n azureml arcml-healthcheck`:

Name:         arcml-healthcheck
Namespace:    azureml
Labels:       <none>
Annotations:  <none>

Data
====
reports-pre-install:
----

Report:
Duration: 478.005507ms
Date: 2024-01-18 12:34:37.610110532 +0000 UTC m=+0.629805183
K8sDistribution:  
K8sInfrastructure: 
CorrelationId: XXXXXX
TroubleShootingGuide: https://aka.ms/arcmltsg

Status: Failed Name: instancetype ErrorCode: E45006 ErrorMessage: Failed to validate or update instancetype
Status: Passed Name: clusterresource
Status: Passed Name: info
Status: Passed Name: service
Status: Failed Name: sslcheck ErrorCode: E40007 ErrorMessage: Invalid SSL settings
Status: Passed Name: helmchart

reports-pre-delete:
----

Report:
Duration: 568.318303ms
Date: 2024-01-18 12:35:07.808843885 +0000 UTC m=+0.741099689
K8sDistribution:  
K8sInfrastructure: 
CorrelationId: XXXXXX
TroubleShootingGuide: https://aka.ms/arcmltsg

Status: Passed Name: clusterresource
Status: Passed Name: info
Status: Passed Name: service
Status: Failed Name: sslcheck ErrorCode: E40007 ErrorMessage: Invalid SSL settings
Status: Passed Name: helmchart

BinaryData
====

Events:  <none>

Expected Behaviour

The extension "Microsoft.AzureML.Kubernetes" should be successfully applied to the managed AKS cluster.

Actual Behaviour

The ML extension is deployed, but fails to validate the SSL settings.

The same configuration works via the CLI:

az k8s-extension create \
  --name "machine-learning" \
  --extension-type Microsoft.AzureML.Kubernetes \
  --config inferenceRouterServiceType=LoadBalancer enableInference=True internalLoadBalancerProvider=azure sslSecret=secretname sslCname="some.example.local" \
  --cluster-type managedClusters \
  --cluster-name XXXXXXX \
  --resource-group XXXXXXX --scope cluster

Steps to Reproduce

  1. Provision managed AKS cluster via terraform
  2. Apply the azurerm_kubernetes_cluster_extension with extension_type Microsoft.AzureML.Kubernetes and with the sslSecret option

Important Factoids

No response

References

It seems very similar to this issue: https://github.com/hashicorp/terraform-provider-azurerm/issues/15011

I tried to workaround this issue using the azapi_resource approach as described in that issue, however I ran into the same problem.

Doing some digging into the generated healthcheck-config configmap output and comparing that to how it functions when it "works" with the CLI, I was able to identify several parameters in the resulting configuration that were not the same.

It seems as though the configuration processing applied via az-cli here (https://github.com/Azure/azure-cli-extensions/blob/cf183a48b210ff6e7b33af806d4604d9d8c25fdd/src/k8s-extension/azext_k8s_extension/partner_extensions/AzureMLKubernetes.py#L138) is not being applied when done via Terraform.

jmasengeshomsft commented 8 months ago

I observed the same thing. In addition, tf installs the relayserver on AKS which was not supposed to be the case according to the documentation. I guessed the cluster type was not being addressed.

kaluzaaa commented 8 months ago

@AhrazA @jmasengeshomsft I managed use azurerm_kubernetes_cluster_extension for ml using this code:

resource "azurerm_kubernetes_cluster_extension" "aml" {
  name           = "aml"
  cluster_id     = azurerm_kubernetes_cluster.aks.id
  extension_type = "microsoft.azureml.kubernetes"
  configuration_settings = {
    "enableTraining"                                        = "false"
    "enableInference"                                       = "true"
    "inferenceRouterServiceType"                            = "LoadBalancer"
    "internalLoadBalancerProvider"                          = "azure"
    "allowInsecureConnections"                              = "true"
    "InferenceRouterHA"                                     = "false"
    "cluster_name"                                          = azurerm_kubernetes_cluster.aks.id
    "domain"                                                = "${azurerm_resource_group.aml.location}.cloudapp.azure.com"
    "location"                                              = azurerm_resource_group.aml.location
    "jobSchedulerLocation"                                  = azurerm_resource_group.aml.location
    "cluster_name_friendly"                                 = azurerm_kubernetes_cluster.aks.name
    "servicebus.enabled"                                    = "false"
    "relayserver.enabled"                                   = "false"
    "nginxIngress.enabled"                                  = "true"
    "clusterId"                                             = azurerm_kubernetes_cluster.aks.id
    "prometheus.prometheusSpec.externalLabels.cluster_name" = azurerm_kubernetes_cluster.aks.id
  }
}