hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

Support for Azure Machine Learning Compute Instances, Compute Clusters and Inference Clusters as new Terraform resources #11190

Closed gro1m closed 3 years ago

gro1m commented 3 years ago

Description

Azure offers the following compute resources to train and serve Machine Learning models:

  1. Compute Instance: runs ML training in an IDE-like environment
  2. Compute Cluster: trains large models
  3. Inference Cluster on Azure Kubernetes Service: serves ML models in a production service

A general description can be found here: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-studio. It would be great if these resources were provided via Terraform (though that is probably also a huge effort).

New or Affected Resource(s)

Potential Terraform Configuration

resource "azurerm_machine_learning_compute_instance" "ml_ci" {
  workspace_name = "aml-workspace"
  compute_name = "my-compute-instance"
  container_registry_name = "myacr"
  key_vault_name = "mykv"
  app_insights_name = "myappinsights"
  location = "westeurope"
  sku = "Basic" # optional
  tags = { "department" = "IT", "cost center" = "mybank" }
  vnet_rg_name = "myrg"
  vnet_name = "myvnet"
  subnet_name = "mysubnet"
  remoteLoginPortPublicAccess = "Disabled"
  vm_size = "Standard_NC6" # GPU
  vm_priority = "Dedicated"
  access = [<object_ids of users or managed identities>] # maybe a set would be better, as order does not matter here
  initial_run_status = "Running" # or "Idle" ("Stopped" would probably not make sense)
  run_configuration = { source_directory = <project folder>, script = "train.py", compute_target = "my-compute", environment = ... }
}

It is of course debatable whether submitting a run to a compute instance can or should be covered by IaC at all, whether it is feasible, and whether it belongs in the same resource. This concerns the initial_run_status and run_configuration parameters.
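
On the access comment in the block above: a minimal sketch of how a set-typed input could look on the caller side, assuming the proposed resource accepted a set of strings. The variable name ml_access_object_ids and the placeholder IDs are made up for illustration and are not part of the provider.

```hcl
# Hypothetical input variable (name is illustrative only).
# set(string) ignores ordering and duplicates, which matches the
# "order does not matter" remark in the proposal.
variable "ml_access_object_ids" {
  description = "Object IDs of users or managed identities allowed to use the compute."
  type        = set(string)
  default     = []
}

locals {
  # toset() converts a list to a set and drops the duplicate entry.
  # The IDs below are placeholders, not real object IDs.
  example_access = toset(["object-id-1", "object-id-2", "object-id-1"])
}
```

With a set, reordering the IDs in the configuration would not show up as a change in the plan, which is the main practical argument for it over a list.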

resource "azurerm_machine_learning_compute_cluster" "ml_cc" {
  compute_name = "my-compute-cluster"
  min_node_count = 0
  max_node_count = 2
  node_idle_time_before_scale_down = "P0Y0M0DT0H2M"
  vm_size = "Standard_NC6" # GPU
  vm_priority = "Dedicated"
  remoteLoginPortPublicAccess = "Disabled" # corresponds to "enable SSH access" = false
  vnet_rg_name = "myrg"
  vnet_name = "myvnet"
  subnet_name = "mysubnet"
  admin_user_name = {<object_ids of admins>}
  admin_user_password = {<admin user password>}
  access = [<object_ids of users or managed identities>] # maybe a set would be better, as order does not matter here
  initial_run_status = "Running" # or "Idle" ("Stopped" would probably not make sense)
  run_configuration = { source_directory = <project folder>, script = "train.py", compute_target = "my-compute", environment = ... }
}

As with the compute instance, it is debatable whether submitting a run to a compute cluster should be covered by IaC at all, whether it is feasible, and whether it belongs in the same resource. This again concerns the initial_run_status and run_configuration parameters.

resource "azurerm_machine_learning_aks_inference_cluster" "ml_aks_inference" {
  workspace_name = "aml-workspace"
  web_service_name = "my-aml-webservice"
  location = "westeurope"
  environment_name = "AzureML-Scikit-learn-0.20.3" # https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-environments, https://docs.microsoft.com/en-us/azure/machine-learning/resource-curated-environments
  environment_version = "3"
  driver_program = "predict.py"
  model_configuration = { name = "my_model.pkl", path = "models/my_model.pkl", framework = "ScikitLearn", framework_version = "0.20.3" }
  scoring_timeout_ms = 1
  app_insights_enabled = false
  auth_enabled = false
  aad_auth_enabled = false
  compute_name = "ml_aks_inference_cluster"
  kubernetes_service = <reference to aks cluster resource or data source>
}

Question: How much of the Azure Kubernetes cluster configuration could be reused here, i.e. from the azurerm_kubernetes_cluster and azurerm_kubernetes_cluster_node_pool resources and data sources? It does not seem sensible to me to redefine the AKS cluster inside this resource, but I am not sure how much would be needed here apart from naming and model-specific configuration.
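
One way the reuse could look, as a rough sketch only: the AKS cluster is defined or looked up elsewhere, and the inference cluster resource just takes a reference to it. The azurerm_kubernetes_cluster data source exists in the provider and exports id; the azurerm_machine_learning_aks_inference_cluster resource and its kubernetes_service argument are the hypothetical names from this proposal, and the cluster name "my-aks" is an assumption.

```hcl
# Look up an AKS cluster that is managed elsewhere.
# "my-aks" is a placeholder name; "myrg" is the resource group used above.
data "azurerm_kubernetes_cluster" "existing" {
  name                = "my-aks"
  resource_group_name = "myrg"
}

# Hypothetical resource and arguments taken from this proposal;
# they are not an existing provider API.
resource "azurerm_machine_learning_aks_inference_cluster" "ml_aks_inference" {
  workspace_name = "aml-workspace"
  compute_name   = "ml_aks_inference_cluster"
  location       = "westeurope"

  # Pass only a reference to the cluster instead of repeating its configuration.
  kubernetes_service = data.azurerm_kubernetes_cluster.existing.id
}
```

Under this design, node pools and networking would stay with the azurerm_kubernetes_cluster and azurerm_kubernetes_cluster_node_pool resources, and the inference cluster resource would carry only the ML-specific settings.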

gro1m commented 3 years ago

Split into three separate issues (for clarity of reference and because the resources can be developed independently of each other).
