hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

Support for Scheduling and Inline setup script for Azure ML Compute Instance #15539

Open RoozbehBandpey opened 2 years ago

RoozbehBandpey commented 2 years ago

Description

Azure ML offers scheduling and setup scripts for compute instance creation. Our current workaround is to apply these changes with post-provisioning scripts. ARM templates can be found here: https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.machinelearningservices/machine-learning-compute-create-computeinstance

It would be great to be able to do this with Terraform.

New or Affected Resource(s)

Potential Terraform Configuration

resource "azurerm_machine_learning_compute_instance" "compute_instance" {
  name                          = "ci-dev"
  location                      = azurerm_resource_group.mlw.location
  machine_learning_workspace_id = azurerm_machine_learning_workspace.mlw.id
  virtual_machine_size          = "STANDARD_D4_V2"
  subnet_resource_id            = var.training_subnet_id

  schedule = [
    {
      day = "mon"
      time = "08:00"
      timeZone = "UTC"
      action = "start"
    },
    {
      day = "mon"
      time = "20:00"
      timeZone = "UTC"
      action = "stop"
    },
    {
      day = "tue"
      time = "08:00"
      timeZone = "UTC"
      action = "start"
    },
    {
      day = "tue"
      time = "18:00"
      timeZone = "UTC"
      action = "stop"
    },
    ...
  ]
  setup_script = {
    inlineScript = "pip install xyz"
    arguments = [
      "arg1",
      "arg2",
      "arg3",
     ...
    ]
  }
}

ms-henglu commented 2 years ago

Hi @RoozbehBandpey, thank you for taking the time to report this issue.

I checked the requested features. Unfortunately, there are some issues with supporting them:

  1. setupScripts: changes are planned for the v2 API but have not been decided yet, so the service team wants us to hold off on supporting this feature.
  2. schedules: I couldn't find a definition for this in https://github.com/Azure/azure-rest-api-specs/blob/main/specification/machinelearningservices/resource-manager/Microsoft.MachineLearningServices/stable/2021-07-01/machineLearningServices.json. I'll ask the service team whether something is missing.

chamilad commented 2 years ago

@ms-henglu Thanks! Did you get any response from the team? This would become an important feature since most teams would want to manage their compute costs in ML.

IIUC the API to wrap the schedules section should be https://docs.microsoft.com/en-us/rest/api/azureml/compute/update-schedules.

ms-henglu commented 2 years ago

Hi @chamilad,

Sorry for the late reply. The schedules property only exists in api-version 2021-03-01-preview (azurerm currently uses 2021-07-01) and has not been added to a stable api-version yet, so we can't support this feature.

chamilad commented 2 years ago

Thanks @ms-henglu ! I'll keep a look out for updates.

MrWhiteABEX commented 1 year ago

Is there work in progress on this issue? If I understand machine_learning_compute_instance_resource.go correctly, the api-version is now 2022-05-01, so this should now be possible. This issue is a showstopper for us: we need to recreate the compute instances frequently in order to update them, so all manual changes are frequently lost.

marrrcin commented 1 year ago

@ms-henglu any update on this?

tgalentinesr-Insight commented 10 months ago

This is still an issue, can we please get an update on this?

mstetka-fr commented 8 months ago

Can the upstream/microsoft tag be removed from this issue? These properties have been supported by the REST API for several months now.

Uranium2 commented 4 months ago

Can someone give us an update on this feature?

I don't think you realise how much these features would help: cost savings for users, lower electricity costs for Microsoft, and custom configuration of compute instances (for example, environment variables for people who have to deal with https_proxy and no_proxy).

In my CI/CD pipeline, I may create 10-20 compute instances a day. Yes, I can use the UI to set the schedules and idle shutdown, but it's a waste of time. I also have to write documentation for every user who needs to edit /etc/environment to add the https_proxy and no_proxy environment variables, because we need them to pip install any package or install any Debian package, and even to download the VS Code Server when we want to connect remotely to the compute instance. Data scientists are not DevOps engineers; most of them are not used to editing a protected file and could delete important configuration or files with sudo.

Uranium2 commented 1 month ago

I think I found a workaround in Terraform. We could use azapi_resource to manage the compute instance resource.

https://learn.microsoft.com/en-us/azure/templates/microsoft.machinelearningservices/workspaces/computes?pivots=deployment-language-terraform https://registry.terraform.io/providers/Azure/azapi/latest/docs/resources/azapi_resource

It seems overcomplicated to do it like this, but feasible. I'm not sure whether we need to redefine all properties or whether we can define only schedules and setupScripts.
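A minimal sketch of what the azapi_resource approach might look like, based on the template reference linked above. Several things here are assumptions and not confirmed anywhere in this thread: the API version (2023-04-01), the cron expressions, the azapi 1.x style of passing `body` through `jsonencode` (azapi 2.x takes a plain HCL object instead), and the base64 encoding of the inline script. The resource group, workspace, and subnet variable are borrowed from the configuration proposed in the issue description and are assumed to be defined elsewhere.

```hcl
# Sketch only: API version, schedule values, and script encoding are assumptions.
terraform {
  required_providers {
    azapi = {
      source  = "Azure/azapi"
      version = "~> 1.0" # body = jsonencode(...) is the 1.x style
    }
  }
}

resource "azapi_resource" "compute_instance" {
  # Assumed API version; check the template reference for the current one.
  type      = "Microsoft.MachineLearningServices/workspaces/computes@2023-04-01"
  name      = "ci-dev"
  location  = azurerm_resource_group.mlw.location
  parent_id = azurerm_machine_learning_workspace.mlw.id

  body = jsonencode({
    properties = {
      computeType = "ComputeInstance"
      properties = {
        vmSize = "STANDARD_D4_V2"
        subnet = { id = var.training_subnet_id }

        # Start at 08:00 and stop at 20:00 UTC on weekdays (illustrative values).
        schedules = {
          computeStartStop = [
            {
              action      = "Start"
              triggerType = "Cron"
              status      = "Enabled"
              cron        = { expression = "0 8 * * 1-5", timeZone = "UTC" }
            },
            {
              action      = "Stop"
              triggerType = "Cron"
              status      = "Enabled"
              cron        = { expression = "0 20 * * 1-5", timeZone = "UTC" }
            }
          ]
        }

        # Inline script run once at creation; assumed to be passed base64-encoded.
        setupScripts = {
          scripts = {
            creationScript = {
              scriptSource = "inline"
              scriptData   = base64encode("pip install xyz")
            }
          }
        }
      }
    }
  })
}
```

Written this way, azapi manages the whole compute instance, so every property you care about has to be declared in `body`. The azapi provider also offers azapi_update_resource for patching an existing resource, but whether the compute API accepts schedules or setup scripts on update is not something this thread establishes.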

Also this does not cover Idle shutdown. Here is a response from Microsoft Support Team:

Regarding the query on setup scripts in Terraform: I do see that we have the startup and setup scripts in Terraform now:

https://learn.microsoft.com/en-us/azure/templates/microsoft.machinelearningservices/workspaces/computes?pivots=deployment-language-terraform

Here is the sample script to set up a proxy:
https://github.com/Azure/azureml-examples/blob/main/setup/setup-ci/jupyter-proxy.sh

Regarding idle shutdown, our team confirmed that they have a work item for it, but it's not going to be in this release cycle.