pulumi / pulumi-azure-native

Azure Native Provider
Apache License 2.0

Azure Native: Azure Machine Learning Service: Managed Network Provisions #3212

Open Werner-Swart-83 opened 3 months ago

Werner-Swart-83 commented 3 months ago

Hello!

Issue details

We are working with the Azure ML product team and have identified issues when creating Azure ML workspaces using managed isolated networking with Pulumi.

The scenario is as follows:

  1. We create all the storage accounts, key vaults and ACRs needed for the workspace.
  2. We create the workspace with additional outbound network rules (i.e. using Managed Isolated Networking).
  3. We then provision a compute cluster.

Because this is the first time we create a compute cluster, Azure ML creates the Managed Isolated Network and then joins the compute to it. This is where the problem comes in: race conditions and timeouts occur because ARM is trying to create the Managed Isolated Network, create and configure the outbound firewall, and join the compute cluster all at once.

MS has advised us to create the Managed Isolated Network separately, add the outbound rules once it has been created, and only then join the cluster.

For us to do that, we need to add the Provision Managed Network endpoint to the v2-config.json file (if I understand the Contributing.md file correctly).

MS is investigating the issue, but it might take a long time before it is fixed.

Affected area/feature

Automation API


I am happy to do the work if someone will help guide me :)

thomas11 commented 3 months ago

Hi @Werner-Swart-83, thank you for the detailed report and for your offer to help.

Unfortunately, this endpoint isn't trivial to add. Pulumi operates on a resource-based model, whereas this endpoint is only a side-effecting operation on another resource (the workspace). That's why we don't auto-detect it currently.

Further, this operation of enabling the workspace doesn't seem to have an inverse disable operation. This makes it hard to model the stateful CRUD lifecycle.

We could potentially support this by adding a custom implementation for Workspace but that would take a meaningful amount of work.

At this point, I'd suggest using the REST API or an Azure SDK directly, probably within pulumi.apply(), to start this operation. I'm open to other suggestions, of course.
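For illustration, calling the REST API directly could start with something like the sketch below. It only builds the request and does not send it. The `provisionManagedNetwork` path segment, the empty JSON body, and the need to pass an explicit `api_version` are assumptions based on my reading of the Azure REST reference and should be checked against the current docs. `build_provision_request` is a hypothetical helper name, and obtaining the bearer token is left out.

```python
import urllib.request

ARM_ENDPOINT = "https://management.azure.com"


def build_provision_request(subscription_id, resource_group, workspace,
                            api_version, token):
    """Build (but do not send) a POST that asks ARM to provision the
    workspace's managed network.

    Assumption: the operation lives at .../workspaces/{name}/provisionManagedNetwork.
    Verify the path and a supported api-version in the Azure REST reference.
    """
    url = (
        f"{ARM_ENDPOINT}/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.MachineLearningServices/workspaces/{workspace}"
        f"/provisionManagedNetwork?api-version={api_version}"
    )
    return urllib.request.Request(
        url,
        data=b"{}",  # assumption: an empty JSON body is accepted
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

The request could then be sent with `urllib.request.urlopen(req)`. Since this is a long-running operation, the response only confirms that provisioning has started, not that it has finished.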

Werner-Swart-83 commented 3 months ago

Thank you for your reply. The problem is that we have to wait for the managed isolated network to be provisioned before we add the additional outbound network rules and the compute. We could make the REST call, but we would then need to poll until we got an answer back that the managed network is up. I don't think .apply is the best place to do that, and if we run preview we won't see the full extent of what will be created.
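To make the polling concern concrete, here is a minimal sketch of the kind of wait loop this would involve. It is generic on purpose: `wait_until` and its `check` callback are hypothetical names, and how `check` decides that the managed network is up (for example by reading the workspace through the REST API) is an assumption left to the caller.

```python
import time


def wait_until(check, timeout_s=1800.0, interval_s=15.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or the timeout expires.

    check      -- hypothetical callback; returns True once the managed
                  network is reported as provisioned
    sleep/clock are injectable so the loop can be tested without waiting.
    Returns the number of check() calls that were made.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts
        if clock() >= deadline:
            raise TimeoutError(
                f"managed network not ready after {timeout_s} seconds"
            )
        sleep(interval_s)


# Usage with a fake status source: ready on the third poll.
answers = iter([False, False, True])
polls = wait_until(lambda: next(answers), interval_s=0.0)
```

A loop like this blocks whatever callback it runs in, and it would only run during an actual update, which is why the dependent resources would not show up fully in a preview.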