iterative / terraform-provider-iterative

☁️ Terraform plugin for machine learning workloads: spot instance recovery & auto-termination | AWS, GCP, Azure, Kubernetes
https://registry.terraform.io/providers/iterative/iterative/latest/docs
Apache License 2.0
287 stars 27 forks source link

Add support for instance-permission-set in azure #632

Closed tasdomas closed 1 year ago

tasdomas commented 1 year ago

This PR addresses #559

The instance-permission-set is parsed as a comma-separated list of ARM resource ids in the form '/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identityName}'.

dacbd commented 1 year ago

You got https://github.com/iterative/terraform-provider-iterative/issues/235 ✔️ 😉

559 would live here: https://github.com/iterative/terraform-provider-iterative/tree/master/task/az

tasdomas commented 1 year ago

@dacbd ah, I see - I guess I misunderstood the relation between tpi and task

dacbd commented 1 year ago

@dacbd ah, I see - I guess I misunderstood the relation between tpi and task

no worries they both needed to be done at some point 👍 we can get two birds with one PR now

dacbd commented 1 year ago

I believe this is also funcationally unique to az, gcp/aws, barring common terminology, allow multiple policies to be attached to a single entity and one entity to be assigned to the instance.

0x2b3bfa0 commented 1 year ago

🔔 @tasdomas

tasdomas commented 1 year ago

bell @tasdomas

Yeah - looking into this

tasdomas commented 1 year ago

The issue is two-fold here:

  1. the command line permission set value was not being propagated to the task (fixed)
  2. azure's permission system is complex - it's not enough to add a UMI with the necessary assigned roles (still working on this one)
tasdomas commented 1 year ago

Test

I'm getting the following error when performing the test described below, although it's probably just me doing something wrong or the az command-line tool behaving unexpectedly. Can you please take a look?

ERROR: Failed to connect to MSI. Please make sure MSI is configured correctly. 
Get Token request returned http error: 400, reason: Bad Request 

So, this took a while to figure out, there were multiple issues:

  1. the permission_set flag was not propagated to the task definition (fixed)
  2. az login --identity will login with the machine's system-assigned identity by default (if it exists), the user-assigned identity needs to be explicitly specified to login with: az login --identity -u {UMI-id}. This is far from being intuitive, but we have limited ways of working around this [1].

[1]

It appears the az login --identity logs in with the system assigned identity, if it exists, or the user assigned identity if only one is set. Currently, when allocating a machine, we maintain the system assigned identity for the scale set, so any user assigned identities need to be specifically logged-in with.

We could assign no system-managed identity if at least one UMI is specified. But this would make permission_set modify the allocation instead of augmenting it.

0x2b3bfa0 commented 1 year ago

Thank you very much for the detailed explanation, @tasdomas! I didn't even realize that the permission_set wasn't being propagated. 🙈

0x2b3bfa0 commented 1 year ago

With regard to the need of specifying an user, I would suggest to simplify their lives a bit more:

  1. Don't accept a comma-separated list of identities; users will have to end up picking a single one to log in.
  2. Expose the permission_set through an environment variable so users can run az login --identity --username=$PERMISSION_SET or similar.

If you deem this reasonable, feel free to open a follow-up issue or pull request.

0x2b3bfa0 commented 1 year ago

we maintain the system assigned identity for the scale set

What is it useful for? Maybe we can just go ahead and leave only the user-managed identity? E.g.

_We could assign no system-managed identity if at least one UMI is specified. But this would make permissionset modify the allocation instead of augmenting it.

🤔

0x2b3bfa0 commented 1 year ago

Merging with (irrelevant) aws tests failing. Resources, resources... 😅