microsoft / azure_arc

Automated Azure Arc, Edge, and Platform environments
https://aka.ms/ArcJumpstart
Creative Commons Attribution 4.0 International

Azure Stack HCI ARM template deployment fails - Failing at cluster validation #2504

Closed mukundansampath closed 5 months ago

mukundansampath commented 6 months ago

Is your issue related to a Jumpstart scenario, ArcBox, HCIBox, or Agora? HCIBox

Describe the issue or the bug

I followed https://azurearcjumpstart.com/azure_jumpstart_hcibox/cloud_deployment. After the PowerShell script successfully created the Azure Arc machines (AzSHOST1, AzSHOST2), I tried to deploy the ARM template and validate the deployment. The validation fails consistently (screenshot attached).

To Reproduce

  1. Copy the ARM template files (hci.json and hci.parameters.json) from VS Code
  2. Deploy in the required resource group with the validate option
  3. The ARM template deployment fails with error -

Exception encountered while adding node to cluster [Resource validation failed. Details: [
{"Code":"ValidationFailed","Message":"Arc extensions installed on Arc Machine /subscriptions/0456a995-2102-4130-82c9-6c9548ec5105/resourceGroups/msam-stackhci-rg/providers/Microsoft.HybridCompute/machines/AzSHOST1 are while required list of mandatory arc extensions are TelemetryAndDiagnostics, DeviceManagementExtension, LcmController","Target":null,"Details":null},
{"Code":"ValidationFailed","Message":"Arc extensions installed on Arc Machine /subscriptions/0456a995-2102-4130-82c9-6c9548ec5105/resourceGroups/msam-stackhci-rg/providers/Microsoft.HybridCompute/machines/AzSHOST2 are while required list of mandatory arc extensions are TelemetryAndDiagnostics, DeviceManagementExtension, LcmController","Target":null,"Details":null},
{"Code":"ValidationFailed","Message":"Arc machines validation failed for /subscriptions/0456a995-2102-4130-82c9-6c9548ec5105/resourceGroups/msam-stackhci-rg/providers/Microsoft.HybridCompute/machines/AzSHOST1, /subscriptions/0456a995-2102-4130-82c9-6c9548ec5105/resourceGroups/msam-stackhci-rg/providers/Microsoft.HybridCompute/machines/AzSHOST2","Target":null,"Details":null}
].] at [
 at Microsoft.AzureStackHCI.ResourceProvider.Services.EdgeDeviceManager.ValidateNodesAsync(ResourceCollection1 machines, IList1 arcMachineIDs, String parentClusterResourceId) in C:__w\1\s\src\rp\Services\EdgeDeviceManager.cs:line 404
 at Microsoft.AzureStackHCI.ResourceProvider.Services.ClusterNodeManager.AddARCNodesToCluster(ClusterDeploymentWorkItem workItem) in C:__w\1\s\src\rp\Services\ClusterNodeManager.cs:line 114
] (Code: NotSpecified)

I verified that the extensions are present on both Azure Arc machines.

Expected behavior

The validation should not fail. I tried twice and hit the same issue both times.

Environment summary bicep % az --version azure-cli 2.59.0

Have you looked at the Troubleshooting and Logs section? yes

Screenshots

Screenshot 2024-04-18 at 2 49 14 PM

Also uploading the HCI client box logs - Logs.zip

Additional context

mukundansampath commented 6 months ago

Also attaching the extensions present in the 2 arc machines....

Screenshot 2024-04-18 at 4 01 30 PM Screenshot 2024-04-18 at 4 01 09 PM
janegilring commented 6 months ago

@mukundansampath 1) Could you check which version of the OS is installed on the client VM?

image

One way to check is to run Start->Run->winver

image

2) Did you also retry the deployment after you verified that the extensions are installed? (I ask because the LcmController extension can take some time.)

mukundansampath commented 6 months ago

Hi @janegilring - Thanks for taking a look -

  1. Screenshot 2024-04-18 at 10 30 37 PM
  2. I tried multiple times after the extensions are installed.

janegilring commented 6 months ago

@mukundansampath After inspecting the logs it seems to be an issue in this section:

#################################################################################################
# - Add required RBAC permission required for the service principal to deploy Azure Stack HCI
#################################################################################################

INFO: Loaded Module 'Az.Authorization'
INFO: Loaded Module 'Az.Accounts'
INFO: Loaded Module 'Az.MSGraph'
New-AzRoleAssignment : Operation returned an invalid status code 'Forbidden'
At C:\HCIBox\HCIBoxLogonScript.ps1:61 char:5
+     New-AzRoleAssignment -RoleDefinitionName "Key Vault Administrator ...
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : CloseError: (:) [New-AzRoleAssignment], ErrorResponseException
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.Resources.NewAzureRoleAssignmentCommand
Account                              SubscriptionName           TenantId                             Environment
-------                              ----------------           --------                             -----------
60e34a37-09f5-4e80-be90-c3d7686cae19 hcs-mcw-azure-subscription b39138ca-3cee-4b4a-a4d6-cd83d9dd62f0 AzureCloud
New-AzRoleAssignment : Operation returned an invalid status code 'Forbidden'
At C:\HCIBox\HCIBoxLogonScript.ps1:61 char:5
+     New-AzRoleAssignment -RoleDefinitionName "Key Vault Administrator ...
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : CloseError: (:) [New-AzRoleAssignment], ErrorResponseException
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.Resources.NewAzureRoleAssignmentCommand

New-AzRoleAssignment : Operation returned an invalid status code 'Forbidden'
At C:\HCIBox\HCIBoxLogonScript.ps1:66 char:5
+     New-AzRoleAssignment -RoleDefinitionName "Storage Account Contrib ...
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : CloseError: (:) [New-AzRoleAssignment], ErrorResponseException
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.Resources.NewAzureRoleAssignmentCommand
New-AzRoleAssignment : Operation returned an invalid status code 'Forbidden'
At C:\HCIBox\HCIBoxLogonScript.ps1:66 char:5
+     New-AzRoleAssignment -RoleDefinitionName "Storage Account Contrib ...
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : CloseError: (:) [New-AzRoleAssignment], ErrorResponseException
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.Resources.NewAzureRoleAssignmentCommand

New-AzRoleAssignment : Operation returned an invalid status code 'Forbidden'
At C:\HCIBox\Generate-ARM-Template.ps1:14 char:1
+ New-AzRoleAssignment -ObjectId $env:spnProviderId -RoleDefinitionName ...
    + CategoryInfo          : CloseError: (:) [New-AzRoleAssignment], ErrorResponseException
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.Resources.NewAzureRoleAssignmentCommand
New-AzRoleAssignment : Operation returned an invalid status code 'Forbidden'
At C:\HCIBox\Generate-ARM-Template.ps1:14 char:1
+ New-AzRoleAssignment -ObjectId $env:spnProviderId -RoleDefinitionName ...

This might indicate that the Service Principal was assigned the Contributor RBAC role rather than Owner, so it does not have permission to assign roles. Since assigning the permissions failed, this is likely the root cause of the ARM validation failure: the Resource Provider is missing the permissions it needs to read the status of the node extensions.

To resolve the issue without redeploying HCIBox, have a look at the commands that assign the permissions here and here, and then assign them either by running the commands manually or via the portal.
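
For anyone who wants to check their own logs for the same pattern, here is a minimal Python sketch. It is not part of HCIBox. It assumes the logon script log has been read into a string, and the helper name find_forbidden_role_assignments is hypothetical. It pairs each 'Forbidden' error line with the script location that PowerShell prints on the next line.

```python
import re

# Error line PowerShell prints when a role assignment request is rejected.
FORBIDDEN = re.compile(r"New-AzRoleAssignment : .*'Forbidden'")
# The "At <script>:<line> char:<column>" line that locates the failing call.
LOCATION = re.compile(r"At (?P<script>.+?):(?P<line>\d+) char:\d+")


def find_forbidden_role_assignments(log_text):
    """Return sorted, de-duplicated (script, line) pairs for rejected role assignments."""
    lines = log_text.splitlines()
    hits = set()
    # Stop one line early so that lines[index + 1] always exists.
    for index, line in enumerate(lines[:-1]):
        if FORBIDDEN.search(line):
            location = LOCATION.match(lines[index + 1].strip())
            if location:
                hits.add((location.group("script"), int(location.group("line"))))
    return sorted(hits)
```

An empty result only means this particular error did not occur. It does not prove that every required role assignment exists.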

mukundansampath commented 6 months ago

Thanks for taking a look, @janegilring. The validation still fails with the exact same error, even after giving the service principal Owner access and retrying the deployment from scratch. We are no longer hitting the permission issue (logs attached: Logs_21Apr2024.zip).

In our organization, one other person has set up the Azure Stack HCI simulator with a different subscription. When I search the access assignments under IAM for my resource group, I see two managed identities: one is his and one is mine (see the creation dates; mine was created today).

Screenshot 2024-04-21 at 9 21 16 PM

Could it be causing this conflict?

Screenshot 2024-04-21 at 9 19 16 PM

Thanks,

janegilring commented 5 months ago

@mukundansampath Thanks for the update. The new deployment logs look good.

"Could it be causing this conflict?"

I do not think that should be an issue, as we are also deploying multiple instances of HCIBox in the same tenant for development purposes without problems.

One thing to check: could you go to the resource group where HCIBox is deployed and verify whether the Microsoft.AzureStackHCI Resource Provider is listed with the RBAC role Azure Connected Machine Resource Manager?

image

mukundansampath commented 5 months ago

@janegilring No. That role assignment was indeed missing

Screenshot 2024-04-23 at 9 22 04 AM

But the SP has that role -

Screenshot 2024-04-23 at 9 22 58 AM

Added this role assignment manually from the portal. That fixed the issue. Deployment validation has succeeded. Proceeding with the deployment. Why was the role assignment missing?

janegilring commented 5 months ago

@mukundansampath Glad the deployment validation succeeded. Looking at the deployment logs you shared, the role was assigned multiple times:

RoleAssignmentName : edbf51ac-e1eb-4eb9-921a-1b39b53eb688
RoleAssignmentId   : /subscriptions/0456a995-2102-4130-82c9-6c9548ec5105/resourceGroups/msam-stackh
                     ci-rg/providers/Microsoft.Authorization/roleAssignments/edbf51ac-e1eb-4eb9-921
                     a-1b39b53eb688
Scope              : /subscriptions/0456a995-2102-4130-82c9-6c9548ec5105/resourceGroups/msam-stackh
                     ci-rg
DisplayName        :
SignInName         :
RoleDefinitionName : Azure Connected Machine Resource Manager
RoleDefinitionId   : f5819b54-e033-4d82-ac66-4fec3cbf3f4c
ObjectId           : 05c316fb-a3fb-41e0-afce-3c7df0f00959
ObjectType         : Unknown
CanDelegate        : False
Description        :
ConditionVersion   :
Condition          :

RoleAssignmentName : 5a007dc9-0319-4566-92ab-124d813d093d
RoleAssignmentId   : /subscriptions/0456a995-2102-4130-82c9-6c9548ec5105/resourceGroups/msam-stackh
                     ci-rg/providers/Microsoft.Authorization/roleAssignments/5a007dc9-0319-4566-92a
                     b-124d813d093d
Scope              : /subscriptions/0456a995-2102-4130-82c9-6c9548ec5105/resourceGroups/msam-stackh
                     ci-rg
DisplayName        :
SignInName         :
RoleDefinitionName : Azure Connected Machine Resource Manager
RoleDefinitionId   : f5819b54-e033-4d82-ac66-4fec3cbf3f4c
ObjectId           : e8a84f30-03ea-4cd0-b49f-67c9f6ae8d3e
ObjectType         : Unknown
CanDelegate        : False
Description        :
ConditionVersion   :
Condition          :

RoleAssignmentName : 25dfe2b8-d442-40ef-9621-686b8061078b
RoleAssignmentId   : /subscriptions/0456a995-2102-4130-82c9-6c9548ec5105/resourceGroups/msam-stackh
                     ci-rg/providers/Microsoft.Authorization/roleAssignments/25dfe2b8-d442-40ef-962
                     1-686b8061078b
Scope              : /subscriptions/0456a995-2102-4130-82c9-6c9548ec5105/resourceGroups/msam-stackh
                     ci-rg
DisplayName        : vmw-hcs-principal-msampathkumar
SignInName         :
RoleDefinitionName : Azure Connected Machine Resource Manager
RoleDefinitionId   : f5819b54-e033-4d82-ac66-4fec3cbf3f4c
ObjectId           : d9397bc1-4321-45b8-afd5-a83130df497a
ObjectType         : ServicePrincipal
CanDelegate        : False
Description        :
ConditionVersion   :
Condition          :

If you run az ad sp list --display-name "Microsoft.AzureStackHCI Resource Provider", does the id of the returned object match the ObjectId of one of the assignments above?

mukundansampath commented 5 months ago

@janegilring Here is the output. I don't see it among the assignments:

msampathkumar@msampathkuJC2D1 createImage % az ad sp list --display-name "Microsoft.AzureStackHCI Resource Provider"
[
  {
    "accountEnabled": true,
    "addIns": [],
    "alternativeNames": [],
    "appDescription": null,
    "appDisplayName": "Microsoft.AzureStackHCI Resource Provider",
    "appId": "1412d89f-b8a8-4111-b4fd-e82905cbd85d",
    "appOwnerOrganizationId": "f8cdef31-a31e-4b4a-93e4-5f571e91255a",
    "appRoleAssignmentRequired": false,
    "appRoles": [],
    "applicationTemplateId": null,
    "createdDateTime": "2021-09-13T05:16:57Z",
    "deletedDateTime": null,
    "description": null,
    "disabledByMicrosoftStatus": null,
    "displayName": "Microsoft.AzureStackHCI Resource Provider",
    "homepage": null,
    "id": "7f47539e-70c8-4ff7-8e78-ac6a386a946b",
    "info": {
      "logoUrl": null,
      "marketingUrl": null,
      "privacyStatementUrl": null,
      "supportUrl": null,
      "termsOfServiceUrl": null
    },
    "keyCredentials": [],
    "loginUrl": null,
    "logoutUrl": null,
    "notes": null,
    "notificationEmailAddresses": [],
    "oauth2PermissionScopes": [],
    "passwordCredentials": [],
    "preferredSingleSignOnMode": null,
    "preferredTokenSigningKeyThumbprint": null,
    "replyUrls": [],
    "resourceSpecificApplicationPermissions": [],
    "samlSingleSignOnSettings": null,
    "servicePrincipalNames": [
      "1412d89f-b8a8-4111-b4fd-e82905cbd85d",
      "https://sea-azurestackhci-rp.azurewebsites.net"
    ],
    "servicePrincipalType": "Application",
    "signInAudience": "AzureADMultipleOrgs",
    "tags": [],
    "tokenEncryptionKeyId": null,
    "verifiedPublisher": {
      "addedDateTime": null,
      "displayName": null,
      "verifiedPublisherId": null
    }
  }
]
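
The comparison can also be done mechanically rather than by eye. The following is only an illustrative Python sketch: matching_assignments is a hypothetical helper name, the JSON is trimmed to the two fields that matter, and the ObjectId values are the three copied from the role assignments in the deployment log.

```python
import json


def matching_assignments(sp_list_json, assignment_object_ids):
    """Return provider object ids that also appear among the role assignment ObjectIds."""
    provider_ids = {sp["id"] for sp in json.loads(sp_list_json)}
    return sorted(provider_ids.intersection(assignment_object_ids))


# Output of `az ad sp list --display-name ...`, trimmed for this example.
sp_list_output = """
[
  {
    "displayName": "Microsoft.AzureStackHCI Resource Provider",
    "id": "7f47539e-70c8-4ff7-8e78-ac6a386a946b"
  }
]
"""

# ObjectId values of the three role assignments found in the deployment log.
assigned_object_ids = [
    "05c316fb-a3fb-41e0-afce-3c7df0f00959",
    "e8a84f30-03ea-4cd0-b49f-67c9f6ae8d3e",
    "d9397bc1-4321-45b8-afd5-a83130df497a",
]

print(matching_assignments(sp_list_output, assigned_object_ids))  # prints []
```

The empty list agrees with what the portal showed: none of the three assignments was made to the resource provider's object id.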

janegilring commented 5 months ago

@mukundansampath Thanks. Could you check your parameters file and see whether it contains the value 7f47539e-70c8-4ff7-8e78-ac6a386a946b for the spnProviderId parameter?

mukundansampath commented 5 months ago

@janegilring No, I don't see it there either: "spnProviderId": { "value": "d9397bc1-4321-45b8-afd5-a83130df497a" },

Screenshot 2024-04-24 at 12 35 43 PM Screenshot 2024-04-24 at 12 35 16 PM
janegilring commented 5 months ago

@mukundansampath Then I think we have found the culprit.

If I understand correctly, you set the spnProviderId parameter to the object id of the SPN used for the deployment. The correct value is the id of the Microsoft.AzureStackHCI Resource Provider, which is 7f47539e-70c8-4ff7-8e78-ac6a386a946b.

The guidance for populating this parameter is available here:

image
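
A small pre-flight check could catch this kind of mix-up before deploying. The Python sketch below is an illustration only and not part of HCIBox. It assumes the standard ARM parameters file layout, with values under a top-level "parameters" object, and check_spn_provider_id is a hypothetical helper name. Since a service principal's object id is specific to a tenant, the expected value should come from az ad sp list in your own tenant rather than being copied from this thread.

```python
import json


def check_spn_provider_id(parameters_json, provider_object_id):
    """Return (is_correct, configured_value) for the spnProviderId parameter."""
    document = json.loads(parameters_json)
    configured = (
        document.get("parameters", {}).get("spnProviderId", {}).get("value")
    )
    return configured == provider_object_id, configured


# Parameters file trimmed to the one parameter discussed in this thread.
parameters_text = """
{
  "parameters": {
    "spnProviderId": { "value": "d9397bc1-4321-45b8-afd5-a83130df497a" }
  }
}
"""

# Object id returned by `az ad sp list` for the resource provider in this tenant.
expected = "7f47539e-70c8-4ff7-8e78-ac6a386a946b"

is_correct, configured = check_spn_provider_id(parameters_text, expected)
print(is_correct, configured)  # prints False d9397bc1-4321-45b8-afd5-a83130df497a
```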

mukundansampath commented 5 months ago

My bad. Phew. This took me around in circles. Closing the bug. Thanks for the patient help, janegilring.

mukundansampath commented 5 months ago

Closing - Bad input for the spnProviderId

mukundansampath commented 5 months ago

One suggestion though @janegilring. Can we rename the param from spnProviderId to something else like stackHciProviderId?

janegilring commented 5 months ago

@mukundansampath Thanks for the suggestion. We will give it some thought for future iterations, but I suspect it might be considered a breaking change.