microsoft / azure_arc

Automated Azure Arc, Edge, and Platform environments
https://aka.ms/ArcJumpstart
Creative Commons Attribution 4.0 International

Custom deployment failed #2530

Closed potejasw closed 5 months ago

potejasw commented 6 months ago

Is your issue related to a Jumpstart scenario? HCIBox

Describe the issue or the bug OperationTimeout , No updates received from device for operation

{"code":"ArcOperationTimedOut","target":"/subscriptions/3f3df5ee-74f3-4aa8-83d2-fa6558733b45/resourceGroups/PCTHCIBOX-rg/providers/Microsoft.HybridCompute/machines/AzSHOST1","message":"OperationTimeout , No updates received from device for operation: [providers/microsoft.azurestackhci/locations/EASTUS/operationStatuses/98438b4f-e55d-4580-9649-82be41c323d9*E803284F08085E1E43A65AF9A5F9852A3E3D07A9C81917EE350B85B2BFC1CABF?api-version=2023-08-01-preview] beyond timeout of [600000] ms"}

Raw error: { "code": "ArcOperationTimedOut", "target": "/subscriptions/3f3df5ee-74f3-4aa8-83d2-fa6558733b45/resourceGroups/PCTHCIBOX-rg/providers/Microsoft.HybridCompute/machines/AzSHOST1", "message": "OperationTimeout , No updates received from device for operation: [providers/microsoft.azurestackhci/locations/EASTUS/operationStatuses/98438b4f-e55d-4580-9649-82be41c323d9*E803284F08085E1E43A65AF9A5F9852A3E3D07A9C81917EE350B85B2BFC1CABF?api-version=2023-08-01-preview] beyond timeout of [600000] ms" }
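For readers triaging this payload, here is a minimal Python sketch that pulls out the parts that matter: the error code, the Arc machine that stopped reporting, and the timeout converted to minutes. The payload is copied from the report above; the function name `summarize_arc_error` and the returned keys are illustrative assumptions, not part of any Azure SDK.

```python
import json
import re

# Error payload copied verbatim from the report above.
raw_error = (
    '{"code":"ArcOperationTimedOut",'
    '"target":"/subscriptions/3f3df5ee-74f3-4aa8-83d2-fa6558733b45/resourceGroups/PCTHCIBOX-rg'
    '/providers/Microsoft.HybridCompute/machines/AzSHOST1",'
    '"message":"OperationTimeout , No updates received from device for operation: '
    '[providers/microsoft.azurestackhci/locations/EASTUS/operationStatuses/'
    '98438b4f-e55d-4580-9649-82be41c323d9*E803284F08085E1E43A65AF9A5F9852A3E3D07A9C81917EE350B85B2BFC1CABF'
    '?api-version=2023-08-01-preview] beyond timeout of [600000] ms"}'
)


def summarize_arc_error(payload: str) -> dict:
    """Hypothetical helper: reduce the error JSON to a few triage fields."""
    error = json.loads(payload)
    # The last segment of the resource ID in "target" is the machine name.
    machine = error["target"].rsplit("/", 1)[-1]
    # The timeout is reported in milliseconds inside square brackets.
    match = re.search(r"beyond timeout of \[(\d+)\] ms", error["message"])
    timeout_minutes = int(match.group(1)) / 60000 if match else None
    return {
        "code": error["code"],
        "machine": machine,
        "timeout_minutes": timeout_minutes,
    }


summary = summarize_arc_error(raw_error)
print(summary)
```

Read this way, the error says that the node AzSHOST1 sent no update for the operation within the 600000 ms (10 minute) window.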

To Reproduce

Expected behavior Complete the custom deployment of HCI box.

Environment summary Az HCI 23H2

Have you looked at the Troubleshooting and Logs section?

Screenshots

image

image

Additional context HCI deployment.

potejasw commented 6 months ago

Hi team, do you have any update, or has an engineer been assigned?

likamrat commented 6 months ago

Hi @potejasw, thanks for opening the issue. We will have someone assigned to this in a few days, as we are currently getting ready for a few major releases. Thanks for your patience and understanding.

katriendg commented 6 months ago

I also wanted to add that the HCIBox Jumpstart deployed with the CLI option fails after the New-HCIBoxCluster.ps1 script has been running for a few hours at logon. Step 10 fails at validation, and the portal shows the following error message (note that I removed my resource names):

{"code":"UpdateDeploymentSettingsDataFailed","message":"Deployment Settings validation failed.","details":[{"code":"UpdateDeploymentSettingsDataFailed","target":"/subscriptions/[.......]/resourceGroups/[.......]/providers/Microsoft.AzureStackHCI/clusters/hciboxcluster","message":"Failed to create deployment settings. \nValidation status is {Status=Error, Steps={Name=Error, Description=Error executing Request: Validate, FullStepIndex=0, StartTimeUtc=5/7/2024 4:17:38 PM, EndTimeUtc=NA, Status=Error, Exception=Exception: One or more errors occurred. at:   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)\r\n   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)\r\n   at Microsoft.AzureStack.Solution.Deploy.LCMController.ArcCommunication.ActionPlanController.ExecuteRequest(Request request) in C:\\__w\\1\\s\\src\\LCMController\\ArcCommunication\\Source\\LCMController.ArcCommunication\\ActionPlanController.cs:line 379 Base Exception: Failed to fetch secret:LocalAdminCredential from Key Vault https://[[.......]].vault.azure.net with:Response status code does not indicate success: 404 (Not Found). at:   at Microsoft.AzureStack.Solution.Deploy.LCMController.ArcCommunication.ActionPlanController.GetSecret(String keyVaultUri, String secretName) in C:\\__w\\1\\s\\src\\LCMController\\ArcCommunication\\Source\\LCMController.ArcCommunication\\ActionPlanController.cs:line 296\r\n   at Microsoft.AzureStack.Solution.Deploy.LCMController.ArcCommunication.ActionPlanController.<InitAnswerFileAndSecrets>d__9.MoveNext() in C:\\__w\\1\\s\\src\\LCMController\\ArcCommunication\\Source\\LCMController.ArcCommunication\\ActionPlanController.cs:line 253\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.AzureStack.Solution.Deploy.LCMController.ArcCommunication.ActionPlanController.<ExecuteMessagesFromResourceProvider>d__5.MoveNext() in C:\\__w\\1\\s\\src\\LCMController\\ArcCommunication\\Source\\LCMController.ArcCommunication\\ActionPlanController.cs:line 94, Steps=null}}. \nDeployment Status is {Status=, Steps=null}"}]}

The secret LocalAdminCredential does exist in the Key Vault.
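The useful detail is buried in the stack trace, so here is a small Python sketch that extracts the secret name, the vault URI, and the HTTP status from the "Base Exception" sentence. The excerpt is taken from the message above with the vault name still redacted; the regular expression and the helper name `parse_secret_failure` are assumptions made for illustration and are only known to match this one message.

```python
import re

# Excerpt of the error message above; the vault name is redacted in the original.
message = (
    "Base Exception: Failed to fetch secret:LocalAdminCredential from Key Vault "
    "https://[[.......]].vault.azure.net with:Response status code does not "
    "indicate success: 404 (Not Found)."
)

# Assumed message shape, derived only from the single example in this thread.
SECRET_FAILURE = re.compile(
    r"Failed to fetch secret:(?P<secret>\S+)\s+from Key Vault (?P<vault>\S+) "
    r"with:.*?: (?P<status>\d{3}) \((?P<reason>[^)]+)\)"
)


def parse_secret_failure(text):
    """Hypothetical helper: return secret, vault, status and reason, or None."""
    match = SECRET_FAILURE.search(text)
    if match is None:
        return None
    info = match.groupdict()
    info["status"] = int(info["status"])
    return info


failure = parse_secret_failure(message)
print(failure)
```

Because the status is 404 (Not Found) while the secret is reported to exist, it may be worth comparing the vault URI in the message with the vault that was actually inspected. That is a guess at a next step, not a confirmed cause.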

potejasw commented 6 months ago

How do I try the CLI option? We only have the two options below for creating the cluster: the ARM template and the Azure portal.

katriendg commented 6 months ago

@potejasw To clarify, I meant the Azure CLI tutorial (which deploys through Bicep/ARM), not the Azure Developer CLI one: https://azurearcjumpstart.io/azure_jumpstart_hcibox/deployment_az

potejasw commented 6 months ago

@katriendg The HCIBox deployment completed, and I can log in to the VM. However, I have an issue creating the cluster from the ARM template.

potejasw commented 6 months ago

@katriendg Do you have any news for me?

janegilring commented 6 months ago

@potejasw Could you give the following a try?

On the HCI nodes, navigate to C:\ProgramData\GuestConfig\extension_logs\Microsoft.Edge.DeviceManagementExtension\ and check DeviceManagementExtension.log and state.json for any error messages. If none are found, rename the EdgeDevice.txt file to EdgeDevice.old, which will regenerate the latest device information and push it up to the cloud within 15 minutes.
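The steps above can be sketched as a short Python script to run on each node. Several things here are assumptions rather than documented behavior: that EdgeDevice.txt sits in the same folder as the two log files, that an "error message" can be spotted by a case-insensitive search for the word "error", and the helper names `find_error_lines` and `refresh_device_info`.

```python
from pathlib import Path

# Folder named in the comment above.
LOG_DIR = Path(
    r"C:\ProgramData\GuestConfig\extension_logs\Microsoft.Edge.DeviceManagementExtension"
)


def find_error_lines(log_dir: Path) -> list[str]:
    """Return lines mentioning 'error' in the two files named above."""
    hits = []
    for name in ("DeviceManagementExtension.log", "state.json"):
        path = log_dir / name
        if not path.exists():
            continue
        lines = path.read_text(errors="replace").splitlines()
        for number, line in enumerate(lines, start=1):
            # Assumption: a plain substring match is enough to flag a problem.
            if "error" in line.lower():
                hits.append(f"{name}:{number}: {line.strip()}")
    return hits


def refresh_device_info(log_dir: Path) -> list[str]:
    """Rename EdgeDevice.txt to EdgeDevice.old only when no errors were found."""
    hits = find_error_lines(log_dir)
    device_file = log_dir / "EdgeDevice.txt"  # assumed to live in log_dir
    if not hits and device_file.exists():
        # replace() overwrites a leftover EdgeDevice.old from an earlier attempt.
        device_file.replace(log_dir / "EdgeDevice.old")
    return hits


if __name__ == "__main__":
    for hit in refresh_device_info(LOG_DIR):
        print(hit)
```

If the script prints any lines, those are the entries to review before renaming anything. If it prints nothing and the file was present, it has been renamed, and per the comment above the refreshed device information should reach the cloud within 15 minutes.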

potejasw commented 5 months ago

@janegilring I tried the action plan above, then retried deploying the cluster with the ARM template, and it failed with the error below.

{"code":"UpdateDeploymentSettingsDataFailed","message":"Deployment Settings validation failed.","details":[{"code":"UpdateDeploymentSettingsDataFailed","target":"/subscriptions/xxxxxxxxxxxxxxx/resourceGroups/xxxxxx-rg/providers/Microsoft.AzureStackHCI/clusters/hciboxcluster","message":"Failed to create deployment settings. \nValidation status is {Status=Error, Steps={Name=SetRegistrationParametersInECEForCloudDeployment, Description=Set Registration parameters in ECE for cloud deployment., FullStepIndex=0, StartTimeUtc=2024-05-20T09:33:31, EndTimeUtc=2024-05-20T09:33:46, Status=Success, Exception=, Steps=}, {Name=InvokeEnvironmentChecker, Description=Invoke Environment Checker action plan., FullStepIndex=1, StartTimeUtc=2024-05-20T09:33:46, EndTimeUtc=2024-05-20T09:33:50, Status=Error, Exception=System.Collections.Generic.List`1[System.String], Steps=}}. \nDeployment Status is {Status=, Steps=null}"}]}

janegilring commented 5 months ago

@potejasw Thanks for the update. At this point I would suggest deleting the resource group, running git pull in your local Jumpstart folder, and trying a fresh deployment.

janegilring commented 5 months ago

@potejasw Did you need further assistance or can we close this issue?

potejasw commented 5 months ago

@janegilring Please close this. I think the issue was that the HCIBox deployment was only partially completed.

I was able to create a new HCIBox, which served my purpose.

robmcfadden81 commented 3 days ago

Running into the same issue. Is a complete reinstall the only resolution here?

{"code":"UpdateDeploymentSettingsDataFailed","message":"Deployment Settings validation failed.","details":[{"code":"UpdateDeploymentSettingsDataFailed","target":"/subscriptions/74c145ff-befd-465e-884d-2cc841b0d9f1/resourceGroups/hci-rg/providers/Microsoft.AzureStackHCI/clusters/hciboxcluster","message":"Failed to create deployment settings. \nValidation status is {Status=Error, Steps={Name=SetRegistrationParametersInECEForCloudDeployment, Description=Set Registration parameters in ECE for cloud deployment., FullStepIndex=10, StartTimeUtc=2024-11-11T20:09:31, EndTimeUtc=2024-11-11T20:09:42, Status=Success, Exception=, Steps=}, {Name=SetObservabilityNodeNameAndVersion, Description=Set assembly version reg key for telemetry config, FullStepIndex=20, StartTimeUtc=2024-11-11T20:09:42, EndTimeUtc=2024-11-11T20:09:46, Status=Success, Exception=, Steps=}, {Name=InvokeEnvironmentChecker, Description=Invoke Environment Checker action plan., FullStepIndex=30, StartTimeUtc=2024-11-11T20:09:46, EndTimeUtc=2024-11-11T20:09:51, Status=Error, Exception=System.Collections.Generic.List`1[System.String], Steps=}}. \nDeployment Status is {Status=, Steps=null}"}]}
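To make the validation status above easier to read, here is a minimal Python sketch that splits it into steps and reports which one failed. The text is copied from the message above; the regular expression is an assumption based only on the step layouts that appear in this thread, and `parse_steps` is a hypothetical helper.

```python
import re

# "Validation status" portion of the message above.
validation_status = (
    "Validation status is {Status=Error, Steps={"
    "Name=SetRegistrationParametersInECEForCloudDeployment, "
    "Description=Set Registration parameters in ECE for cloud deployment., "
    "FullStepIndex=10, StartTimeUtc=2024-11-11T20:09:31, "
    "EndTimeUtc=2024-11-11T20:09:42, Status=Success, Exception=, Steps=}, "
    "{Name=SetObservabilityNodeNameAndVersion, "
    "Description=Set assembly version reg key for telemetry config, "
    "FullStepIndex=20, StartTimeUtc=2024-11-11T20:09:42, "
    "EndTimeUtc=2024-11-11T20:09:46, Status=Success, Exception=, Steps=}, "
    "{Name=InvokeEnvironmentChecker, "
    "Description=Invoke Environment Checker action plan., "
    "FullStepIndex=30, StartTimeUtc=2024-11-11T20:09:46, "
    "EndTimeUtc=2024-11-11T20:09:51, Status=Error, "
    "Exception=System.Collections.Generic.List`1[System.String], Steps=}}."
)

# Assumed field order, taken from the messages posted in this thread.
STEP_PATTERN = re.compile(
    r"Name=(?P<name>\w+), Description=(?P<description>.*?), "
    r"FullStepIndex=(?P<index>\d+), StartTimeUtc=(?P<start>[^,]+), "
    r"EndTimeUtc=(?P<end>[^,]+), Status=(?P<status>\w*), "
    r"Exception=(?P<exception>.*?), Steps="
)


def parse_steps(text: str) -> list[dict]:
    """Hypothetical helper: one dict per validation step found in the text."""
    return [m.groupdict() for m in STEP_PATTERN.finditer(text)]


steps = parse_steps(validation_status)
failed = [step for step in steps if step["status"] == "Error"]
for step in failed:
    print(step["index"], step["name"], step["exception"])
```

In this payload the only failing step is InvokeEnvironmentChecker. Its Exception field holds the .NET type name of a string list rather than the list's contents, so the error text itself does not say which environment check failed.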