microsoft / azure_arc

Automated Azure Arc, Edge, and Platform environments
https://aka.ms/ArcJumpstart
Creative Commons Attribution 4.0 International

Arc HCI Box Stalls on Step 7/10 #2464

Closed: matthansen0 closed this issue 3 months ago

matthansen0 commented 3 months ago

Is your issue related to a Jumpstart scenario, ArcBox, HCIBox, or Agora?

HCIBox using azd.

Describe the issue or the bug

Since early last week, deployments of the solution get stuck in the initial script inside the HCIBox-Client at step 7/10, when starting the vm-router. I've let it sit for up to 6 hours and nothing happens. I've attached a screenshot below showing that system utilization is essentially zero.

I've been able to get past this step by opening AzSMGMT in Hyper-V and then logging into the doubly nested vm-router, but the script then gets stuck on other power-on operations and does not complete.

I've run azd down on the environment and tried again (on different days), and I've also fully deleted the git clone to wipe out the environment and redeployed (on multiple days). It still gets stuck on the same step.

To Reproduce

Deploy Jumpstart HCI Box using azd, then connect to the VM and let the automated script run.

Expected behavior

Environment summary

Using Azure Cloud Shell. The same steps have worked since the 23H2 release, but I started having this issue last week.

Have you looked at the Troubleshooting and Logs section?

Screenshots

[image]

dkirby-ms commented 3 months ago

There was a regression last week with the underlying VM image used that seemed to be causing very slow network speeds on the nested VMs. This could be related to that. A fix was pushed late last week to pin the VM host image to a prior release and the slowness issue was resolved. See #2462

matthansen0 commented 3 months ago

Hm, it looks like your fix was merged into main, but I did two separate deployments today with new git clone code and ran into the same issue both times.

matthansen0 commented 3 months ago

The issue persists in my deployments that include the image fix.

dkirby-ms commented 3 months ago

Hi @matthansen0

I just did a new deploy to eastus from the main branch and did not experience this issue. Note that this step normally takes several minutes to complete, and the script will seem to be paused on the "Starting the vm-router" step.

In the past, we have seen instances where users click on the PowerShell window during execution, which pauses the PowerShell script (this is indicated by a visible solid cursor in the PowerShell window). The script will remain paused until manually resumed by the user. Is it possible that is what is happening here?
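The click-to-pause behavior described above matches the Windows console's QuickEdit mode, where a mouse selection blocks console output until the selection is dismissed. As a minimal sketch (not part of the HCIBox scripts, and assuming QuickEdit is indeed the cause), the following Python shows how a process can turn QuickEdit off for its own console through the Win32 `GetConsoleMode`/`SetConsoleMode` calls. The helper names `without_quick_edit` and `disable_quick_edit` are hypothetical.

```python
import sys

# Console input-mode flags from the Windows console API.
ENABLE_QUICK_EDIT_MODE = 0x0040
ENABLE_EXTENDED_FLAGS = 0x0080
STD_INPUT_HANDLE = -10


def without_quick_edit(mode: int) -> int:
    """Return the console input mode with QuickEdit cleared.

    ENABLE_EXTENDED_FLAGS is set as well, because the console only
    honors a QuickEdit change when that flag is present.
    """
    return (mode & ~ENABLE_QUICK_EDIT_MODE) | ENABLE_EXTENDED_FLAGS


def disable_quick_edit() -> bool:
    """Turn off QuickEdit for the current console.

    Returns True on success, False on non-Windows platforms or when
    stdin is not attached to a console (for example, when redirected).
    """
    if sys.platform != "win32":
        return False

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.GetConsoleMode.argtypes = [
        wintypes.HANDLE,
        ctypes.POINTER(wintypes.DWORD),
    ]
    kernel32.SetConsoleMode.argtypes = [wintypes.HANDLE, wintypes.DWORD]

    handle = kernel32.GetStdHandle(STD_INPUT_HANDLE)
    mode = wintypes.DWORD()
    if not kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
        return False
    new_mode = without_quick_edit(mode.value)
    return bool(kernel32.SetConsoleMode(handle, new_mode))
```

If a script is already paused this way, pressing Enter or Esc in the console window normally dismisses the selection and lets it resume.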

matthansen0 commented 3 months ago

I deployed yesterday and today, and both times it froze at the same point. I left it for about 3 hours yesterday and about 45 minutes today. Both times, though, once I opened the AzSMGMT VM in Hyper-V and then logged into the vm-router, the script automatically continued and completed the rest of the steps successfully.