microsoft / azure_arc

Automated Azure Arc, Edge, and Platform environments
https://aka.ms/ArcJumpstart
Creative Commons Attribution 4.0 International

[Regression]Unable to install a new jumpstart HCIBox #2658

Closed ajaysubramanya86 closed 2 months ago

ajaysubramanya86 commented 2 months ago

Note: For ease of issues and pull requests management and tracking, we kindly ask you to provide a meaningful and concise title to this issue and answer all questions to the best of your ability.

Is your issue related to a Jumpstart scenario, ArcBox, HCIBox, or Agora? Jumpstart HCIBox

Describe the issue or the bug: We are unable to install the new Jumpstart HCIBox because of the error below, which occurs in the post-deployment steps. [image]

To Reproduce: We followed the same steps as before to deploy a new Jumpstart HCIBox, using the document below. This was working previously but recently started failing. The failing step is "Deploy cluster in Azure portal" in the documentation: https://azurearcjumpstart.io/azure_jumpstart_hcibox/cloud_deployment

Expected behavior: The post-deployment steps complete successfully.

Environment summary: Latest tools

Have you looked at the Troubleshooting and Logs section?

Screenshots: [image]

Additional context

dkirby-ms commented 2 months ago

This looks like an issue retrieving a secret from keyvault. Did you confirm that permissions on the keyvault are correct as per the prerequisites?

ajaysubramanya86 commented 2 months ago

Thanks Dale. I have added the required prerequisite permissions, Key Vault Administrator and Storage Account Contributor, as shown below, but I am still facing the error. [image]

dkirby-ms commented 2 months ago

Hi @ajaysubramanya86 I have just deployed a fresh HCIBox with AKS cluster and all succeeded as expected.

What happens when you try to "resume deployment" from portal? Do you get the same error?

If you navigate to the keyvault, are you able to see the LocalAdminCredential secret in the keyvault?

Mudassir-23 commented 2 months ago

@dkirby-ms I am also facing the same issue. "Resume deployment" results in the same error. And yes, when I navigate to the key vault I can see the LocalAdminCredential secret.

FYI, the parameters autoDeployClusterResource and autoUpgradeClusterResource were set to false during deployment.

Where can I find logs with more details about the error?

Mudassir-23 commented 2 months ago

@dkirby-ms after a closer look at the exception, I think the name of the secret being fetched is getting modified somewhere. For example, the secret name in the vault is LocalAdminCredential, but as the exception shows, the deployment is trying to fetch hciboxcluster-LocalAdminCredential- . In other words, it fetches each secret as hciboxcluster-${actualSecretName}- .

This is the case for all secrets in the vault; I am not sure where the name is being modified.

After I modified the secrets to the new name format and redeployed, the deployment continued. Any idea where this transformation might be happening?

cc: @ajaysubramanya86
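The mismatch described in this comment can be sketched in a few lines of Python. The `hciboxcluster-<name>-` pattern, including the trailing hyphen, is copied from the exception as reported in this thread. The function names are hypothetical and this is not the deployment's actual logic, only a way to list which requested names would be absent from a vault.

```python
from typing import Dict, List


def prefixed_name(cluster: str, secret: str) -> str:
    """Name the deployment appears to request, as reported in this thread."""
    # The trailing hyphen is reproduced exactly as it appears in the exception.
    return f"{cluster}-{secret}-"


def missing_secrets(cluster: str, vault_secret_names: List[str]) -> Dict[str, str]:
    """Map each unprefixed secret to the prefixed name that is absent from the vault."""
    present = set(vault_secret_names)
    return {
        name: prefixed_name(cluster, name)
        for name in vault_secret_names
        # Skip names that already carry the cluster prefix.
        if not name.startswith(f"{cluster}-")
        and prefixed_name(cluster, name) not in present
    }
```

Calling `missing_secrets("hciboxcluster", names)` with the secret names listed in the vault returns the names the deployment would fail to find. Only LocalAdminCredential is named in this thread, so any other entries in `names` would have to come from the vault itself.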

dkirby-ms commented 2 months ago

HCIBox itself does not touch or modify the secrets in the vault. The vault creation and the creation and naming of the secrets are all handled as part of the Azure Stack HCI cloud deployment process. Renaming the secrets is not part of normal operation or deployment.

I am running yet another deployment this morning and I will check the names of the secrets to see if I notice any changes or anomalies.

Please let me know how the deployment retry proceeds and we can keep this thread updated.

Also, you can find the logs for the cloud deployment in the C:\MASLogs folder on AZSHOST1.
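To narrow down the relevant entries in that folder, a small script can search every file for the error text. This is a minimal sketch: the file names and encoding of the logs are not stated in this thread, so it scans all files and assumes UTF-8 (files in another encoding, such as UTF-16, would not match). The function name and the default search string are illustrative choices.

```python
from pathlib import Path


def find_secret_failures(log_dir: str, needle: str = "Failed to fetch secret"):
    """Yield (file name, line number, line) for every log line containing needle."""
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # The log encoding is an assumption; undecodable bytes are replaced, not raised.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    yield path.name, number, line.strip()
```

On AZSHOST1 this could be run as `for hit in find_secret_failures(r"C:\MASLogs"): print(hit)`.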

dkirby-ms commented 2 months ago

One thing that comes to mind is that key vaults are not permanently deleted unless explicitly purged. I am wondering if we are somehow reusing an old key vault in your HCIBox deployment rather than creating a fresh one at deploy time.

If you can create a fresh HCIBox environment after making sure you have purged (permanently deleted) all previous HCIBox key vaults, that would be one thing to review.
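As a rough sketch of that cleanup, the helper below turns the JSON output of `az keyvault list-deleted --resource-type vault` into one `az keyvault purge` command per matching vault. It assumes each entry in that output carries a `name` field, and the `hcibox-kv` prefix is inferred from the vault URL quoted in this thread; both should be checked against your own environment. The function name is hypothetical.

```python
import json
from typing import List


def purge_commands(list_deleted_json: str, prefix: str = "hcibox-kv") -> List[List[str]]:
    """Build one `az keyvault purge` argument list per soft-deleted vault matching prefix."""
    vaults = json.loads(list_deleted_json)
    return [
        # Assumption: each entry exposes the vault name under "name".
        ["az", "keyvault", "purge", "--name", vault["name"]]
        for vault in vaults
        if vault["name"].startswith(prefix)
    ]
```

The function only builds the commands so they can be reviewed first. Purging cannot be undone, so run each one (for example with `subprocess.run(cmd, check=True)`) only after confirming the vault is no longer needed.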

Mudassir-23 commented 2 months ago

The Azure Stack HCI cloud deployment process completed successfully. I will try to configure AKS and test.

The issue is not with the secrets in the vault; they are as expected. The issue is with the secret names used to fetch them during deployment.

For example, in this exception:

Exception One or more errors occurred. at:
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at Microsoft.AzureStack.Solution.Deploy.LCMController.ArcCommunication.ActionPlanController.ExecuteRequest(Request request)
Base Exception: Failed to fetch secret LocalAdminCredential from key vault. Ex:Could not make the secret fetch call for https://hcibox-kv-86ac1.vault.azure.net/secrets/hciboxcluster-LocalAdminCredential-?

The secret name being fetched is hciboxcluster-LocalAdminCredential-, whereas the secret in the vault is LocalAdminCredential.
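The requested name can also be pulled out of the exception text programmatically, which helps when comparing several failed attempts. A minimal sketch, assuming the exception contains a URL of the form shown above (`https://<vault>.vault.azure.net/secrets/<name>`); the function and pattern names are illustrative.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Matches a Key Vault secret URL such as the one in the exception above.
SECRET_URL = re.compile(r"https://[\w.-]+\.vault\.azure\.net/secrets/[\w-]+")


def requested_secret(exception_text: str) -> Optional[Tuple[str, str]]:
    """Return (vault host, secret name) for the first secret URL found, or None."""
    match = SECRET_URL.search(exception_text)
    if match is None:
        return None
    url = urlparse(match.group(0))
    # The path looks like /secrets/<name>, so the name is the third component.
    return url.netloc, url.path.split("/")[2]
```

Comparing the returned name with the names actually present in the vault shows directly whether the lookup is using a prefixed name.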

dkirby-ms commented 2 months ago

I've now reproduced the issue as well. It appears to be caused by an upstream change to the LCM process. I will continue to investigate whether there is something we can do on the HCIBox side to mitigate it.

dkirby-ms commented 2 months ago

I clicked "Try again" several times and noticed that the secret name being fetched was initially LocalAdminCredential, which was present in my vault, yet the error still occurred. On a subsequent "Try again", the name was hciboxcluster-LocalAdminCredential, which was not present.

dkirby-ms commented 2 months ago

This is caused by a change to the RP (resource provider), which no longer works with the old version we are using. A fix has been submitted and is being tested now.