mattmcspirit / azurestack

Azure Stack Resources

App Service Deployment Failed #128

Closed JonParvez closed 4 years ago

JonParvez commented 4 years ago

Description: When we ran your script, the service providers and the App Service prerequisites completed, but the App Service deployment failed.

[692C:1A40][2020-06-16T19:37:51]i000: [Websites]: Token Expires On: 2020-06-16 20:37:45Z
[692C:1A40][2020-06-16T19:37:51]i000: [Websites]: GET: https://adminmanagement.local.azurestack.external/subscriptions/*****/providers/Microsoft.Compute/locations/local/publishers/MicrosoftWindowsServer/artifacttypes/vmimage/offers/WindowsServer/skus?api-version=2016-03-30
[692C:599C][2020-06-16T19:37:52]e000: [Websites]: System.AggregateException: One or more errors occurred. ---> System.AggregateException: One or more errors occurred. ---> System.Net.WebException: The remote server returned an error: (404) Not Found.
   at Microsoft.Web.Hosting.RetryPolicy.<ExecuteWebActionAsync>d__311.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Web.Hosting.SingleInstaller.Csm.AzureClient.<CallAzureResourceManager>d__129.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Web.Hosting.SingleInstaller.Csm.AzureClient.<GetPlatformImages>d__69.MoveNext()
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at System.Threading.Tasks.Task`1.get_Result()
   at Microsoft.Web.Hosting.SingleInstaller.Views.AzureStackPlatformImageViewModel.LoadPlatformImages()
   at Microsoft.Web.Hosting.SingleInstaller.Views.AzureStackPlatformImageViewModel..ctor(WizardStepState state)
   at Microsoft.Web.Hosting.SingleInstaller.Logic.SilentDeployLogic.<Run>d__42.MoveNext()
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at Microsoft.Web.Hosting.SingleInstaller.WixBA.ProcessCommandLine()
   at Microsoft.Web.Hosting.SingleInstaller.WixBA.Run()
---> (Inner Exception #0) System.AggregateException: One or more errors occurred. ---> System.Net.WebException: The remote server returned an error: (404) Not Found.
   at Microsoft.Web.Hosting.RetryPolicy.<ExecuteWebActionAsync>d__311.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Web.Hosting.SingleInstaller.Csm.AzureClient.<CallAzureResourceManager>d__129.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Web.Hosting.SingleInstaller.Csm.AzureClient.<GetPlatformImages>d__69.MoveNext()
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at System.Threading.Tasks.Task`1.get_Result()
   at Microsoft.Web.Hosting.SingleInstaller.Views.AzureStackPlatformImageViewModel.LoadPlatformImages()
   at Microsoft.Web.Hosting.SingleInstaller.Views.AzureStackPlatformImageViewModel..ctor(WizardStepState state)
   at Microsoft.Web.Hosting.SingleInstaller.Logic.SilentDeployLogic.<Run>d__42.MoveNext()
---> (Inner Exception #0) System.Net.WebException: The remote server returned an error: (404) Not Found.
   at Microsoft.Web.Hosting.RetryPolicy.<ExecuteWebActionAsync>d__311.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Web.Hosting.SingleInstaller.Csm.AzureClient.<CallAzureResourceManager>d__129.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Microsoft.Web.Hosting.SingleInstaller.Csm.AzureClient.<GetPlatformImages>d__69.MoveNext()

Hardware: Meets the minimum requirements for Azure Stack.

The error log file is attached for troubleshooting. Thanks for your service.

AppSvcLog0616-193736.txt

mattmcspirit commented 4 years ago

Hi Jon, did you encounter any issues with the Server 2016 Full Image step?

If you look in the portal - Region Management - Compute - VM Images, does your Windows Server 2016 Datacenter (Not Server Core) image show there? This error:

[692C:1A40][2020-06-16T19:37:51]i000: [Websites]: GET: https://adminmanagement.local.azurestack.external/subscriptions/*****/providers/Microsoft.Compute/locations/local/publishers/MicrosoftWindowsServer/artifacttypes/vmimage/offers/WindowsServer/skus?api-version=2016-03-30
[692C:599C][2020-06-16T19:37:52]e000: [Websites]: System.AggregateException: One or more errors occurred. ---> System.AggregateException: One or more errors occurred. ---> System.Net.WebException: The remote server returned an error: (404) Not Found

suggests it couldn't reach the image, but I guess it could have just been transient.

Have you rerun the script since this error? If not, perhaps run it again and see if it gets past this step next time.

Thanks, Matt

JonParvez commented 4 years ago

Hi Matt, thanks for your quick reply. I appreciate your great work.

I have checked the VM images in the portal, and I can see the Windows Server 2016 Datacenter (Not Server Core) image.

We reran the script and, after 10 hours, hit a new issue. The error showing now is:

[3FD4:3134][2020-06-17T16:12:59]i000: [Websites]: TryLogAppServiceResourceStatus Ends
[3FD4:4AF8][2020-06-17T16:12:59]e000: [Websites]: System.AggregateException: One or more errors occurred. ---> System.Exception: Deployment Failed. Refer deployment logs for more details
   at Microsoft.Web.Hosting.SingleInstaller.Logic.SilentDeployLogic.<Run>d__42.MoveNext()
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at Microsoft.Web.Hosting.SingleInstaller.WixBA.ProcessCommandLine()
   at Microsoft.Web.Hosting.SingleInstaller.WixBA.Run()
---> (Inner Exception #0) System.Exception: Deployment Failed. Refer deployment logs for more details
   at Microsoft.Web.Hosting.SingleInstaller.Logic.SilentDeployLogic.<Run>d__42.MoveNext()<---

All the App Service prerequisites completed, and the server image deployment also completed; only the App Service deployment itself is failing.

I am rerunning the script now. Meanwhile, could you check my error log and suggest a solution?

AppSvcLog0617-121926.txt

mattmcspirit commented 4 years ago

Hi Jon,

So this seems to be an App Service specific error, and could be transient. My script, when it runs, silently launches the AppService.exe, and passes in a load of arguments, along with a JSON file, and the AppService.exe then takes it from there. My script just waits until that's all finished. The log file you sent me is what the AppService.exe generates, and what it shows is this error:

{
  "id": "/subscriptions//resourceGroups/appservice-infra/providers/Microsoft.Resources/deployments/AppService.DeployCloud/operations/507DA26463EFB0F6",
  "operationId": "507DA26463EFB0F6",
  "properties": {
    "provisioningOperation": "Create",
    "provisioningState": "Failed",
    "timestamp": "2020-06-17T16:12:16.7950323Z",
    "duration": "PT1H31M34.9427796S",
    "trackingId": "701c9aef-7af1-4f31-8acf-dd8a85fb0d4d",
    "serviceRequestId": "11766a53-52cd-48cd-a986-f64b128d0459",
    "statusCode": "Conflict",
    "statusMessage": {
      "status": "Failed",
      "error": {
        "code": "ResourceDeploymentFailure",
        "message": "The resource operation completed with terminal provisioning state 'Failed'."
      }
    },
    "targetResource": {
      "id": "/subscriptions//resourceGroups/appservice-infra/providers/Microsoft.Compute/virtualMachines/CN0-VM",
      "resourceType": "Microsoft.Compute/virtualMachines",
      "resourceName": "CN0-VM"
    }
  }
},

and also:

"statuses": [
  {
    "code": "ProvisioningState/failed/VMExtensionProvisioningError/osProvisioningComplete",
    "level": "Error",
    "displayStatus": "OS provisioning complete",
    "message": "Failed to provision VM extensions for VM 'CN0-VM'",
    "time": "2020-06-17T16:12:02.32868+00:00"
  },
  {
    "code": "PowerState/running",
    "level": "Info",
    "displayStatus": "VM running"
  }
]

CN0-VM is a VM that AppService.exe creates and deploys, and it is the controller VM for the App Service RP. The extension executes after VM deployment and runs scripts to finish setting up that VM itself. I've seen instances where this randomly doesn't complete, and sometimes it's down to I/O and performance on the hardware.
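
For readers who want to pick the failing entry out of a status block like the one quoted above, here is a minimal sketch. It assumes the block has been saved as a complete JSON document (the outer braces are added here for that purpose), and the variable names are made up for the example; it is not part of the installer or the deployment script.

```powershell
# Sample statuses, copied from the log excerpt quoted above and wrapped in
# braces so that it parses as a standalone JSON document.
$json = @'
{
  "statuses": [
    {
      "code": "ProvisioningState/failed/VMExtensionProvisioningError/osProvisioningComplete",
      "level": "Error",
      "displayStatus": "OS provisioning complete",
      "message": "Failed to provision VM extensions for VM 'CN0-VM'",
      "time": "2020-06-17T16:12:02.32868+00:00"
    },
    {
      "code": "PowerState/running",
      "level": "Info",
      "displayStatus": "VM running"
    }
  ]
}
'@

# Keep only the Error-level entries and print their code and message.
$failed = ($json | ConvertFrom-Json).statuses | Where-Object { $_.level -eq 'Error' }
$failed | ForEach-Object { "{0}: {1}" -f $_.code, $_.message }
```

With the sample above, the only entry that survives the filter is the VM extension provisioning failure for CN0-VM.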

You say that your hardware meets the minimum recommendations - so you have 4 drives plus an OS drive, but are any of those SSDs? I've seen timeouts like this error due to low performance storage/hardware, and unfortunately, it's not something I can fix in my script.

Could you share more about your hardware?

JonParvez commented 4 years ago

Hi Matt, thanks again for your response.

We have no SSDs; the disks are SAS hard drives.

Disk drives: 1 OS disk of 558GB, and the other disks are also 558GB each.
Dual-Socket: 16 Physical Cores (total)
128 GB RAM
Hyper-V Enabled (with SLAT support)
RAID HBA - Adapter is configured in "pass-through" mode.
Disks are configured as Single-Disk, RAID-0.

I believe we have met the requirements for ASDK installation. Please suggest something that would let us run the script successfully.

Thanks for your patience.

mattmcspirit commented 4 years ago

Hi Jon,

From a CPU and memory perspective, you're fine; however, the disks are going to struggle without an SSD as cache.

When you finish deploying the ASDK, there are ~12 VMs running. Adding the SQL/MySQL bits adds another 4 (RP VMs and DB hosts), and then there's an additional SQL VM and File Server VM to support the App Service deployment. So, you're at 18 VMs running across those 4 SAS HDDs.

When the App Service deploys on the ASDK using my script, I try to be as efficient as possible, only deploying 1 VM instance for each of the required roles (controller, management, publisher, frontend and worker). So, that's another 5 VMs on top of the 18 you're already running, and I believe that's where your system is falling down, I'm afraid.
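
The VM arithmetic above can be tallied in a few lines. This is only a sketch using the approximate counts quoted in this comment (the ASDK figure is "~12"); the group labels are made up for the example.

```powershell
# Approximate VM counts as quoted in this thread.
$vmCounts = [ordered]@{
    'ASDK infrastructure'                 = 12
    'SQL/MySQL RPs and DB hosts'          = 4
    'App Service SQL and File Server'     = 2
    'App Service roles (1 instance each)' = 5
}

# Sum the counts to get the total number of VMs sharing the same disks.
$total = ($vmCounts.Values | Measure-Object -Sum).Sum
"Total VMs: $total"
```

That comes to 23 VMs sharing the 4 SAS HDDs once the App Service roles are added.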

One thing you could try is to edit the DeployAppService.ps1 script in the Scripts folder, where your AzSPoC.ps1 file is located, at Line 288: https://github.com/mattmcspirit/azurestack/blob/master/deployment/powershell/DeployAppService.ps1#L288

Change Standard_A2 to Standard_A3, which will give the Controller VM a bit more memory and CPU, but even then, it won't buy you extra storage I/O.

If you want to eliminate my script, just manually deploy AppService.exe and follow the wizard, but I suspect you will end up with the same result.

Hope it works! Matt

JonParvez commented 4 years ago

Hi Matt, thanks for your suggestion.

I was thinking of deleting the MySQL and Ubuntu related resources, because we are just setting everything up in a test environment.

Can you suggest how to delete the MySQL and Ubuntu resources without affecting the other resources, and how I could rerun the script while skipping the MySQL and Ubuntu resource installation?

I think rerunning the script would work after that.

Thanks for your help.

mattmcspirit commented 4 years ago

Hi,

So, you could clean up the MySQL and SQL Resource Providers. You don't need to remove the Ubuntu image, as this doesn't impact I/O and running workloads - it's just an image.

If it were me, I'd do a clean deployment, and use -skipMySQL and -skipMSSQL to not install those bits, but if you want to clean up, you should first remove the SQL/MySQL hosting servers. The steps here: https://docs.microsoft.com/en-us/azure-stack/operator/azure-stack-mysql-resource-provider-hosting-servers?view=azs-2002#connect-to-a-mysql-hosting-server walk you through adding one, so just do the reverse for both SQL and MySQL.

Once the hosting servers have been disconnected from the Resource Providers, you should then be able to delete the MySQL and SQL database host VMs. Note, these are in the tenant portal, so you'll need to go to portal.local.azurestack.external to find those. You could just delete the whole Resource group that contains the DB hosts, assuming you've disconnected the hosting servers from both the SQL and MySQL RP.

Once you've cleaned up the hosting servers, you can then remove the MySQL and SQL RPs themselves. The steps for SQL are here: https://docs.microsoft.com/en-us/azure-stack/operator/azure-stack-sql-resource-provider-remove?view=azs-2002 and MySQL: https://docs.microsoft.com/en-us/azure-stack/operator/azure-stack-mysql-resource-provider-remove?view=azs-2002

For reference, your Privileged Endpoint is "AzS-ERCS01".

Doing all of that will reduce your VM count by 4 and that's as minimal as you can get if you still want the App Service.

Good luck!

JonParvez commented 4 years ago

Hi Matt, we tried removing the MySQL RPs and ran your script again, but no luck. Then we tried to deploy the App Service manually, as all the App Service prerequisites had already been met by your scripts.

So we first cleaned up the DB and removed the App Service resource group, as per the answer in issue #72, and ran the installer, but no luck again. The installation stopped at the Deploy App Service step. We cleaned up the resources and tried installing again; this time that step passed, but the installation got stuck on Register AppService Admin Resource Provider. After another cleanup and installation, it got stuck at Deploy App Service again.

That is the whole scenario so far. Can you suggest anything? I would appreciate it.

appservice_20200623_065940.log

Thanks in advance.

mattmcspirit commented 4 years ago

Hi Jon,

It's not clear from the log what the error is, but I think it's likely the same one we saw before:

{
  "code": "ProvisioningState/failed/VMExtensionProvisioningError/osProvisioningComplete",
  "level": "Error",
  "displayStatus": "OS provisioning complete",
  "message": "Failed to provision VM extensions for VM 'CN0-VM'",
  "time": "2020-06-23T11:04:23.0568417+00:00"
},

If you look at issue #72, which you referenced, you'll see that what worked for them was adding an SSD to the machine (https://github.com/mattmcspirit/azurestack/issues/72#issuecomment-454306293). If you want to try one more time, here's what I'd do:

  1. CTRL+C to stop any PowerShell running tasks
  2. Cancel/Close the App Service executable
  3. Stop all PowerShell jobs by running Get-Job | Stop-Job and then Get-Job | Remove-Job
  4. Close remaining PowerShell Windows/ISEs etc.
  5. Clean up using below script (updated from #72)
# Load the SQL Server cmdlets used below (Get-SqlInstance, Get-SqlLogin, Invoke-Sqlcmd)
Import-Module SqlServer -ErrorAction Stop

# Login to Azure Stack
$ArmEndpoint = "https://adminmanagement.local.azurestack.external"
Add-AzureRMEnvironment -Name "AzureStackAdmin" -ArmEndpoint "$ArmEndpoint" -ErrorAction Stop
Add-AzureRmAccount -EnvironmentName "AzureStackAdmin" -ErrorAction Stop
$azsLocation = (Get-AzureRmLocation).DisplayName

# Clean Database
# Prompt for the password as a SecureString so it is not echoed to the console
$secureVMpwd = Read-Host -AsSecureString -Prompt "Enter the password you used for -VMpwd when you originally ran the script"
$SQLServerUser = "sa"
$sqlAppServerFqdn = "sqlapp.local.cloudapp.azurestack.external"
$dbCreds = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $SQLServerUser, $secureVMpwd -ErrorAction Stop

# Drop any leftover App Service databases
$appServiceDBCheck = Get-SqlInstance -ServerInstance $sqlAppServerFqdn -Credential $dbCreds | Get-SqlDatabase | Where-Object { $_.Name -like "*appservice*" }
foreach ($appServiceDB in $appServiceDBCheck) {
    Write-Host "$($appServiceDB.Name) database found. Cleaning up to ensure a successful rerun of the AppService deployment"
    # Bracket-quote the database name so unusual characters do not break the query
    $cleanupQuery = "ALTER DATABASE [$($appServiceDB.Name)] SET SINGLE_USER WITH ROLLBACK IMMEDIATE; DROP DATABASE [$($appServiceDB.Name)]"
    Invoke-Sqlcmd -ServerInstance $sqlAppServerFqdn -Credential $dbCreds -Query "$cleanupQuery" -Verbose
}

# Remove any leftover App Service SQL logins
$appServiceLoginCheck = Get-SqlLogin -ServerInstance $sqlAppServerFqdn -Credential $dbCreds -Verbose:$false | Where-Object { ($_.Name -like "*appservice*") -or ($_.Name -like "*WebWorker*") }
foreach ($appServiceLogin in $appServiceLoginCheck) {
    Write-Host "$($appServiceLogin.Name) login found. Cleaning up"
    Remove-SqlLogin -ServerInstance $sqlAppServerFqdn -Credential $dbCreds -LoginName $appServiceLogin.Name -Force -Verbose
}

# Clean resource group (only if the previous App Service deployment failed)
$appServiceFailCheck = (Get-AzureRmResourceGroupDeployment -ResourceGroupName "appservice-infra" -Name "AppService.DeployCloud" -ErrorAction SilentlyContinue)
if ($appServiceFailCheck.ProvisioningState -eq 'Failed') {
    Write-Output "There is evidence of a previously failed App Service deployment in the App Service Resource Group. Starting cleanup..."
    Get-AzureRmResourceGroup -Name "appservice-infra" -Location $azsLocation -ErrorAction SilentlyContinue | Remove-AzureRmResourceGroup -Force -ErrorAction SilentlyContinue -Verbose
}

Write-Host "Log back into the portal and check the appservice-infra resource group has gone"

I would make sure you manually log into the portal and delete the App Service RGs that my script/you created manually, as my script above is only looking for the RG name "appservice-infra", which may not be the same as what you used when deploying with the AppService.exe manually.

Once that's completed, try manually one last time, using the wizard. Note that deploying the App Service can take ~90 minutes even when you have SSDs and HDDs in the box, so with no SSDs at all, don't be surprised if the process takes longer. It may look like it's stuck; if you open the log file, it may not seem to be changing from:

[8438:80B8][2020-06-23T12:04:58]i000: [Websites]: Deployment status: Running. Last updated: 2020-06-23 12:04:58. Total elapsed time: 00:00:45

But that's ok, just wait until the process either succeeds, or fails. The failure will most likely come due to a timeout I'm afraid.
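
To check whether a status line like the one above is actually moving, one option is to pull the status and elapsed time out of it. This is a minimal sketch that assumes the line format shown in the excerpt above; the regular expression and variable names are made up for the example.

```powershell
# Sample line copied from the log excerpt above.
$line = '[8438:80B8][2020-06-23T12:04:58]i000: [Websites]: Deployment status: Running. Last updated: 2020-06-23 12:04:58. Total elapsed time: 00:00:45'

# Capture the status word, the last-updated timestamp and the elapsed time.
$pattern = 'Deployment status: (?<status>\w+)\. Last updated: (?<updated>[\d\- :]+)\. Total elapsed time: (?<elapsed>[\d:.]+)'
if ($line -match $pattern) {
    $status  = $Matches['status']
    $elapsed = [TimeSpan]::Parse($Matches['elapsed'])
    "Status: $status, elapsed: $($elapsed.TotalSeconds) seconds"
}
```

To follow a live log, the same pattern could be applied to the output of `Get-Content -Path $logPath -Tail 1 -Wait`, where `$logPath` is a placeholder for wherever your installer log is written.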

At that point, I'm afraid there's not much I can suggest apart from having at least 1 or 2 SSDs in the box; even low-end SATA SSDs would be enough.

Thanks, Matt

JonParvez commented 4 years ago

Hi Matt,

Thanks for your service. We have deployed app service successfully.

The solution was this: there was an extension installed on CN0-VM. We uninstalled it and reran the App Service deployment, which then completed and also reinstalled that extension.
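
For anyone wanting to try the same thing from PowerShell, a rough sketch follows. It assumes you are already logged in to the Azure Stack admin environment with the AzureRM module (as in the cleanup script earlier in this thread) and that the VM lives in the "appservice-infra" resource group. Which extension was removed is not stated in this thread, so `$extName` is a placeholder to fill in after reviewing the list.

```powershell
# Assumption: the resource group and VM names match those seen earlier in this thread.
$rg = "appservice-infra"
$vm = "CN0-VM"

# List the extensions currently attached to the controller VM.
$extensions = (Get-AzureRmVM -ResourceGroupName $rg -Name $vm).Extensions
$extensions | Select-Object Name, ProvisioningState

# Placeholder: set this to the name of the extension you decide to remove.
$extName = "<extension name from the list above>"

# Remove that one extension from the VM.
Remove-AzureRmVMExtension -ResourceGroupName $rg -VMName $vm -Name $extName -Force
```

The same removal can also be done from the VM's Extensions blade in the portal.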

You can close this issue for now. By the way, do you know how to connect the web apps deployed on the App Service to the internet? The apps/APIs we deployed on the App Service can be browsed from the host server, but not from anywhere on the internet.

Thanks again.

mattmcspirit commented 4 years ago

You're welcome! Best of luck with your ASDK!