hashicorp / packer

Packer is a tool for creating identical machine images for multiple platforms from a single source configuration.
http://www.packer.io

[Azure] WinRM timeout with Windows 2016-Datacenter Marketplace Image #8658

Closed Dilergore closed 3 years ago

Dilergore commented 4 years ago

Please refer to the end of this thread to see other users reporting that this is not working: https://github.com/MicrosoftDocs/azure-docs/issues/31188

Issue:

Started: December, 2019. Packer cannot connect with WinRM to machines provisioned from Windows 2016 (2016-Datacenter) Marketplace image in Azure.

Further details:

Increasing the WinRM timeout does not help. It seems the last working image is version "14393.3326.1911120150" (released 12th of Nov). It stopped working with "14393.3384.1912042333" (released 10th of Dec).

This issue is only impacting 2016-Datacenter. 2019 is working properly.

To get image Details for a Region:

az vm image list --location northeurope --offer WindowsServer --publisher MicrosoftWindowsServer --sku 2016-Datacenter --all
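To pick the newest version at or below a known-good build from that listing, the version strings can be compared numerically. The following is only a sketch and not part of the original report: it assumes the `az` output is JSON (the CLI default) with a `version` field on each entry, and the function names `version_key` and `newest_at_or_below` are made up for illustration.

```python
import json


def version_key(version):
    """Turn '14393.3326.1911120150' into (14393, 3326, 1911120150)."""
    return tuple(int(part) for part in version.split("."))


def newest_at_or_below(az_json, known_good):
    """Return the newest listed version that is not newer than known_good.

    az_json is the JSON text printed by `az vm image list ... --all`
    (assumed here to be a list of objects with a "version" field).
    Returns None when no listed version qualifies.
    """
    limit = version_key(known_good)
    versions = [entry["version"] for entry in json.loads(az_json)]
    candidates = [v for v in versions if version_key(v) <= limit]
    return max(candidates, key=version_key) if candidates else None


if __name__ == "__main__":
    # Sample input built from the two version numbers quoted above.
    sample = json.dumps([
        {"sku": "2016-Datacenter", "version": "14393.3326.1911120150"},
        {"sku": "2016-Datacenter", "version": "14393.3384.1912042333"},
    ])
    print(newest_at_or_below(sample, "14393.3326.1911120150"))
```

The returned string could then be used as "image_version" in the Packer template in place of "latest", so that builds stay on the last image that worked.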

URL to the Last Working Image:

https://support.microsoft.com/en-us/help/4525236/windows-10-update-kb4525236

URL to the Image where something went wrong:

https://support.microsoft.com/en-us/help/4530689/windows-10-update-kb4530689

Notes:

This currently applies to North EU. I have had no time to investigate other regions, but I believe the same images are distributed to every region.

I am opening a Microsoft case and planning to update the thread with the progress.

alemuro commented 4 years ago

Hello everyone,

Looking at the logs, I see that Packer tries only once to access the WinRM service. Is this expected? AFAIK, we should see more lines like [INFO] Attempting WinRM connection... unless the connection succeeds, right?

Regards,

2020/02/13 17:19:31 packer-builder-azure-arm plugin: Waiting for WinRM, up to timeout: 30m0s
2020/02/13 17:19:31 ui: ==> azure-arm: Waiting for WinRM to become available...
2020/02/13 17:19:31 packer-builder-azure-arm plugin: [INFO] Attempting WinRM connection...
2020/02/13 17:19:31 packer-builder-azure-arm plugin: [DEBUG] connecting to remote shell using WinRM
2020/02/13 17:19:44 packer-builder-azure-arm plugin: Checking that WinRM is connected with: 'powershell.exe -EncodedCommand [...]'
2020/02/13 17:19:44 packer-builder-azure-arm plugin: [INFO] starting remote command: powershell.exe -EncodedCommand [...]
2020/02/13 17:49:31 ui error: ==> azure-arm: Timeout waiting for WinRM.
2020/02/13 17:49:31 packer-builder-azure-arm plugin: Communication connection err: context canceled
2020/02/13 17:49:31 packer-builder-azure-arm plugin: WinRM wait canceled, exiting loop
2020/02/13 17:49:31 ui: ==> azure-arm:

danielsollondon commented 4 years ago

Hi All, I was going to update you, but @Dilergore beat me to it :-) If there are any changes with ETA, we will let you know.

BruceShipman commented 4 years ago

@danielsollondon - So this is an issue with Server 2016 images only? I get this issue with Server 2012 R2 and Server 2019, as well (particularly late in the business day,) but Server 2016 is the usual problem child. Although I did have two Server 2019 builds fail early this morning with WinRM timeout errors...

adamrushuk commented 4 years ago

I've also seen the same issues with Server 2012 R2. @Dilergore will the fix also be in the latest February Windows Server 2012 image, once available from the Marketplace?

shurick81 commented 4 years ago

New images:

2019-Datacenter - 17763.1039.2002091844
2016-Datacenter - 14393.3504.2002070914
2012-R2-Datacenter - 9600.19629.2002070917

shurick81 commented 4 years ago

Same "Timeout waiting for WinRM." error with this configuration:

"os_type": "Windows",
"image_publisher": "MicrosoftWindowsServer",
"image_offer": "WindowsServer",
"image_sku": "2019-Datacenter",
"image_version": "17763.1039.2002091844",

"communicator": "winrm",
"winrm_use_ssl": "true",
"winrm_insecure": "true",
"winrm_timeout": "30m",
"winrm_username": "packer",

"vm_size": "Standard_DS2_v2",
"managed_image_storage_account_type": "Premium_LRS",

BruceShipman commented 4 years ago

Building 2012 R2, 2016, and 2019 images in West US 2, using Standard_D2s_v3 and latest marketplace image available.

Morning and early afternoon builds were all fine: I did 5 sets of them. The last two sets have hung waiting for WinRM, on all three targets. Sigh.

ghost commented 4 years ago

2019 still timing out for me. Anyone else?

BruceShipman commented 4 years ago

2012R2, 2016, and 2019 builds are fine overnight and in the morning, but they all get WinRM timeouts starting around 13:30 (UTC-0800). Timeouts continue until sometime in the late evening. If I use the Run Command tool in the portal to run PowerShell that creates a new self-signed cert and WinRM listener, the process completes successfully.

ghost commented 4 years ago

Yep same, none are working.

Region: US East. Sizes: all. WinRM timeout amount: a lot.

The MSFT guy closed their issue when the image was updated. Obviously, that wasn't the fix.

I wish not-so-great things on the people who decided we would use Packer... AWS and GCP work without an issue; I'm talking end to end, including pipelines for all Windows versions, in a matter of hours. Azure, on the other hand, is throwing a temper tantrum. Week 4... and this isn't even my main issue. The main issue I have is PowerShell and Puppet "completing" before they are actually complete.

Funny that AWS and GCP have better and a LOT faster Windows builds than Microsoft. Not sure how that happened.

danielsollondon commented 4 years ago

Hi, sorry for the delay in getting back to you. I have spoken to the Windows PG team, and it looks like there could be another issue. @BruceShipman and @ffalorjr - I see you are using 2012R2, 2016, and 2019 in WestUS2 and EastUS. What version of Packer are you using? Once I have this, I will try to repro.

danielsollondon commented 4 years ago

Sorry, I forgot to address this question. @BruceShipman - the fix in the Windows 2016 image was for an issue that resulted in WinRM timeouts, but the cause of that issue does not exist in 2019, which is why we think there could be another issue here.

ghost commented 4 years ago

@danielsollondon this is my Packer version: 1.5.2. Thanks!

danielsollondon commented 4 years ago

thanks @ffalorjr - I will try to repro this, if I cannot, I will come back to you.

BruceShipman commented 4 years ago

@danielsollondon - I'm currently using Packer 1.5.1. If you think it might help I can update that to 1.5.4

danielsollondon commented 4 years ago

@BruceShipman I have not reviewed the release, so I don't know if it will help. If I can repro the issue, I will try this, I will come back within 24hrs.

urbanchaosmnky commented 4 years ago

FYI: Having the same issue in CanadaCentral

Settings: Packer Version v1.5.4

winrm timeout is set to 15m

"image_publisher": "MicrosoftWindowsServer",
"image_offer": "WindowsServer",
"image_sku": "2019-Datacenter",
"location": "canadacentral",
"vm_size": "Standard_DS2_v2",
"base_image_version": "17763.973.2001110547",

This has been happening to me for the past week or so.

nickmhankins commented 4 years ago

Same issue in East US 2 on all Server 2019 based images. Server 2016 images are working fine for me.

Packer version 1.3.2, 1.5.1, and 1.5.4 tested

winrm timeout set to 60 minutes

      "os_type": "Windows",
      "image_publisher": "MicrosoftSQLServer",
      "image_offer": "sql2019-ws2019",
      "image_sku": "Enterprise",
      "image_version": "latest",

      "os_type": "Windows",
      "image_publisher": "MicrosoftWindowsServer",
      "image_offer": "WindowsServer",
      "image_sku": "2019-Datacenter",
      "image_version": "latest",

danielsollondon commented 4 years ago

@BruceShipman, @ffalorjr, @urbanchaosmnky, @nickmhankins - I have reproduced the timeout issue using the source below and locating the build VM in West US2, with Packer 1.5.4.

One question: where are you running Packer? Is it on a VM in the same data center, in a different DC, on-premises, or in a build pipeline?

"os_type": "Windows",
"image_publisher": "MicrosoftWindowsServer",
"image_offer": "WindowsServer",
"image_sku": "2019-Datacenter",
"image_version": "latest",

urbanchaosmnky commented 4 years ago

@danielsollondon

I'm running packer off my laptop, both on our company network and from my home internet connection. I don't know if that helps you. I haven't tried to run packer on a VM in the cloud.

ghost commented 4 years ago

@danielsollondon

I've run it from my laptop, I've run it in a pipeline with build servers on-prem, and I've run it from a pipeline with build servers in Azure (in the same sub and network).

Thanks

BruceShipman commented 4 years ago

@danielsollondon

I'm running these on cloud hosted AzDO pipelines.

ghost commented 4 years ago

@danielsollondon while I have you, have you come across the PowerShell or Puppet provisioners exiting early and moving on to the next provisioner as if the first one had completed without issue, even though it did not run all the way through?

This is only happening to my azure builds. GCP and AWS use the exact same code, and they don't have that issue.

danielsollondon commented 4 years ago

Hi @ffalorjr - sorry, I have not seen that issue with the PowerShell or Puppet provisioners. Is it easily reproducible?

All - The Windows team have a repro of this and are investigating, we will update you on Monday.

ghost commented 4 years ago

> Hi @ffalorjr - sorry, I have not seen that issue with the PowerShell or Puppet provisioners. Is it easily reproducible?
>
> All - The Windows team have a repro of this and are investigating, we will update you on Monday.

@danielsollondon Thanks for the feedback, just thought I'd ask. I don't want to derail this thread since the issue is not related to WinRM.

I found out that some in-guest policies are being pushed at the subscription level, so I'm getting those disabled to see if it fixes my issue. The repro steps for me have been to simply add any provisioner to the build and run it a few times; at least once it will exit in the middle of the run as "complete" when it is not.

adamrushuk commented 4 years ago

Thanks for the latest Marketplace images guys. Since using the latest, the build success rate is >90% for both Win2012 and Win2016 :)

danielsollondon commented 4 years ago

Hi All, the Windows team have provided this feedback: can you test with the Windows build properties below? For the vm_size, please use an alternative to the DSv2 sizes, such as 'Standard_D2_v2'.

"communicator": "winrm",
"winrm_username": "packer",
"winrm_insecure": true,
"winrm_use_ssl": true,
"vm_size": "Standard_D2_v2"

Can you let us know if this improves build success rates? Thanks.

urbanchaosmnky commented 4 years ago

@danielsollondon, so far I haven't had any success with a Standard_D2_V2 and the Windows 2019 image in CanadaCentral.

danielsollondon commented 4 years ago

Thanks @urbanchaosmnky.

@BruceShipman, @ffalorjr, @nickmhankins - can you let us know if your builds are still failing with the config from my previous post? Thanks.

BruceShipman commented 4 years ago

@danielsollondon, overnight builds all failed due to the Terraform azurerm provider automatically updating to v2.0 (annoyed; that'll teach me to pin my version), which caused the upstream pipeline that manages the gallery and image definitions to fail. I'm testing the fix to the upstream pipeline, and will shortly start testing runs with Standard_D2s_v3 changed to Standard_D2_v2. It may take a while, as runs are mostly fine until about 2 PM or so. (The rest of the config you had was the same as I already had.)

danielsollondon commented 4 years ago

Thanks @BruceShipman for letting me know.

urbanchaosmnky commented 4 years ago

@danielsollondon, I've been able to build with Packer and a Win2019 image this morning; so far no issues.

danielsollondon commented 4 years ago

@urbanchaosmnky - thanks for letting me know. Are you still deploying into CanadaCentral? Please keep me informed here if you see any further failures. Sorry I didn't investigate your issue further yesterday; we were doing more testing, and I was waiting for additional feedback from the other folks here to see if they are hitting further failures.

@BruceShipman , @ffalorjr, @nickmhankins - can you let me know if you are still seeing failures? thanks!

ghost commented 4 years ago

@danielsollondon thanks for the help. I've been running tests, and so far I have not seen a WinRM error after changing the size. Even before changing the size, the success rate was greatly increased compared to previous days.

adamrushuk commented 4 years ago

I've been using Standard_DS2_v2 since the latest marketplace images, and have only had one build failure in the past week. A massive improvement! 👍

urbanchaosmnky commented 4 years ago

@danielsollondon Yes, I'm still deploying in CanadaCentral. I've deployed 10 times now with no issues. Thanks again.

BruceShipman commented 4 years ago

@danielsollondon - I had the 3 Windows pipelines in a loop yesterday until about 7 PM without a single failure, and the overnight build was fine, as well. So while I'd call this a work-around instead of a fix, it definitely allows us to build our images without babysitting the automation. YAY!

AdamOrpen commented 4 years ago

My builds are finally working using Standard_D2_v2. Thanks for the efforts to resolve this, even though it seems to be a temporary solution. When will all VM sizes be available?

sanchetanparmar commented 4 years ago

Thanks @danielsollondon, the build is finally working using Standard_D2_v2, but I am still getting random timeouts.

justinhauer commented 4 years ago

@danielsollondon I'm deploying in US Central and W2K16 Datacenter builds are timing out on me. I'm using a large size, Standard_D16s_v3, and I've set my timeout to 10 minutes. This was working a week ago, which is really discouraging. :(

adamrushuk commented 4 years ago

I've been using Standard_D2_v2 and haven't had a single timeout. I am getting WinRM issues when starting DSC, but I'm sure that's a different issue.

EricLocsin commented 4 years ago

This was working fine over the weekend. Unfortunately I haven't been able to get past the timeout issue all day today (3/9/2020). I was trying to use Standard_D2_v2 as well as more powerful sources. Increasing the timeout to 30 minutes didn't help either.

dave-5 commented 4 years ago

I too am seeing repeated winrm timeouts when creating a Datacenter 2019 image:

"winrm_insecure": true,
"winrm_use_ssl": true,
"winrm_timeout": "30m",
"vm_size": "Standard_D3_v2",
"os_type": "Windows",
"image_offer": "MicrosoftWindowsServer",
"image_publisher": "WindowsServer",
"image_sku": "2019-Datacenter",
"image_version": "latest"

Creating in North Central US region from a self-hosted DevOps Agent also running in North Central US.

rpayne-rms commented 4 years ago

We too have experienced the WinRM timeout issue over several months. Per the above recommendation, we changed "vm_size": "Standard_DS3_v2" -->> "vm_size": "Standard_D2_v2"

On Packer v1.5.4, we've run 3 consecutive successful builds of 3 VMs (one 2012, two 2016) in US East. Fingers crossed for this workaround working tomorrow!

EricLocsin commented 4 years ago

I had more success when I reverted back to using v1.5.1. As soon as I go back to 1.5.3 or 1.5.4, it starts to time out again. Unfortunately "elevated_password" is broken in 1.5.1 and I need that to work.

michaelmowry commented 4 years ago

Running packer 1.5.4 to provision an Azure image. Details:

    "os_type": "Windows",
    "image_publisher": "MicrosoftWindowsServer",
    "image_offer": "WindowsServer",
    "image_sku": "2019-Datacenter-smalldisk-g2",
    "location": "East US",
    "vm_size": "Standard_B2ms"

When I run Test-WSMan -ComputerName 40.76.44.11 -usessl to check the WinRM connection, I get this error. I have also tried with the machine DNS name "pkrvm7ekd2nbewu.eastus.cloudapp.azure.com" and with different images and different sizes; same issue. This was working last Monday and has been giving issues since then.

Error: The server certificate on the destination computer (40.76.44.11:5986) has the following errors:
The SSL certificate is signed by an unknown certificate authority.
The SSL certificate contains a common name (CN) that does not match the hostname.

Also I have tried the commands suggested to reset the SSL certificate and I still get the error after they succeed. I ran them using the Azure Run Command on the packer VM I am creating.

    $Cert = New-SelfSignedCertificate -CertstoreLocation Cert:\LocalMachine\My -DnsName "$env:COMPUTERNAME"
    Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
    New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $Cert.Thumbprint -Force
    Stop-Service winrm
    Start-Service winrm

This is pretty frustrating, really like Packer but it doesn't seem to like Azure right now.

danielsollondon commented 4 years ago

Hi All - Sorry for the delay. One of the causes of the timeout issue shows up when premium VM sizes are used, although it is not directly related to the premium offerings. These are the VM sizes that contain an 'S' in the size name. This is being resolved, and I will come back when that has completed. If you are still seeing WinRM timeouts, try using a VM size that does not contain an 'S', such as changing from Standard_DS2_v2 to Standard_D2_v2. These were the settings I used during testing:

"communicator": "winrm",
"winrm_username": "packer",
"winrm_insecure": true,
"winrm_use_ssl": true,
"vm_size": "Standard_D2_v2"

For those of you who still have issues with that config (or with a vm_size that has no 'S'), please let me know.
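For anyone checking many templates, the renaming rule above can be approximated in a few lines. This is a rough sketch and not something Microsoft or Packer provides: it is a purely name-based heuristic that looks only at the second underscore-separated field of the size (so the 'S' in "Standard" is ignored), and the function names `suggest_non_s_size` and `has_s_marker` are made up for illustration. The suggested name is not guaranteed to exist in a given region, so it should still be checked against the output of `az vm list-sizes --location <region>`.

```python
import re

# Matches the family/size field, e.g. "DS2", "D2s", "D16s" or "B2ms":
# family letters, then the size digits, then optional feature letters.
_FIELD = re.compile(r"^([A-Za-z]+)(\d+)([A-Za-z]*)$")


def suggest_non_s_size(vm_size):
    """Return vm_size with the 'S' marker removed, or vm_size unchanged.

    Heuristic only: handles both spellings seen in this thread,
    Standard_DS2_v2 (S after the family letter) and
    Standard_D2s_v3 (s after the size digits).
    """
    parts = vm_size.split("_")
    if len(parts) < 2:
        return vm_size
    match = _FIELD.match(parts[1])
    if not match:
        return vm_size
    family, digits, suffix = match.groups()
    # Drop a trailing S from multi-letter families such as "DS" or "GS".
    if len(family) > 1 and family.upper().endswith("S"):
        family = family[:-1]
    # Drop any s/S from the feature letters after the digits.
    suffix = re.sub("[sS]", "", suffix)
    parts[1] = family + digits + suffix
    return "_".join(parts)


def has_s_marker(vm_size):
    """True when the heuristic would change the size name."""
    return suggest_non_s_size(vm_size) != vm_size


if __name__ == "__main__":
    # Sizes taken from the comments in this thread.
    for size in ("Standard_DS2_v2", "Standard_D2s_v3", "Standard_D2_v2"):
        print(size, "->", suggest_non_s_size(size))
```

Sizes such as Standard_B2ms would map to a name that may not correspond to a real offering, which is why the result is best treated as a starting point for a lookup rather than a drop-in replacement.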

shurick81 commented 4 years ago

I still have issues with building Windows Server 2019 (not 2016).

"os_type": "Windows",
"image_publisher": "MicrosoftWindowsServer",
"image_offer": "WindowsServer",
"image_sku": "2019-Datacenter",
"image_version": "latest",

"communicator": "winrm",
"winrm_use_ssl": "true",
"winrm_insecure": "true",
"winrm_timeout": "30m",
"winrm_username": "packer",

"vm_size": "Standard_DS2_v2",
"managed_image_storage_account_type": "Premium_LRS",

region: west europe

shurick81 commented 4 years ago

With the size you recommend, Packer has no issues with creating VMs:

"os_type": "Windows",
"image_publisher": "MicrosoftWindowsServer",
"image_offer": "WindowsServer",
"image_sku": "2019-Datacenter-smalldisk",
"image_version": "latest",

"communicator": "winrm",
"winrm_use_ssl": "true",
"winrm_insecure": "true",
"winrm_timeout": "30m",
"winrm_username": "packer",

"vm_size": "Standard_D2_v2",
"managed_image_storage_account_type": "Standard_LRS",

However, builds became much slower with standard disks...

faizan002 commented 4 years ago

Hi, I have this timeout issue too. It is quite sporadic, but sometimes I can't create an image for a whole day and then it works the next day. Lately I have started to see an "ssh timeout" issue when creating an Ubuntu image as well :( This is so frustrating.