jetbrains-infra / packer-builder-vsphere

Packer plugin for remote builds on VMware vSphere
Mozilla Public License 2.0

v2.0 - installations hang during "Setup" #119

Closed chris-david-taylor closed 6 years ago

chris-david-taylor commented 6 years ago

The v2.0 plugin seems to have a bug when installing Windows (I haven't tried other operating systems yet). My present lab runs on vSphere 6.5.

Steps to reproduce:

  1. Take a working configuration, with the 2.0-beta4 plugins installed.
  2. Update the plugin from v2-beta4 to v2.0
  3. Run Packer - Installation hangs at "Setup"

Confirmed on Windows 2012_r2, and Windows 7.

I'll try to get some logging out of our environment tomorrow; my permissions are too locked down for me to look. Part of me thinks this may be related to #112. I have also tried updating Packer to 1.2.3.

chris-david-taylor commented 6 years ago

Further debugging: this morning I spotted the following on Windows 7: "Windows cannot apply the DiskConfiguration in Autounattend.xml".

embusalacchi commented 6 years ago

Well I spent the entire day on this.. and I probably should have come here first.

I am seeing the same situation with vSphere 6.5 w/DRS and the release version of packer-builder-vsphere-iso.exe. At first I thought it was because the release version takes the disk size in MB rather than GB, so I went from an 80GB partition to an 80MB partition. Once I realized that wasn't what was going on, I spent pretty much all day trying to figure out what I had done wrong.

The part that threw me off is that the vSphere GUI is almost totally unresponsive when working with the VM that Packer has created. Shutting down the VM usually fails or errors out a few times. Getting to the console will hang or not connect at all. The VM also generally tries to consume all of the RAM available to it. CPU doesn't spike. I/O doesn't spike. Nothing.

What I found interesting, though, is that if I left the process running and did a "reset" on the VM through the VMRC, the VM booted normally, sped through the setup without issue, and completed quickly. So there's some interaction between the release version of the plug-in and vSphere during VM creation that wasn't there in prior versions.

Initially I thought it was because I was running Packer from a different server than before. Then I realized that on the new server (running Jenkins) I had downloaded a newer version of the plug-in. As soon as I swapped the plug-in for the pre-release version and fixed the disk size (as it was now trying to create an 80000GB drive), it worked as expected. I don't know the version number of the plugin that works for me, but it says it is from 4/12/18.

If there is additional logging or anything else I can do to help you troubleshoot this, please let me know. It is very easy to reproduce.

embusalacchi commented 6 years ago

Looks like https://github.com/jetbrains-infra/packer-builder-vsphere/issues/104 and https://github.com/jetbrains-infra/packer-builder-vsphere/issues/112 and https://github.com/jetbrains-infra/packer-builder-vsphere/issues/119 might be all the same issue.

sudomateo commented 6 years ago

@embusalacchi looking like it. We'll need some insight on what might have changed between the 2.0beta4 release and the 2.0 release. I've been trying to go build by build from the public teamcity server located here but I don't really have the time to do so and keep these hung VMs in my inventory. I also don't know which build corresponds to the 2.0beta4 release so I can work up from there.

chris-david-taylor commented 6 years ago

Last commit for 2.0-beta4 was 15th of March to add Cluster Support.

chris-david-taylor commented 6 years ago

I did a build from 25th April after commit #82 and the issue isn't present then. Hope that is of some help @sudomateo ?

sudomateo commented 6 years ago

@chris-david-taylor thank you sir. I'll check that build out.

kempy007 commented 6 years ago

Are you using boot_cmd in Packer? My VMs lock up after this.

chris-david-taylor commented 6 years ago

I'm not @kempy007. It's not the boot command that is the issue. @embusalacchi suggested these might be all related, possibly to floppy_media. First of all, does 2.0beta4 work for you, and what OS are you templating?

I'm not an expert in Golang and currently working 12 hour days, otherwise I'd learn a bit and investigate, but as long as you aren't desperate for winrm, then building from #82 should be OK. Are you comfortable doing that?

embusalacchi commented 6 years ago

I don't mind trying other builds but I don't have the means to build them on my own. Is there a link somewhere? I don't mind trying them as I have time to nail down when it went bad.

kempy007 commented 6 years ago

RHEL6, and Packer is now 1.2.4. The issue only occurs with boot_cmd: after it is invoked, the VM never uses more than 30 MHz and thus appears hung. Restart and shutdown from VMRC may fail. Ctrl + Alt + Del inside the VM does reboot it.

I think it may be related, which is why I wanted to know if you have boot_cmd in your Packer file.

embusalacchi commented 6 years ago

I am not using the boot_cmd in my packer file.

sudomateo commented 6 years ago

@embusalacchi Builds can be found here: https://teamcity.jetbrains.com/viewType.html?buildTypeId=PackerVSphere_Build&branch_PackerVSphere=%3Cdefault%3E&tab=buildTypeStatusDiv

Just log in as guest and download the build you want.

Also I am using boot_command in my packer file.

chris-david-taylor commented 6 years ago

I'm not using the boot_cmd parameter @kempy007

Thanks @sudomateo :)

schmandforke commented 6 years ago

found a line in esxi logs:

2018-06-12T09:57:39.190Z host001.local Hostd: warning hostd[F3C2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5a376560-523eed04-1d11-f84897828640/AutoBuildTemplate009_3/AutoBuildTemplate009.vmx opID=32cd7e53-01-1b-5afc user=vpxuser:foobar] CannotRetrieveCorefiles: VM is in an invalid state
2018-06-12T09:57:39.225Z host001.local Hostd: warning hostd[F3C2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5a376560-523eed04-1d11-f84897828640/AutoBuildTemplate009_3/AutoBuildTemplate009.vmx opID=32cd7e53-01-1b-5afc user=vpxuser:foobar] File - failed to get objectId, '/vmfs/volumes/5a376560-523eed04-1d11-f84897828640/AutoBuildTemplate009_3/AutoBuildTemplatte009.vmx': One of the parameters supplied is invalid.

seems to be invalid parameter in the vmx file:

sched.mem.pin = "TRUE"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.min = "4096"
sched.mem.minSize = "4096"
sched.mem.shares = "normal"

=> https://kb.vmware.com/s/article/2085907

maybe this is the issue, if i remove the lines above, the vm is not hanging :)

embusalacchi commented 6 years ago

Here's the generated .vmx from the not-working version of the plugin (vsphere-iso):

.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "11"
nvram = "windows2016-full-L1.ptest1.nvram"
pciBridge0.present = "TRUE"
svga.present = "TRUE"
pciBridge4.present = "TRUE"
pciBridge4.virtualDev = "pcieRootPort"
pciBridge4.functions = "8"
pciBridge5.present = "TRUE"
pciBridge5.virtualDev = "pcieRootPort"
pciBridge5.functions = "8"
pciBridge6.present = "TRUE"
pciBridge6.virtualDev = "pcieRootPort"
pciBridge6.functions = "8"
pciBridge7.present = "TRUE"
pciBridge7.virtualDev = "pcieRootPort"
pciBridge7.functions = "8"
vmci0.present = "TRUE"
hpet0.present = "TRUE"
numvcpus = "2"
memSize = "16384"
sched.cpu.units = "mhz"
scsi0.virtualDev = "pvscsi"
scsi0.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "windows2016-full-L1.ptest1.vmdk"
scsi0:0.present = "TRUE"
ethernet0.virtualDev = "vmxnet3"
ethernet0.dvs.switchId = "46 65 0b 50 2c 39 e7 3b-ef da 69 ea 94 a4 ec e0"
ethernet0.dvs.portId = "168"
ethernet0.dvs.portgroupId = "dvportgroup-36"
ethernet0.dvs.connectionId = "1675707751"
ethernet0.addressType = "vpx"
ethernet0.generatedAddress = "00:50:56:8b:4c:07"
ethernet0.uptCompatibility = "TRUE"
ethernet0.present = "TRUE"
displayName = "windows2016-full-L1.ptest1"
guestOS = "windows9srv-64"
uuid.bios = "42 0b 30 ac 5e 15 7d d5-61 ba be b0 2d a3 d4 51"
vc.uuid = "50 0b b5 4e a6 6e d7 77-ab 89 74 59 d3 f2 b6 61"
sata0.present = "TRUE"
sata0:0.deviceType = "cdrom-image"
sata0:0.fileName = "/vmfs/volumes/419f8a8d-852e81cc-0000-000000000000/ISOs/SW_DVD9_Win_Svr_STD_Core_and_DataCtr_Core_2016_64Bit_English_-2_MLF_X21-22843.ISO"
sata0:0.present = "TRUE"
sata0:1.deviceType = "cdrom-image"
sata0:1.fileName = "/vmfs/volumes/419f8a8d-852e81cc-0000-000000000000/ISOs/VMware-tools-windows-10.0.9-3917699.iso"
sata0:1.present = "TRUE"
floppy0.fileType = "file"
floppy0.fileName = "packer-tmp-created-floppy.flp"
bios.hddOrder = "scsi0:0"
bios.bootOrder = "hdd,cdrom,cdrom"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.min = "0"
sched.mem.minSize = "0"
sched.mem.shares = "normal"
floppy0.clientDevice = "FALSE"
virtualHW.productCompatibility = "hosted"
sched.swap.derivedName = "/vmfs/volumes/419f8a8d-852e81cc-0000-000000000000/windows2016-full-L1.ptest1/windows2016-full-L1.ptest1-49feaf28.vswp"
uuid.location = "56 4d 28 31 3d 59 be 45-05 42 a4 45 e6 44 d4 bb"
replay.supported = "FALSE"
replay.filename = ""
migrate.hostlog = "./windows2016-full-L1.ptest1-49feaf28.hlog"
scsi0:0.redo = ""
pciBridge0.pciSlotNumber = "17"
pciBridge4.pciSlotNumber = "21"
pciBridge5.pciSlotNumber = "22"
pciBridge6.pciSlotNumber = "23"
pciBridge7.pciSlotNumber = "24"
scsi0.pciSlotNumber = "160"
ethernet0.pciSlotNumber = "192"
vmci0.pciSlotNumber = "32"
sata0.pciSlotNumber = "33"
scsi0.sasWWID = "50 05 05 6c 5e 15 7d d0"
vmci0.id = "765711441"
vm.genid = "7927475490584792074"
vm.genidX = "3729250489358533764"
monitor.phys_bits_used = "42"
vmotion.checkpointFBSize = "4194304"
vmotion.checkpointSVGAPrimarySize = "4194304"
cleanShutdown = "FALSE"
softPowerOff = "FALSE"

embusalacchi commented 6 years ago

[Screenshot attachment: prd2ts01_-_2560x1440]

embusalacchi commented 6 years ago

[Attachment: vmware.log]

embusalacchi commented 6 years ago

What's strange is that if you do a RESET on the VM, it will boot perfectly and go through the install. If you compare the .vmx when "broken" with the .vmx after the reset, they are identical. So it's not entirely clear where the issue is, unless it boots with a bad value but vSphere writes out a good value that it uses when it boots after the reset?

chris-david-taylor commented 6 years ago

Hi @schmandforke, Can you possibly try and get the vmx file generated by 2.0beta4 and then do a “diff” and post it here please? I think there could be a workaround; if we know what the specific invalid parameters are, we might be able to set them in the vmx config of the packerfile.
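For anyone who wants to make that comparison without eyeballing two long files, here is a minimal Go sketch of a key-by-key .vmx diff. The helper names `parseVMX` and `diffVMX` are hypothetical, and the sample contents in `main` are made up for illustration; real input would be the text of the two .vmx files.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// parseVMX reads `key = "value"` lines into a map and skips anything else.
func parseVMX(text string) map[string]string {
	settings := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		parts := strings.SplitN(scanner.Text(), "=", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.TrimSpace(parts[0])
		value := strings.Trim(strings.TrimSpace(parts[1]), `"`)
		settings[key] = value
	}
	return settings
}

// diffVMX lists keys that were added (+), removed (-), or changed (~).
func diffVMX(before, after map[string]string) []string {
	var changes []string
	for key, newValue := range after {
		oldValue, existed := before[key]
		switch {
		case !existed:
			changes = append(changes, fmt.Sprintf("+ %s = %q", key, newValue))
		case oldValue != newValue:
			changes = append(changes, fmt.Sprintf("~ %s: %q -> %q", key, oldValue, newValue))
		}
	}
	for key, oldValue := range before {
		if _, kept := after[key]; !kept {
			changes = append(changes, fmt.Sprintf("- %s = %q", key, oldValue))
		}
	}
	sort.Strings(changes)
	return changes
}

func main() {
	// Made-up sample contents, not taken from any real build.
	older := "numvcpus = \"2\"\nmemSize = \"4096\"\n"
	newer := "numvcpus = \"2\"\nmemSize = \"4096\"\nsched.cpu.min = \"0\"\n"
	for _, line := range diffVMX(parseVMX(older), parseVMX(newer)) {
		fmt.Println(line)
	}
}
```

Comparing by key rather than by line means the order in which vSphere happens to write the settings does not show up as noise.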

embusalacchi commented 6 years ago

@chris-david-taylor here you go - this is from beta4 - windows2016-fbeta4.vmx.txt

embusalacchi commented 6 years ago

@chris-david-taylor from v2 - windows2016-full-L1.v2.vmx.txt

embusalacchi commented 6 years ago

@chris-david-taylor I was looking through the go code (I don't really know go) to see if I could figure out what's being set wrong. If you look at my previous comments the VM works after a reset (and not changing anything else). The .vmx from the slow version and the fast version after the reset seems identical so it's almost like it starts up with a bad param but vSphere fixes it? Not sure - I might just be missing something.

chris-david-taylor commented 6 years ago

Hi @embusalacchi, I've looked at those logs, and the difference is that the plugin now seems to set these values, whereas before it didn't.
I'd say we need to add something like this to our packerfiles, but we'll have to experiment to find what the correct values should be. Maybe you can grab those from the console in vSphere? I'm not back at work until tomorrow to test, though:

"vmx_data": {
    "sched.mem.pin": "TRUE",
    "sched.cpu.min": "0",
    "sched.cpu.shares": "normal",
    "sched.mem.min": "4096",
    "sched.mem.minSize": "4096",
    "sched.mem.shares": "normal",
    "sched.cpu.units": "mhz"
}

xenithorb commented 6 years ago

Yeah I think you're onto something here. When you look at the settings in vCenter, "CPU Limit" is set to "0MHz" instead of what it would normally be which is "Unlimited"

xenithorb commented 6 years ago

Following that suspicion, I think I now have a viable workaround:

        "CPU_limit": -1,

In your .json seems to do the trick.

schmandforke commented 6 years ago

confirmed that

 "CPU_limit": -1

worked for me !

chris-david-taylor commented 6 years ago

Yay! Let’s leave this open as it will hopefully help with debugging.

dominikmueller commented 6 years ago

I've also had the problem that my ubuntu-1604 installation got stuck at the setup. Can confirm, that it works with CPU_limit set to -1 👍

xenithorb commented 6 years ago

I really, really don't know very much about Go, but I did some digging and I think the bug might be on Line 40400 (not a typo) of this file: https://raw.githubusercontent.com/vmware/govmomi/master/vim25/types/types.go

40395 type ResourceAllocationInfo struct {
40396     DynamicData
40397
40398     Reservation           *int64      `xml:"reservation"`
40399     ExpandableReservation *bool       `xml:"expandableReservation"`
40400     Limit                 *int64      `xml:"limit"`
40401     Shares                *SharesInfo `xml:"shares,omitempty"`
40402     OverheadLimit         *int64      `xml:"overheadLimit"`
40403 }

I get the feeling that it should be xml:"limit,omitempty", potentially meaning that this could be an upstream bug.
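One way to probe that suspicion is a small experiment with Go's standard encoding/xml package. This is only a sketch: `allocation` and `allocationOmitEmpty` are hypothetical, cut-down stand-ins for the struct quoted above, and it assumes govmomi's XML encoding treats pointers and omitempty the same way the standard library does, which has not been checked here.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// allocation is a hypothetical stand-in that keeps only the Limit field,
// tagged the way the quoted struct tags it.
type allocation struct {
	XMLName xml.Name `xml:"alloc"`
	Limit   *int64   `xml:"limit"`
}

// allocationOmitEmpty is the same struct with the proposed omitempty tag.
type allocationOmitEmpty struct {
	XMLName xml.Name `xml:"alloc"`
	Limit   *int64   `xml:"limit,omitempty"`
}

// marshal returns the XML encoding of v, or the error text on failure.
func marshal(v interface{}) string {
	out, err := xml.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	zero := int64(0)
	// Limit left nil: the element is skipped entirely.
	fmt.Println(marshal(allocation{}))
	// Limit pointing at 0: an explicit zero limit is written out.
	fmt.Println(marshal(allocation{Limit: &zero}))
	// omitempty only drops nil pointers, so the zero is still written out.
	fmt.Println(marshal(allocationOmitEmpty{Limit: &zero}))
}
```

With the standard library, the first call prints `<alloc></alloc>` and the other two both print `<alloc><limit>0</limit></alloc>`. If govmomi behaves the same way, the tag alone would not explain a 0 MHz limit, and the place to look would be whatever code fills in Limit with a pointer to zero.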

I tried searching through this project to determine how defaults are set, but there don't seem to be any, so I'm not sure how the project maintainer would want to work around this bug by setting one for CPULimit. @mkuzmin, we could sure use your help here! Thanks :)

embusalacchi commented 6 years ago

I don't know enough about go either - but I believe some values are being pulled in from the vmware Go API project.

sudomateo commented 6 years ago

Can confirm "CPU_limit": -1, works for me too. Running ESXi 6.5 using the latest 2.0 plugin release. I'm building CentOS 7 machines.

bijujo commented 6 years ago

"CPU_limit": -1 worked for me too in ESXi 6.5. Thanks.

kempy007 commented 6 years ago

"CPU_limit": -1, worked in problem environment for me too. Seems to be in older version than 6.5 update2 of esxi image. Confirmed issue is present in version 'ESXi 6.5 U1 VMSA-2018-0004.3*'

Can someone update readme.md to add the above workaround as strongly recommended to avoid this issue?

xenithorb commented 6 years ago

> Can someone update readme.md to add the above workaround as strongly recommended to avoid this issue?

I'd argue instead that this needs to be fixed so that "Unlimited" is the default ...... or perhaps there's a way to consume an object from the API that actually reveals the cluster defaults?

I'd really hope that this doesn't just become an obligatory setting.
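As a rough illustration of what "Unlimited as the default" could look like in code, here is a minimal Go sketch. It is not the plugin's actual code: `effectiveCPULimit` is a hypothetical helper, and it assumes that an unset CPU_limit arrives as 0 and that -1 means "no limit", as the workaround above suggests.

```go
package main

import "fmt"

// unlimited is the value the workaround uses for "no upper limit".
const unlimited int64 = -1

// effectiveCPULimit is a hypothetical helper, not code from the plugin.
// It treats an unset (zero) CPU_limit as "do not send a limit" by
// returning nil, and passes any other value through unchanged.
func effectiveCPULimit(configured int64) *int64 {
	if configured == 0 {
		return nil
	}
	return &configured
}

func main() {
	for _, configured := range []int64{0, unlimited, 2000} {
		if limit := effectiveCPULimit(configured); limit == nil {
			fmt.Printf("CPU_limit=%d -> limit not sent\n", configured)
		} else {
			fmt.Printf("CPU_limit=%d -> limit=%d\n", configured, *limit)
		}
	}
}
```

Returning nil rather than substituting -1 leaves the decision to vSphere, so a template that never mentions CPU_limit would behave as it did before the option existed. The trade-off is that an explicit limit of 0 MHz can no longer be expressed, which seems acceptable given how a 0 MHz limit behaves in this thread.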

sudomateo commented 6 years ago

@xenithorb I agree. This needs to be addressed in the code, whether upstream or in this plugin.

chris-david-taylor commented 6 years ago

@sudomateo - I’ll file a bug upstream with VMware at some point today. :)

calebherbison commented 6 years ago

Works for me. ESXi 6.0, 2.0 vsphere-iso plugin, CentOs 7 Minimal

jcoconnor commented 6 years ago

FWIW I'm seeing behaviour like this regularly, especially on Win-10 machines when I apply the cumulative updates. Resetting through VMRC helps, but the machine then hangs again. I'm researching it with our infrastructure folks to see if there are any issues with our vCenter.

jcoconnor commented 6 years ago

Confirming that adding "CPU_limit": -1 improves things a lot. Also set

"svga.vramSize"    : "134217728",
"svga.autodetect"  : "FALSE",
"svga.maxWidth"    : "1680",
"svga.maxHeight"   : "1050"

sparky005 commented 6 years ago

The CPU_limit fix worked for me as well. Can this at least be set as the default? It will probably save lots of people a lot of time.

chris-david-taylor commented 6 years ago

The fix belongs in VMware’s upstream libraries. I’ve submitted a bug which I should check up on, as I’m starting to write my own code that depends on the upstream.

sparky005 commented 6 years ago

Got it, thanks @chris-david-taylor. Is there a link to the upstream bug? I'd like to follow if possible (maybe other people on this would as well.)

fredex42 commented 6 years ago

just wanted to say thanks for this, CPU_limit fix worked for me too after a frustrating afternoon of VMware builds just locking up for no reason

thor commented 6 years ago

Just another nudge to @chris-david-taylor in linking to the upstream bug, so that we could follow it to the extent possible. I couldn't find the issue in govmomi, but I could easily have been searching for the wrong thing.

chris-david-taylor commented 6 years ago

Sorry, I've been away @thor - Darn it, is this still a problem? I'll dig it out later today, and if I can't find it, I'll refile.

thor commented 6 years ago

@chris-david-taylor I can do a quick check with a build from the latest govmomi sources, if that's what you had in mind? :)

chris-david-taylor commented 6 years ago

If you could please @thor that would be great. If the issue persists I'll pass it up on to the govmomi maintainers. :)

riponbanik commented 6 years ago

Thanks Guys. CPU Limit is the issue

mkuzmin commented 6 years ago

I'm sorry this took so much time. Here is a new release: https://github.com/jetbrains-infra/packer-builder-vsphere/releases/tag/v2.0.1