chris-david-taylor closed this issue 6 years ago.
Further debugging: I just spotted this one this morning for Windows 7: "Windows cannot apply the DiskConfiguration in Autounattend.xml".
Well, I spent the entire day on this... and I probably should have come here first.

I am seeing the same situation with vSphere 6.5 w/DRS and the release version of packer-builder-vsphere-iso.exe. At first I thought it was because in the release version the disk size is in MB and not GB, so I had gone from an 80GB partition to an 80MB partition. But after I realized that wasn't what was going on, I spent pretty much all day today trying to figure out what I had done wrong.

The part that threw me off is that the vSphere GUI, when working with the packer VM it has created, is almost totally unresponsive. Shutting down the VM usually fails or errors out a few times. Getting to the console will hang or not connect at all. RAM usage generally tries to consume all of the available RAM for the VM as well. CPU doesn't spike. I/O doesn't spike. Nothing. What I found interesting, though, is that if I left the process running and did a "reset" on the VM through the VMRC, the VM booted normally, sped through the setup, and completed quickly. So, there's some interaction between the release version of the plugin and vSphere during VM creation that wasn't there in prior versions.

Initially I thought it was because I was running packer from a different server than before. Then I realized that on the new server (running Jenkins) I had downloaded a newer version of the plugin. As soon as I switched back to the pre-release version and fixed the disk size (as it was now trying to create an 80,000GB drive), it worked as expected. The version of the plugin that works for me - I don't know the version number - is from 4/12/18.

If there is additional logging or anything else I can do to help you troubleshoot this, please let me know. It is very easy to reproduce.
@embusalacchi looks like it. We'll need some insight into what might have changed between the 2.0beta4 release and the 2.0 release. I've been trying to go build by build from the public TeamCity server located here, but I don't really have the time to do so and keep these hung VMs in my inventory. I also don't know which build corresponds to the 2.0beta4 release, so I can't simply work up from there.
The last commit for 2.0-beta4 was on the 15th of March, adding Cluster Support.
I did a build from the 25th of April, after commit #82, and the issue isn't present there. Hope that is of some help, @sudomateo?
@chris-david-taylor thank you sir. I'll check that build out.
Are you using boot_cmd in Packer? My VMs lock up after this.
I'm not, @kempy007. It's not the boot command that is the issue. @embusalacchi suggested these might all be related, possibly to floppy_media. First of all, does 2.0beta4 work for you, and what OS are you templating?
I'm not an expert in Golang and am currently working 12-hour days, otherwise I'd learn a bit and investigate, but as long as you aren't desperate for WinRM, building from #82 should be OK. Are you comfortable doing that?
I don't mind trying other builds but I don't have the means to build them on my own. Is there a link somewhere? I don't mind trying them as I have time to nail down when it went bad.
RHEL 6, Packer is now 1.2.4. The issue only occurs with boot_cmd; after it is invoked, the VM never uses more than 30MHz and thus appears hung. Restart and shutdown from VMRC may fail. Ctrl+Alt+Del inside the VM does reboot it.
I think it may be related, which is why I wanted to know if you have boot_cmd in your packer file.
I am not using the boot_cmd in my packer file.
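For readers comparing notes on the boot_cmd discussion above, a boot_command in these Packer templates typically looks like the sketch below. This is an illustrative example using the common CentOS/RHEL 6 kickstart-from-floppy pattern; the keystrokes and ks.cfg location are assumptions, not taken from this thread:

"boot_command": [
  "<tab> text ks=hd:fd0:/ks.cfg<enter>"
]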
@embusalacchi Builds can be found here: https://teamcity.jetbrains.com/viewType.html?buildTypeId=PackerVSphere_Build&branch_PackerVSphere=%3Cdefault%3E&tab=buildTypeStatusDiv
Just log in as guest and download the build you want.
Also, I am using boot_command in my packer file.
I'm not using the boot_cmd parameter @kempy007
Thanks @sudomateo :)
Found these lines in the ESXi logs:
2018-06-12T09:57:39.190Z host001.local Hostd: warning hostd[F3C2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5a376560-523eed04-1d11-f84897828640/AutoBuildTemplate009_3/AutoBuildTemplate009.vmx opID=32cd7e53-01-1b-5afc user=vpxuser:foobar] CannotRetrieveCorefiles: VM is in an invalid state
2018-06-12T09:57:39.225Z host001.local Hostd: warning hostd[F3C2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5a376560-523eed04-1d11-f84897828640/AutoBuildTemplate009_3/AutoBuildTemplate009.vmx opID=32cd7e53-01-1b-5afc user=vpxuser:foobar] File - failed to get objectId, '/vmfs/volumes/5a376560-523eed04-1d11-f84897828640/AutoBuildTemplate009_3/AutoBuildTemplatte009.vmx': One of the parameters supplied is invalid.
There seems to be an invalid parameter in the .vmx file:
sched.mem.pin = "TRUE"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.min = "4096"
sched.mem.minSize = "4096"
sched.mem.shares = "normal"
=> https://kb.vmware.com/s/article/2085907
Maybe this is the issue; if I remove the lines above, the VM no longer hangs :)
Here's the generated .vmx from the not-working version of the plugin (vsphere-iso):
.encoding = "UTF-8" config.version = "8" virtualHW.version = "11" nvram = "windows2016-full-L1.ptest1.nvram" pciBridge0.present = "TRUE" svga.present = "TRUE" pciBridge4.present = "TRUE" pciBridge4.virtualDev = "pcieRootPort" pciBridge4.functions = "8" pciBridge5.present = "TRUE" pciBridge5.virtualDev = "pcieRootPort" pciBridge5.functions = "8" pciBridge6.present = "TRUE" pciBridge6.virtualDev = "pcieRootPort" pciBridge6.functions = "8" pciBridge7.present = "TRUE" pciBridge7.virtualDev = "pcieRootPort" pciBridge7.functions = "8" vmci0.present = "TRUE" hpet0.present = "TRUE" numvcpus = "2" memSize = "16384" sched.cpu.units = "mhz" scsi0.virtualDev = "pvscsi" scsi0.present = "TRUE" scsi0:0.deviceType = "scsi-hardDisk" scsi0:0.fileName = "windows2016-full-L1.ptest1.vmdk" scsi0:0.present = "TRUE" ethernet0.virtualDev = "vmxnet3" ethernet0.dvs.switchId = "46 65 0b 50 2c 39 e7 3b-ef da 69 ea 94 a4 ec e0" ethernet0.dvs.portId = "168" ethernet0.dvs.portgroupId = "dvportgroup-36" ethernet0.dvs.connectionId = "1675707751" ethernet0.addressType = "vpx" ethernet0.generatedAddress = "00:50:56:8b:4c:07" ethernet0.uptCompatibility = "TRUE" ethernet0.present = "TRUE" displayName = "windows2016-full-L1.ptest1" guestOS = "windows9srv-64" uuid.bios = "42 0b 30 ac 5e 15 7d d5-61 ba be b0 2d a3 d4 51" vc.uuid = "50 0b b5 4e a6 6e d7 77-ab 89 74 59 d3 f2 b6 61" sata0.present = "TRUE" sata0:0.deviceType = "cdrom-image" sata0:0.fileName = "/vmfs/volumes/419f8a8d-852e81cc-0000-000000000000/ISOs/SW_DVD9_Win_Svr_STD_Core_and_DataCtr_Core_2016_64Bit_English_-2_MLF_X21-22843.ISO" sata0:0.present = "TRUE" sata0:1.deviceType = "cdrom-image" sata0:1.fileName = "/vmfs/volumes/419f8a8d-852e81cc-0000-000000000000/ISOs/VMware-tools-windows-10.0.9-3917699.iso" sata0:1.present = "TRUE" floppy0.fileType = "file" floppy0.fileName = "packer-tmp-created-floppy.flp" bios.hddOrder = "scsi0:0" bios.bootOrder = "hdd,cdrom,cdrom" sched.cpu.min = "0" sched.cpu.shares = "normal" sched.mem.min = "0" sched.mem.minSize = "0" sched.mem.shares = "normal" floppy0.clientDevice = "FALSE" virtualHW.productCompatibility = "hosted" sched.swap.derivedName = "/vmfs/volumes/419f8a8d-852e81cc-0000-000000000000/windows2016-full-L1.ptest1/windows2016-full-L1.ptest1-49feaf28.vswp" uuid.location = "56 4d 28 31 3d 59 be 45-05 42 a4 45 e6 44 d4 bb" replay.supported = "FALSE" replay.filename = "" migrate.hostlog = "./windows2016-full-L1.ptest1-49feaf28.hlog" scsi0:0.redo = "" pciBridge0.pciSlotNumber = "17" pciBridge4.pciSlotNumber = "21" pciBridge5.pciSlotNumber = "22" pciBridge6.pciSlotNumber = "23" pciBridge7.pciSlotNumber = "24" scsi0.pciSlotNumber = "160" ethernet0.pciSlotNumber = "192" vmci0.pciSlotNumber = "32" sata0.pciSlotNumber = "33" scsi0.sasWWID = "50 05 05 6c 5e 15 7d d0" vmci0.id = "765711441" vm.genid = "7927475490584792074" vm.genidX = "3729250489358533764" monitor.phys_bits_used = "42" vmotion.checkpointFBSize = "4194304" vmotion.checkpointSVGAPrimarySize = "4194304" cleanShutdown = "FALSE" softPowerOff = "FALSE"
What's strange is that if you do a RESET on the VM, it will boot perfectly and go through the install. If you compare the .vmx when "broken" and the .vmx after the reset, they are identical. So it's not entirely clear where the issue is, unless it boots with a bad value but vSphere writes out a good value that it uses when it boots after the reset?
Hi @schmandforke, can you possibly try to get the vmx file generated by 2.0beta4, do a "diff" against it, and post it here, please? I think there could be a workaround: if we know what the specific invalid parameters are, we might be able to set them in the vmx config of the packerfile.
@chris-david-taylor here you go - this is from beta4 - windows2016-fbeta4.vmx.txt
@chris-david-taylor from v2 - windows2016-full-L1.v2.vmx.txt
@chris-david-taylor I was looking through the Go code (I don't really know Go) to see if I could figure out what's being set wrong. As I noted in my previous comments, the VM works after a reset (without changing anything else). The .vmx from the slow version and from the fast version after the reset seem identical, so it's almost like it starts up with a bad param but vSphere fixes it? Not sure - I might just be missing something.
Hi @embusalacchi,
I've looked at those logs, and the difference is that the plugin now seems to set these parameters, whereas before it didn't.
I'd say we need to add something like the following to our packerfiles, but we'll have to experiment to find what the correct values should be. Maybe you can grab those from the console in vSphere? I'm not back at work until tomorrow to test, though:
"vmx_data": { "sched.mem.pin": "TRUE", "sched.cpu.min": "0", "sched.cpu.shares": "normal", "sched.mem.min": "4096", "sched.mem.minSize": "4096", "sched.mem.shares": "normal", "sched.cpu.units": "mhz" }
Yeah, I think you're onto something here. When you look at the settings in vCenter, "CPU Limit" is set to "0MHz" instead of what it would normally be, which is "Unlimited".
Following that suspicion, I think I now have a viable workaround. Adding

"CPU_limit": -1,

to your .json seems to do the trick.
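For anyone trying this, a minimal sketch of a builder block with the workaround in place. CPU_limit is the option discussed in this thread; the surrounding field names are from memory of the plugin's README, so check them against your version:

{
  "builders": [
    {
      "type": "vsphere-iso",
      "vcenter_server": "vcenter.example.com",
      "username": "packer@vsphere.local",
      "password": "secret",
      "insecure_connection": true,
      "vm_name": "example-template",
      "CPUs": 2,
      "CPU_limit": -1,
      "RAM": 4096,
      "disk_size": 40960
    }
  ]
}

Note that disk_size is in MB in the 2.0 release, as mentioned earlier in the thread, so 40960 here means a 40GB disk.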
Confirmed that "CPU_limit": -1 worked for me!
Yay! Let’s leave this open as it will hopefully help with debugging.
I've also had the problem that my ubuntu-1604 installation got stuck at the setup. Can confirm that it works with CPU_limit set to -1 👍
I really, really don't know very much about Go, but I did some digging and I think the bug might be on Line 40400 (not a typo) of this file: https://raw.githubusercontent.com/vmware/govmomi/master/vim25/types/types.go
40395 type ResourceAllocationInfo struct {
40396 DynamicData
40397
40398 Reservation *int64 `xml:"reservation"`
40399 ExpandableReservation *bool `xml:"expandableReservation"`
40400 Limit *int64 `xml:"limit"`
40401 Shares *SharesInfo `xml:"shares,omitempty"`
40402 OverheadLimit *int64 `xml:"overheadLimit"`
40403 }
I get the feeling that it should be xml:"limit,omitempty", potentially meaning that this could be an upstream bug.
I tried searching through this project to determine how defaults are set, but there don't seem to be any, so I'm not sure how the project maintainer would want to work around this bug by setting one for CPULimit.
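To make the suspicion concrete, here is a self-contained Go sketch, an assumption about the mechanism rather than the plugin's actual code. It shows how encoding/xml treats a pointer-typed limit field: a nil pointer writes nothing, while a pointer to 0 writes <limit>0</limit>, which vSphere would apply literally as a 0 MHz cap. -1 is the value vSphere treats as "Unlimited", which matches the CPU_limit workaround above:

package main

import (
	"encoding/xml"
	"fmt"
)

// Local stand-in for the relevant fields of govmomi's ResourceAllocationInfo.
type resourceAllocationInfo struct {
	Reservation *int64 `xml:"reservation"`
	Limit       *int64 `xml:"limit"`
}

func main() {
	zero, unlimited := int64(0), int64(-1)

	for _, alloc := range []resourceAllocationInfo{
		{},                  // nil Limit: nothing is written, so the host default applies
		{Limit: &zero},      // suspected plugin behaviour: an explicit 0 MHz cap
		{Limit: &unlimited}, // the workaround: explicitly "Unlimited"
	} {
		out, _ := xml.Marshal(alloc)
		fmt.Println(string(out))
	}
}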
@mkuzmin We could sure use your help here! Thanks :)
I don't know enough about Go either, but I believe some values are being pulled in from the VMware Go API project (govmomi).
Can confirm "CPU_limit": -1 works for me too. Running ESXi 6.5 using the latest 2.0 plugin release. I'm building CentOS 7 machines.
"CPU_limit": -1 worked for me too in ESXi 6.5. Thanks.
"CPU_limit": -1,
worked in problem environment for me too.
It seems to affect ESXi images older than 6.5 Update 2.
Confirmed the issue is present in version 'ESXi 6.5 U1 VMSA-2018-0004.3*'.
Can someone update readme.md to add the above workaround as strongly recommended, to avoid this issue?
I'd argue instead that this needs to be fixed so that "Unlimited" is the default... or perhaps there's a way to consume an object from the API that actually reveals the cluster defaults?
I'd really hope that this doesn't just become some obligatory setting.
@xenithorb I agree. This needs to be addressed in the code, whether upstream or in this plugin.
@sudomateo - I’ll file a bug upstream with VMware at some point today. :)
Works for me. ESXi 6.0, 2.0 vsphere-iso plugin, CentOS 7 Minimal.
FWIW, I'm seeing behaviour like this regularly - especially on Win-10 machines when I apply the cumulative updates. Resetting through VMRC helps, but the machine goes on to hang again. Researching it with our infrastructure folks to see if there are any issues with our vCenter.
Confirming that adding "CPU_limit": -1 improves things a lot. I also set:

"svga.vramSize" : "134217728",
"svga.autodetect" : "FALSE",
"svga.maxWidth" : "1680",
"svga.maxHeight" : "1050"
The CPU_limit fix worked for me as well. Can this at least be set as the default? It will probably save lots of people a lot of time.
The fix belongs in VMware’s upstream libraries. I’ve submitted a bug which I should check up on, as I’m starting to write my own code that depends on the upstream.
Got it, thanks @chris-david-taylor. Is there a link to the upstream bug? I'd like to follow if possible (maybe other people on this would as well.)
Just wanted to say thanks for this; the CPU_limit fix worked for me too, after a frustrating afternoon of VMware builds just locking up for no reason.
Just another nudge to @chris-david-taylor in linking to the upstream bug, so that we could follow it to the extent possible. I couldn't find the issue in govmomi, but I could easily have been searching for the wrong thing.
Sorry, I've been away @thor - Darn it, is this still a problem? I'll dig it out later today, and if I can't find it, I'll refile.
@chris-david-taylor I can do a quick check with a build from the latest govmomi sources, if that's what you had in mind? :)
If you could please @thor that would be great. If the issue persists I'll pass it up on to the govmomi maintainers. :)
Thanks, guys. CPU Limit is the issue.
I'm sorry this took so much time. Here is a new release: https://github.com/jetbrains-infra/packer-builder-vsphere/releases/tag/v2.0.1
The v2.0 plugin seems to have a bug regarding installing Windows (I haven't tried other OSes yet). My present lab runs on vSphere 6.5.
Steps to reproduce:
Confirmed on Windows 2012_r2 and Windows 7.
I'll try and get some logging out of our environment tomorrow; my permissions are too locked down for me to look right now. Part of me thinks this may be related to #112. I have also tried updating Packer to 1.2.3.