munki / createOSXinstallPkg

Tools for packaging OS X installers
Other
317 stars 47 forks source link

package installs successfully, but OS installation does not kick off after reboot. #2

Closed elvisizer closed 9 years ago

elvisizer commented 9 years ago

Hi greg, I'm seeing an intermittant issue with deploying 10.10.1 using createOSXinstallPkg. In some installation attempts, the package created by createOSXinstallPkg will install successfully, but when the machine is rebooted, the OS installation does not actually begin. So far, each case has been solved by running the same pkg again- on the 2nd try, the installation does run. We're seeing this with probably 5-10% of attempted installations. I haven't had any luck tracking down what's causing this in log files. Any ideas where I should start looking, or what might be going on here?

gregneagle commented 9 years ago

When the OS installation doesn't begin -- what does happen?

Is it possible something is causing the reboot to happen a long time after the install is set up?

https://github.com/munki/createOSXinstallPkg/blob/master/Resources/installosxpkg_postflight#L615-L635

If it's "too long" (for some internal-to-Apple definition of "too long", the automated install does not happen, and instead you boot to a UI that looks very similar to the one on the Recovery HD, where you can manually initiate the OS install (or run Disk Utility, etc)

elvisizer commented 9 years ago

We're not seeing that- the affected machines boot from their internal drives. I haven't been able to establish if this occurs on non-encrypted macs, but we did get a report of the boot behavior on a mac with FV2- on that mac, the FV2 disk unlock screen appeared after the reboot. The password field's pips were unusually small. After authenticating, they got the standard OS X login window. Since the small pips behavior is always present when booting into the installer, it seems like the disk unlock screen is trying to start the installer, but then something goes wrong and the OS on the hard drive boots instead.

gregneagle commented 9 years ago

Have you looked on the affected machines to see if the OS X install components are present? You'll find them in a folder named "OS X Install Data" at the root of the drive to be upgraded.

Do you have any other packages/scripts/processes that might be affecting the "bless" status of the startup disk before the machine is restarted? Do you have firmware password set on any of these machines? Anything that sets/alters the startup disk will affect/break the OS X install.

elvisizer commented 9 years ago

The install components are present. We don't have anything that's making any changes to the bless status or the startup disk- I did a sanity check just to be sure, and there's nothing in our environment touching that. No firmware passwords.

elvisizer commented 9 years ago

I am installing the OS pkg from a script, so I can gather some data using that script on the state of the machine after the pkg has been installed, but not rebooted yet. I'm just not sure where to start, or what to look at. Any suggestions for that?

elvisizer commented 9 years ago

got a VM right now that's doing it . . . . .Any ideas for things to check now that I've got my hands on an example of the issue?

elvisizer commented 9 years ago

here's the output of bless --info. I took this after the OS pkg installed, the mac rebooted, and did not begin the installation. bless info after not starting the install I've also got the system.log for that boot if you would like to see it.

elvisizer commented 9 years ago

nvram -p output from the test example screen shot 2015-03-11 at 4 00 01 pm

elvisizer commented 9 years ago

And, here's the system.log . . . tried to guess which lines would be useful, if you need more from this file, let me know . . . .

Mar 11 16:03:28 localhost bootlog[0]: BOOT_TIME 1426115008 0 Mar 11 16:03:29 localhost syslogd[19]: Configuration Notice: ASL Module "com.apple.appstore" claims selected messages. Those messages may not appear in standard system log files or in the ASL database. Mar 11 16:03:29 localhost syslogd[19]: Configuration Notice: ASL Module "com.apple.authd" sharing output destination "/var/log/system.log" with ASL Module "com.apple.asl". Output parameters from ASL Module "com.apple.asl" override any specified in ASL Module "com.apple.authd". Mar 11 16:03:29 localhost syslogd[19]: Configuration Notice: ASL Module "com.apple.authd" claims selected messages. Those messages may not appear in standard system log files or in the ASL database. Mar 11 16:03:29 localhost syslogd[19]: Configuration Notice: ASL Module "com.apple.bookstore" claims selected messages. Those messages may not appear in standard system log files or in the ASL database. Mar 11 16:03:29 localhost syslogd[19]: Configuration Notice: ASL Module "com.apple.eventmonitor" claims selected messages. Those messages may not appear in standard system log files or in the ASL database. Mar 11 16:03:29 localhost syslogd[19]: Configuration Notice: ASL Module "com.apple.install" claims selected messages. Those messages may not appear in standard system log files or in the ASL database. Mar 11 16:03:29 localhost syslogd[19]: Configuration Notice: ASL Module "com.apple.iokit.power" claims selected messages. Those messages may not appear in standard system log files or in the ASL database. Mar 11 16:03:29 localhost syslogd[19]: Configuration Notice: ASL Module "com.apple.mail" claims selected messages. Those messages may not appear in standard system log files or in the ASL database. Mar 11 16:03:29 localhost syslogd[19]: Configuration Notice: ASL Module "com.apple.MessageTracer" claims selected messages. Those messages may not appear in standard system log files or in the ASL database. Mar 11 16:03:29 localhost syslogd[19]: Configuration Notice: ASL Module "com.apple.performance" claims selected messages. Those messages may not appear in standard system log files or in the ASL database. Mar 11 16:03:29 localhost syslogd[19]: Configuration Notice: ASL Module "com.apple.securityd" claims selected messages. Those messages may not appear in standard system log files or in the ASL database. Mar 11 16:03:29 --- last message repeated 6 times --- Mar 11 16:03:29 localhost kernel[0]: Longterm timer threshold: 1000 ms Mar 11 16:03:29 localhost kernel[0]: PMAP: PCID enabled Mar 11 16:03:29 localhost kernel[0]: PMAP: Supervisor Mode Execute Protection enabled Mar 11 16:03:29 localhost kernel[0]: Darwin Kernel Version 13.4.0: Wed Dec 17 19:05:52 PST 2014; root:xnu-2422.115.10~1/RELEASE_X86_64 Mar 11 16:03:29 localhost kernel[0]: vm_page_bootstrap: 483251 free pages and 41037 wired pages Mar 11 16:03:29 localhost kernel[0]: kext submap [0xffffff7f807ab000 - 0xffffff8000000000], kernel text [0xffffff8000200000 - 0xffffff80007ab000] Mar 11 16:03:29 localhost kernel[0]: zone leak detection enabled Mar 11 16:03:29 localhost kernel[0]: "vm_compressor_mode" is 4 Mar 11 16:03:29 localhost kernel[0]: standard timeslicing quantum is 10000 us Mar 11 16:03:29 localhost kernel[0]: standard background quantum is 2500 us Mar 11 16:03:29 localhost kernel[0]: mig_table_max_displ = 74 Mar 11 16:03:29 localhost kernel[0]: TSC Deadline Timer supported and enabled Mar 11 16:03:29 localhost kernel[0]: AppleACPICPU: ProcessorId=0 LocalApicId=0 Enabled Mar 11 16:03:29 localhost kernel[0]: AppleACPICPU: ProcessorId=1 LocalApicId=2 Enabled Mar 11 16:03:29 localhost kernel[0]: calling mpo_policy_init for TMSafetyNet Mar 11 16:03:29 localhost kernel[0]: Security policy loaded: Safety net for Time Machine (TMSafetyNet) Mar 11 16:03:29 localhost kernel[0]: calling mpo_policy_init for Sandbox Mar 11 16:03:29 localhost kernel[0]: Security policy loaded: Seatbelt sandbox policy (Sandbox) Mar 11 16:03:29 localhost kernel[0]: calling mpo_policy_init for Quarantine Mar 11 16:03:29 localhost kernel[0]: Security policy loaded: Quarantine policy (Quarantine) Mar 11 16:03:29 localhost kernel[0]: Copyright (c) 1982, 1986, 1989, 1991, 1993 Mar 11 16:03:29 localhost kernel[0]: The Regents of the University of California. All rights reserved. Mar 11 16:03:29 localhost kernel[0]: MAC Framework successfully initialized Mar 11 16:03:29 localhost kernel[0]: using 10485 buffer headers and 7290 cluster IO buffer headers Mar 11 16:03:29 localhost kernel[0]: AppleKeyStore starting (BUILT: Aug 17 2014 20:21:39) Mar 11 16:03:29 localhost kernel[0]: IOAPIC: Version 0x11 Vectors 64:87 Mar 11 16:03:29 localhost kernel[0]: ACPI: sleep states S4 S5 Mar 11 16:03:29 localhost kernel[0]: pci (build 20:04:33 Aug 17 2014), flags 0x63008, pfm64 (42 cpu) 0x3ff80000000, 0x80000000 Mar 11 16:03:29 localhost kernel[0]: [ PCI configuration begin ] Mar 11 16:03:29 localhost kernel[0]: mcache: 2 CPU(s), 64 bytes CPU cache line size Mar 11 16:03:29 localhost kernel[0]: mbinit: done [64 MB total pool size, (42/21) split] Mar 11 16:03:29 localhost kernel[0]: Pthread support ABORTS when sync kernel primitives misused Mar 11 16:03:29 localhost kernel[0]: rooting via boot-uuid from /chosen: C453948D-EF70-38AB-B296-8F80F625339D Mar 11 16:03:29 localhost kernel[0]: Waiting on IOProviderClassIOResourcesIOResourceMatchboot-uuid-media Mar 11 16:03:29 localhost kernel[0]: com.apple.AppleFSCompressionTypeZlib kmod start Mar 11 16:03:29 localhost kernel[0]: com.apple.AppleFSCompressionTypeLZVN kmod start Mar 11 16:03:29 localhost kernel[0]: com.apple.AppleFSCompressionTypeDataless kmod start Mar 11 16:03:29 localhost kernel[0]: com.apple.AppleFSCompressionTypeZlib load succeeded Mar 11 16:03:29 localhost kernel[0]: com.apple.AppleFSCompressionTypeLZVN load succeeded Mar 11 16:03:29 localhost kernel[0]: com.apple.AppleFSCompressionTypeDataless load succeeded Mar 11 16:03:29 localhost kernel[0]: console relocated to 0xf0000000 Mar 11 16:03:29 localhost kernel[0]: [ PCI configuration end, bridges 35, devices 12 ] Mar 11 16:03:29 localhost kernel[0]: USBF: 0.175 AppleUSBEHCI::CheckSleepCapability - controller will be unloaded across sleep Mar 11 16:03:29 localhost kernel[0]: USBF: 0.185 AppleUSBEHCI::CheckSleepCapability - controller will be unloaded across sleep Mar 11 16:03:29 localhost kernel[0]: Intel82574L::start - Built Aug 17 2014 20:10:03 -- running on device at b3 d0 f0 Mar 11 16:03:29 localhost kernel[0]: CoreStorage: fsck_cs has finished for group "49E4F54E-6041-48A4-AB21-E314D7E21A99" with status 0x00 Mar 11 16:03:29 localhost kernel[0]: Got boot device = IOService:/AppleACPIPlatformExpert/PCI0@0/AppleACPIPCI/P2P0@11/IOPP/S5F0@4/AppleAHCI/PRT0@0/IOAHCIDevice@0/AppleAHCIDiskDriver/IOAHCIBlockStorageDevice/IOBlockStorageDriver/VMware Virtual SATA Hard Drive Media/IOGUIDPartitionScheme/Macintosh HD@2/CoreStoragePhysical/Macintosh HD/Macintosh HD Mar 11 16:03:29 localhost kernel[0]: BSD root: disk1, major 1, minor 4 Mar 11 16:03:29 localhost kernel[0]: hfs: mounted Macintosh HD on device root_device Mar 11 16:03:28 localhost com.apple.launchd[1]: * launchd[1] has started up. * Mar 11 16:03:28 localhost com.apple.launchd[1]: * Verbose boot, will log to /dev/console. *

gregneagle commented 9 years ago

bless --info looks right.

nvram stuff: I wonder about the differing UUIDs in the efi-boot-device and the install-product-url; they seem to point to different disks.

Use diskutil list to get a list of identifiers, thenL diskutil info diskXsY for the each identifier to find the UUIDs and see how they match up to what you are seeing in the nvram settings.

elvisizer commented 9 years ago

here's the diskutil list output: diskutil -list

The UUID listed in nvram for the x-osproduct corresponds to /dev/disk1: c45 dev disk1

The other UUID (AF9B04AE-86E8-4D6B-AA8B-D2C87A5ABB5E) doesn't appear anywhere that I can find in diskutil's output for any device or disk.

I captured the nvram-p output from the same VM just before a successful run as well (yay for snapshots!) and it looks exactly the same as before an unsuccessful attempt: screen shot 2015-03-11 at 5 53 48 pm

One other random piece of information: I just noticed that after unlocking the disk, if the OS install is going to start, you get the Yosemite-style progress bar. If the OS install is NOT going to start, after unlocking the disk I get a Mavericks-style spinning gear.

elvisizer commented 9 years ago

Any ideas for next steps, Greg?

gregneagle commented 9 years ago

Nope! Unless you have a way to get me a copy of a VM (VMware Fusion format) that displays the behavior...

elvisizer commented 9 years ago

I've got the VM, just need a way to get it to you . . . .You aren't going to MacBrained tonight are you ;)?

gregneagle commented 9 years ago

No, that's a rather long drive from the LA area...

elvisizer commented 9 years ago

heh, well, it's big VM file since FV2 is enabled. Maybe it'd be easier to set up a 10.9 VM on your side, then I can send you our installer pkg and instructions for forcing the problem? I'd be fine with giving you my VM itself, just not sure how to get such a big file to you without shipping a USB drive.

gregneagle commented 9 years ago

If you have a repeatable, reliable way to reproduce the problem in a way I can look at it, I'm game to try your instructions.

elvisizer commented 9 years ago

ok, I'll write that up as soon as I can . . . should have that for you tomorrow. And thanks again for continuing to work with me on this, really appreciate it!!

elvisizer commented 9 years ago

Ok, I THINK I've solved this. It looks like the problem was due to uninstalling Symantec Endpoint Protection before running the OS install pkg from script on an encrypted Mac. The problem occurs only in that scenario. If I run the SEP uninstaller by double-clicking it's pkg in Finder, and then double-click the OS installer as well, everything works as expected. It's only if the 2 installers are run from a script together, and only on a mac that is encrypted. So, I'm moving the SEP uninstall to a launchd that'll execute AFTER the os installation. So far, no issues kicking off the os install when run from script as long as the SEP uninstaller isn't run first.

gregneagle commented 9 years ago

Follow up question: what is the SEP uninstaller doing that affects the OS X installer boot environment? A look at the package scripts (preinstall/postinstall) might reveal clues (or even a smoking gun)

elvisizer commented 9 years ago

It's a shell script from symantec that I've wrapped in a pkg, so there's just one script to vet. I took a look through it, and I really don't see what could be doing it. The most suspect part seems like it might be the function that re-writes the loginwindow.plist file, but that seems like a stretch. I don't see anything touching nvram, or anything related to the startup disk. Here's a link to the script: http://pastebin.com/gEMAkufc

gregneagle commented 9 years ago

So here's my guess:

The script removes several .kext files from /System/Library/Extensions and /Library/Extensions. This triggers a rebuild of the kextcaches. (this can take several minutes to complete.)

This rebuild is (sometimes/often/occassionally) still occurring when the OS X install pkg is doing its thing. Since the kextcache is involved in the boot process, it's possible that if the cache update concludes after the postflight script in the OS X install pkg runs, it affects the boot environment such that the OS X installer boot environment is no longer used and the "normal" OS X boot is back in play.

elvisizer commented 9 years ago

That would make sense . . . . Well, I think we're done here. Thanks again, Greg!