The state of support for virtualization of darwin/arm64 is poor. There is no solution in the near future for hosting a Mac VM on ARM64 in our current environment (VMware on MacStadium) or in our qemu-hvm environments. This is due to a number of issues, not least the fact that darwin/arm64 uses an iOS-like boot process instead of a typical PC boot process.
However, the AWS buildlets may work well for us. The workflow of creating a macOS instance on AWS is roughly:

1. Allocate a Dedicated Host (macOS instances require one).
2. Launch a macOS instance onto that host.
3. Use the instance; when it is stopped or terminated, the host must be scrubbed before it can be reused.
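As a minimal sketch of that allocate-then-launch workflow using aws-sdk-go-v2 (the region, zone, and AMI ID below are placeholders, not our actual configuration):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// 1. Allocate a Dedicated Host for the mac1.metal instance type.
	host, err := client.AllocateHosts(ctx, &ec2.AllocateHostsInput{
		AvailabilityZone: aws.String("us-east-1a"), // placeholder zone
		InstanceType:     aws.String("mac1.metal"),
		Quantity:         aws.Int32(1),
	})
	if err != nil {
		log.Fatal(err)
	}

	// 2. Launch a macOS instance pinned to that host.
	_, err = client.RunInstances(ctx, &ec2.RunInstancesInput{
		ImageId:      aws.String("ami-0123456789abcdef0"), // placeholder macOS AMI
		InstanceType: types.InstanceTypeMac1Metal,
		MinCount:     aws.Int32(1),
		MaxCount:     aws.Int32(1),
		Placement: &types.Placement{
			HostId:  aws.String(host.HostIds[0]),
			Tenancy: types.TenancyHost,
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	// 3. When the instance is later stopped or terminated, AWS scrubs
	// and re-images the host (the ~1 hour "Pending" state noted below).
}
```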
This comes with some caveats. After an instance is stopped, the Dedicated Host is apparently re-imaged by AWS (it sits in a "Pending" state for about an hour). This means an instance per buildlet is prohibitively expensive at our volume: an hour of downtime between builds would require a very large pool of Dedicated Hosts for the hundreds of darwin builds we do per day.
We could avoid some of the re-image downtime by re-using instances between builds. That makes it harder to keep environments pristine, but we may be able to automate some of that away with APFS filesystem snapshots and by retiring instances on some interval.
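If we go the reuse route, a rough sketch of the snapshot bookkeeping via the system tmutil command might look like the following. (Reverting a volume to a snapshot is the genuinely hard part and is not shown; this only creates and lists snapshots around a build.)

```go
package main

import (
	"log"
	"os/exec"
	"strings"
)

// runTmutil runs the macOS tmutil command with the given arguments
// and returns its combined output, failing loudly on error.
func runTmutil(args ...string) string {
	out, err := exec.Command("tmutil", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("tmutil %s: %v\n%s", strings.Join(args, " "), err, out)
	}
	return string(out)
}

func main() {
	// Snapshot the current (known-good) state before running a build.
	log.Print(runTmutil("localsnapshot"))

	// ... run the build here ...

	// List local snapshots so stale ones can be pruned on an interval.
	log.Print(runTmutil("listlocalsnapshots", "/"))
}
```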
For release builds, we could ensure that we always use a fresh VM that hasn't been tainted by previous builds running on it.
This is less featureful than the always-fresh image approach we have on MacStadium / VMware, but it would allow a reasonably clean environment for our darwin/arm64 builders without too much overhead.
Finally, this approach could be re-used for all of our mac builders. That would have the major benefit of reducing the effort needed to build and maintain new macOS images for each release, which has now doubled with two processor architectures. Our existing approach of building AWS AMIs with Packer should work nicely in this scenario.
Documentation on automatic Dedicated Host provisioning and Releasing: https://docs.aws.amazon.com/license-manager/latest/userguide/host-resource-groups.html
I'm starting to prototype this by testing out AWS Mac instances, first creating some standard reverse builders.
https://go.dev/cl/430696 added an AWS darwin-amd64 reverse builder (`darwin-amd64-12-aws`), which I've set up manually with a launchd service.
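For reference, such a service might look roughly like the plist below. The label, install path, and buildlet flags are illustrative of how I have it set up, not an exact copy of the production service:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Label</key>
	<string>org.golang.buildlet</string>
	<key>ProgramArguments</key>
	<array>
		<string>/Users/gopher/buildlet</string>
		<string>-reverse-type=host-darwin-amd64-12-aws</string>
		<string>-coordinator=farmer.golang.org</string>
	</array>
	<key>RunAtLoad</key>
	<true/>
	<key>KeepAlive</key>
	<true/>
</dict>
</plist>
```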
It actually almost "just works". All of the x/ repos seem to be passing. The main Go repo fails on two specific tests, `os/signal.TestDetectNohup` and `TestNohup`: https://build.golang.org/log/bee053b56d2f0d612725e4427f5640cdad5cad34
This failure is oddly specific: the system /usr/bin/nohup binary is unhappy. We've hit this before in #5135, inside tmux. It's also a common problem online, though frustratingly I have yet to find a concrete description of the root cause, just workarounds.
One suggestion is that sshd must have PAM enabled. The AWS sshd config does indeed disable PAM, so that may be related (though my launchd service isn't run through ssh, so it isn't clear how that would be relevant).
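For anyone who wants to poke at this outside the test suite, a minimal standalone repro (hypothetical, not the actual test) is just to exec the system nohup and check the result. Since the behavior seems to depend on the session context, it would need to be run from the same launchd service to reproduce:

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Run the system nohup the way the os/signal tests do and
	// report whether it exits cleanly.
	out, err := exec.Command("/usr/bin/nohup", "/usr/bin/true").CombinedOutput()
	fmt.Printf("err=%v output=%q\n", err, out)
}
```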
I'm going to investigate running QEMU on these instances, so I'll pause the investigation into `os/signal` for now, since it may just not be an issue in QEMU guests.
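The shape of the QEMU setup I have in mind is roughly the following (a sketch of what a launcher like cmd/runqemubuildlet could do; the disk image name is hypothetical, and a real macOS guest needs additional firmware and device configuration not shown here):

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("qemu-system-x86_64",
		"-machine", "q35",
		"-accel", "hvf", // use Hypervisor.framework on macOS hosts
		"-smp", "4",
		"-m", "8192",
		"-snapshot", // discard guest writes, keeping the base image pristine
		"-drive", "file=macos.qcow2,if=virtio", // hypothetical guest disk image
		"-display", "none",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("qemu: %v", err)
	}
}
```

The -snapshot flag is what would give us a somewhat-clean environment per boot without re-imaging the whole host.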
Change https://go.dev/cl/432115 mentions this issue: cmd/buildlet: allow halt of macOS QEMU VMs
Change https://go.dev/cl/432396 mentions this issue: dashboard: increase host-darwin-amd64-12-aws count
Change https://go.dev/cl/432395 mentions this issue: env/darwin: AWS darwin instances
We now have three hosts running six reverse builder VMs fully set up and (almost) ready.
There is one failing test (https://build.golang.org/log/c40b5c45d0dc28318fd9ad0149efddfe39ff27d7) because of an extra deprecation warning printed by bash.
Once the builds are working, the next steps (short and long term) will be:
Change https://go.dev/cl/432857 mentions this issue: dashboard: add all AWS darwin-amd64 builders
Change https://go.dev/cl/432856 mentions this issue: env/darwin/aws: don't quote extra args
Change https://go.dev/cl/432860 mentions this issue: env/darwin/aws: update docs
Change https://go.dev/cl/432859 mentions this issue: dashboard: make darwin-amd64-aws race builder actually run race
Change https://go.dev/cl/442255 mentions this issue: env/darwin/aws: switch to vmnet-shared networking
Change https://go.dev/cl/448435 mentions this issue: dashboard: add darwin 13 (Ventura) amd64 builders on AWS
Change https://go.dev/cl/449877 mentions this issue: cmd/runqemubuildlet: add darwin support
Change https://go.dev/cl/449876 mentions this issue: cmd/runqemubuildlet: select windows support with a flag
Change https://go.dev/cl/449875 mentions this issue: env/darwin/aws: assign static IPs to each guest
Change https://go.dev/cl/453956 mentions this issue: dashboard,internal/releasetargets: run AMD64 Macs on AWS, build 1.20 with 13
Change https://go.dev/cl/456055 mentions this issue: cmd/runqemubuildlet: use sudo kill to signal on darwin
Change https://go.dev/cl/456042 mentions this issue: cmd/runqemubuildlet: run as root on darwin
Currently our AWS darwin-amd64 builder guests take ~4 minutes to boot. This is much slower than guests on MacStadium were (I'm told those were closer to 10s). A 4-minute boot time is a significant drag on capacity: subrepo tests are often much shorter than that, meaning we spend more time booting than running tests.
I dug into this a bit yesterday:
A profile shows ~15% of all cycles spent in `hvf_vcpu_exec` -> `qemu_mutex_lock_iothread` / `qemu_mutex_unlock_iothread`, which sounds like lock contention to me. Indeed, this lock is held essentially unconditionally for the duration of all VM exits: https://gitlab.com/qemu-project/qemu/-/blob/master/target/i386/hvf/hvf.c#L453. OTOH, the KVM backend avoids taking this lock for many (but not all) VM exit reasons.
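As a toy illustration of the effect (not QEMU code): when a single lock is held across every exit, exit handling serializes across vCPUs, so adding vCPUs adds contention rather than throughput:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// run simulates vcpus goroutines that each take a global lock on
// every "VM exit", like the big iothread lock in QEMU's HVF backend.
func run(vcpus, exits int) time.Duration {
	var mu sync.Mutex
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < vcpus; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < exits; j++ {
				mu.Lock()
				time.Sleep(time.Microsecond) // simulated exit handling under the lock
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	// Total time grows with vCPU count because all exit handling serializes.
	for _, n := range []int{1, 4, 8} {
		fmt.Printf("%d vCPUs: %v\n", n, run(n, 10000))
	}
}
```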
I sent a mail to qemu-discuss@nongnu.org about this, but I'm not sure it went through, as it doesn't appear on the mailing list archive.
Anyways, a quick improvement will be to switch to 4 CPUs, which hopefully is still enough to avoid test timeouts.
I added tracing to QEMU, and these are the VM exit reason counts during boot:
25338097 hvf_vcpu_exit: exit reason 48 (EPT violation)
1465860 hvf_vcpu_exit: exit reason 7 (Interrupt window)
955636 hvf_vcpu_exit: exit reason 1 (External interrupt)
532542 hvf_vcpu_exit: exit reason 12 (HLT instruction)
80699 hvf_vcpu_exit: exit reason 30 (IO instruction)
3485 hvf_vcpu_exit: exit reason 10 (CPUID)
1597 hvf_vcpu_exit: exit reason 31 (RDMSR)
117 hvf_vcpu_exit: exit reason 28 (CR access)
69 hvf_vcpu_exit: exit reason 32 (WRMSR)
7 hvf_vcpu_exit: exit reason 55 (XSETBV)
The only one here that surprises me is HLT exits. It looks like XNU may use this to enter an idle state when a more complex power management subsystem is not (yet?) available: https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32/osfmk/i386/pmCPU.c#L176
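For reference, tallies like the ones above can be produced from a trace log with a few lines of code. This assumes one line per exit containing "exit reason N (...)", matching the output format shown above; the exact format of the patched QEMU tracing may differ:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	// Count VM exit reasons from a QEMU trace log on stdin.
	re := regexp.MustCompile(`exit reason (\d+ \([^)]*\))`)
	counts := make(map[string]int)
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if m := re.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}
	// Print the tallies (order is unspecified).
	for reason, n := range counts {
		fmt.Printf("%10d exit reason %s\n", n, reason)
	}
}
```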
Change https://go.dev/cl/461775 mentions this issue: env/darwin/aws: reduce guest CPU count to 4
For reference, buildlet wait times (`get_buildlet`, seconds):
Before AWS switch (2022-10-15 through 2022-11-29):
Builder | p10 | p50 | p90 | p99 |
---|---|---|---|---|
darwin-amd64-10_14 | 0.035087506 | 57.426122709 | 5409.986448872 | 27029.738440022 |
darwin-amd64-10_15 | 0.035315248 | 51.394408839 | 6330.141066737 | 33205.196756675 |
darwin-amd64-11_0 | 0.03502191 | 17.052950941 | 1265.639361792 | 18768.733989156 |
darwin-amd64-12_0 | 0.036329005 | 113.757215006 | 27884.264082743 | 60902.939780661 |
darwin-amd64-nocgo | 0.036361308 | 152.538390709 | 33514.425559542 | 60922.292660581 |
After AWS switch (2022-11-30 through 2023-01-12):
Builder | p10 | p50 | p90 | p99 |
---|---|---|---|---|
darwin-amd64-10_14 | 0.030347397 | 129.170084819 | 16868.565147715 | 51138.059929517 |
darwin-amd64-10_15 | 0.032052233 | 165.560549426 | 21411.865571222 | 53629.770950752 |
darwin-amd64-11_0 | 0.03205239 | 221.267568342 | 26238.450715285 | 64911.760616257 |
darwin-amd64-12_0 | 0.03138795 | 175.994555331 | 25429.052805333 | 67438.227785737 |
darwin-amd64-13 | 0.038021979 | 4120.744229468 | 370565.129423636 | 429429.994737281 |
darwin-amd64-nocgo | 0.031963242 | 294.516531008 | 36972.023496533 | 66837.479161523 |
Time running tests (`make_and_test`, seconds):
Before:
Builder | p10 | p50 | p90 | p99 |
---|---|---|---|---|
darwin-amd64-10_14 | 1109.337714283 | 1338.203048197 | 1438.635656949 | 1496.908947744 |
darwin-amd64-10_15 | 1142.797376231 | 1345.846128948 | 1442.504664356 | 1515.9424281 |
darwin-amd64-11_0 | 1321.768913047 | 1532.862195206 | 1646.776306831 | 1738.837314762 |
darwin-amd64-12_0 | 1224.030957425 | 1502.473005047 | 1706.57681256 | 1789.397041412 |
darwin-amd64-nocgo | 832.869971342 | 1125.187291301 | 1314.866512538 | 1381.113436051 |
After:
Builder | p10 | p50 | p90 | p99 |
---|---|---|---|---|
darwin-amd64-10_14 | 1026.817217975 | 1941.302826298 | 2523.380160588 | 2841.733411426 |
darwin-amd64-10_15 | 1050.692744331 | 1811.730996635 | 2426.353868054 | 2724.949244972 |
darwin-amd64-11_0 | 2188.55871373 | 2517.004398094 | 2960.169886024 | 3044.606037487 |
darwin-amd64-12_0 | 1949.949176643 | 2486.049366062 | 2748.16159951 | 2832.066993053 |
darwin-amd64-13 | 2200.732157566 | 2933.495288619 | 3134.77419223 | 3351.202110408 |
darwin-amd64-nocgo | 1463.312098452 | 1966.804602269 | 2125.788129161 | 2187.764026593 |
Edit: these are for the Go repo only, not subrepos.
Should we close this issue and mark it as completed? Any additional problems we find can be addressed in more specific issues if necessary.
Closing this since it doesn't seem like there's anything in particular left.
Change https://go.dev/cl/484746 mentions this issue: Revert "env/darwin/aws: reduce guest CPU count to 4"
Currently, the darwin-arm64 buildlets do not run in a clean VM for each test run. Now that QEMU has progressed a bit, we should try running them in a VM again.