The state of support for virtualization of darwin/arm64 is poor. There is no solution in the near future for hosting a Mac VM on ARM64 in our current environment (VMware on MacStadium) or in our qemu-hvm environments. This is due to a number of issues, not least the fact that darwin/arm64 uses an iOS-like boot process instead of a typical PC boot process.
However, the AWS buildlets may work well for us. The workflow of creating a macOS instance on AWS is roughly:

1. Allocate a Dedicated Host (macOS instances require one).
2. Launch a macOS instance onto that host.
3. Use the instance; when it is stopped or terminated, the host must be scrubbed before it can be reused.
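As a minimal sketch of that allocate-then-launch workflow using aws-sdk-go-v2 (the region, zone, and AMI ID below are placeholders, not our actual configuration):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// 1. Allocate a Dedicated Host for the mac1.metal instance type.
	host, err := client.AllocateHosts(ctx, &ec2.AllocateHostsInput{
		AvailabilityZone: aws.String("us-east-1a"), // placeholder zone
		InstanceType:     aws.String("mac1.metal"),
		Quantity:         aws.Int32(1),
	})
	if err != nil {
		log.Fatal(err)
	}

	// 2. Launch a macOS instance pinned to that host.
	_, err = client.RunInstances(ctx, &ec2.RunInstancesInput{
		ImageId:      aws.String("ami-0123456789abcdef0"), // placeholder macOS AMI
		InstanceType: types.InstanceTypeMac1Metal,
		MinCount:     aws.Int32(1),
		MaxCount:     aws.Int32(1),
		Placement: &types.Placement{
			HostId:  aws.String(host.HostIds[0]),
			Tenancy: types.TenancyHost,
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	// 3. When the instance is later stopped or terminated, AWS scrubs
	// and re-images the host (the ~1 hour "Pending" state noted below).
}
```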
This comes with some caveats. After an instance is stopped, the Dedicated Host is apparently re-imaged by AWS (it sits in a "Pending" state for about an hour). This means an instance per buildlet is prohibitively expensive at our volume: an hour of downtime between builds would require a very large pool of Dedicated Hosts for the hundreds of darwin builds we do per day.
We could avoid some of the re-image downtime by re-using instances between builds. That makes it harder to keep environments pristine, but we may be able to automate some of that away with APFS filesystem snapshots and by retiring instances on some interval.
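If we go the reuse route, a rough sketch of the snapshot bookkeeping via the system tmutil command might look like the following. (Reverting a volume to a snapshot is the genuinely hard part and is not shown; this only creates and lists snapshots around a build.)

```go
package main

import (
	"log"
	"os/exec"
	"strings"
)

// runTmutil runs the macOS tmutil command with the given arguments
// and returns its combined output, failing loudly on error.
func runTmutil(args ...string) string {
	out, err := exec.Command("tmutil", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("tmutil %s: %v\n%s", strings.Join(args, " "), err, out)
	}
	return string(out)
}

func main() {
	// Snapshot the current (known-good) state before running a build.
	log.Print(runTmutil("localsnapshot"))

	// ... run the build here ...

	// List local snapshots so stale ones can be pruned on an interval.
	log.Print(runTmutil("listlocalsnapshots", "/"))
}
```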
For release builds, we could ensure that we always use a fresh VM that hasn't been tainted by previous builds running on it.
This is less featureful than the always-fresh image approach we have on MacStadium / VMware, but it would allow a reasonably clean environment for our darwin/arm64 builders without too much overhead.
Finally, this approach could be re-used for all of our mac builders. That would have the major benefit of reducing the effort needed to build and maintain new macOS images for each release, which has now doubled with two processor architectures. Our existing approach of building AWS AMIs with Packer should work nicely in this scenario.
Documentation on automatic Dedicated Host provisioning and Releasing: https://docs.aws.amazon.com/license-manager/latest/userguide/host-resource-groups.html
I'm starting to prototype this by testing out AWS Mac instances, first creating some standard reverse builders.
https://go.dev/cl/430696 added an AWS darwin-amd64 reverse builder (`darwin-amd64-12-aws`), which I've set up manually with a launchd service.
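For reference, such a service might look roughly like the plist below. The label, install path, and buildlet flags are illustrative of how I have it set up, not an exact copy of the production service:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Label</key>
	<string>org.golang.buildlet</string>
	<key>ProgramArguments</key>
	<array>
		<string>/Users/gopher/buildlet</string>
		<string>-reverse-type=host-darwin-amd64-12-aws</string>
		<string>-coordinator=farmer.golang.org</string>
	</array>
	<key>RunAtLoad</key>
	<true/>
	<key>KeepAlive</key>
	<true/>
</dict>
</plist>
```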
It actually almost "just works". All of the x/ repos seem to be passing. The main Go repo fails on two specific tests, `os/signal.TestDetectNohup` and `TestNohup`: https://build.golang.org/log/bee053b56d2f0d612725e4427f5640cdad5cad34
This failure is oddly specific: the system /usr/bin/nohup binary is unhappy. We've hit this before in #5135, inside tmux. It's also a common problem online, though frustratingly I have yet to find a concrete description of the root cause, just workarounds.
One suggestion is that sshd must have PAM enabled. The AWS sshd config does indeed disable PAM, so that may be related (though my launchd service isn't run through ssh, so it isn't clear how that would be relevant).
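For anyone who wants to poke at this outside the test suite, a minimal standalone repro (hypothetical, not the actual test) is just to exec the system nohup and check the result. Since the behavior seems to depend on the session context, it would need to be run from the same launchd service to reproduce:

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Run the system nohup the way the os/signal tests do and
	// report whether it exits cleanly.
	out, err := exec.Command("/usr/bin/nohup", "/usr/bin/true").CombinedOutput()
	fmt.Printf("err=%v output=%q\n", err, out)
}
```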
I'm going to investigate running QEMU on these instances, so I'll pause the investigation into `os/signal` for now, since it may just not be an issue in QEMU guests.
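The shape of the QEMU setup I have in mind is roughly the following (a sketch of what a launcher like cmd/runqemubuildlet could do; the disk image name is hypothetical, and a real macOS guest needs additional firmware and device configuration not shown here):

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("qemu-system-x86_64",
		"-machine", "q35",
		"-accel", "hvf", // use Hypervisor.framework on macOS hosts
		"-smp", "4",
		"-m", "8192",
		"-snapshot", // discard guest writes, keeping the base image pristine
		"-drive", "file=macos.qcow2,if=virtio", // hypothetical guest disk image
		"-display", "none",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("qemu: %v", err)
	}
}
```

The -snapshot flag is what would give us a somewhat-clean environment per boot without re-imaging the whole host.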
Change https://go.dev/cl/432115 mentions this issue: cmd/buildlet: allow halt of macOS QEMU VMs
Change https://go.dev/cl/432396 mentions this issue: dashboard: increase host-darwin-amd64-12-aws count
Change https://go.dev/cl/432395 mentions this issue: env/darwin: AWS darwin instances
We now have three hosts running six reverse builder VMs fully set up and (almost) ready.
There is one failing test (https://build.golang.org/log/c40b5c45d0dc28318fd9ad0149efddfe39ff27d7) because of an extra deprecation warning printed by bash.
Once the builds are working, the next steps (short and long term) will be:
Change https://go.dev/cl/432857 mentions this issue: dashboard: add all AWS darwin-amd64 builders
Change https://go.dev/cl/432856 mentions this issue: env/darwin/aws: don't quote extra args
Change https://go.dev/cl/432860 mentions this issue: env/darwin/aws: update docs
Change https://go.dev/cl/432859 mentions this issue: dashboard: make darwin-amd64-aws race builder actually run race
Change https://go.dev/cl/442255 mentions this issue: env/darwin/aws: switch to vmnet-shared networking
Change https://go.dev/cl/448435 mentions this issue: dashboard: add darwin 13 (Ventura) amd64 builders on AWS
Change https://go.dev/cl/449877 mentions this issue: cmd/runqemubuildlet: add darwin support
Change https://go.dev/cl/449876 mentions this issue: cmd/runqemubuildlet: select windows support with a flag
Change https://go.dev/cl/449875 mentions this issue: env/darwin/aws: assign static IPs to each guest
Change https://go.dev/cl/453956 mentions this issue: dashboard,internal/releasetargets: run AMD64 Macs on AWS, build 1.20 with 13
Change https://go.dev/cl/456055 mentions this issue: cmd/runqemubuildlet: use sudo kill to signal on darwin
Change https://go.dev/cl/456042 mentions this issue: cmd/runqemubuildlet: run as root on darwin
Currently our AWS darwin-amd64 builder guests take ~4 minutes to boot. This is much slower than guests on MacStadium were (I'm told those were closer to 10s). A 4-minute boot time is a significant drag on capacity: subrepo tests are often much shorter than that, meaning we spend more time booting than running tests.
I dug into this a bit yesterday:
A profile shows ~15% of all cycles spent in `hvf_vcpu_exec` -> `qemu_mutex_lock_iothread` / `qemu_mutex_unlock_iothread`, which sounds like lock contention to me. Indeed, this lock is held essentially unconditionally for the duration of all VM exits: https://gitlab.com/qemu-project/qemu/-/blob/master/target/i386/hvf/hvf.c#L453. OTOH, the KVM backend avoids taking this lock for many (but not all) VM exit reasons.
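As a toy illustration of the effect (not QEMU code): when a single lock is held across every exit, exit handling serializes across vCPUs, so adding vCPUs adds contention rather than throughput:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// run simulates vcpus goroutines that each take a global lock on
// every "VM exit", like the big iothread lock in QEMU's HVF backend.
func run(vcpus, exits int) time.Duration {
	var mu sync.Mutex
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < vcpus; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < exits; j++ {
				mu.Lock()
				time.Sleep(time.Microsecond) // simulated exit handling under the lock
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	// Total time grows with vCPU count because all exit handling serializes.
	for _, n := range []int{1, 4, 8} {
		fmt.Printf("%d vCPUs: %v\n", n, run(n, 10000))
	}
}
```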
I sent a mail to qemu-discuss@nongnu.org about this, but I'm not sure it went through, as it doesn't appear on the mailing list archive.
Anyways, a quick improvement will be to switch to 4 CPUs, which hopefully is still enough to avoid test timeouts.
I added tracing to QEMU, and these are the VM exit reason counts during boot:
25338097 hvf_vcpu_exit: exit reason 48 (EPT violation)
1465860 hvf_vcpu_exit: exit reason 7 (Interrupt window)
955636 hvf_vcpu_exit: exit reason 1 (External interrupt)
532542 hvf_vcpu_exit: exit reason 12 (HLT instruction)
80699 hvf_vcpu_exit: exit reason 30 (IO instruction)
3485 hvf_vcpu_exit: exit reason 10 (CPUID)
1597 hvf_vcpu_exit: exit reason 31 (RDMSR)
117 hvf_vcpu_exit: exit reason 28 (CR access)
69 hvf_vcpu_exit: exit reason 32 (WRMSR)
7 hvf_vcpu_exit: exit reason 55 (XSETBV)
The only one here that surprises me is HLT exits. It looks like XNU may use this to enter an idle state when a more complex power management subsystem is not (yet?) available: https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32/osfmk/i386/pmCPU.c#L176
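For reference, tallies like the ones above can be produced from a trace log with a few lines of code. This assumes one line per exit containing "exit reason N (...)", matching the output format shown above; the exact format of the patched QEMU tracing may differ:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	// Count VM exit reasons from a QEMU trace log on stdin.
	re := regexp.MustCompile(`exit reason (\d+ \([^)]*\))`)
	counts := make(map[string]int)
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if m := re.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}
	// Print the tallies (order is unspecified).
	for reason, n := range counts {
		fmt.Printf("%10d exit reason %s\n", n, reason)
	}
}
```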
Change https://go.dev/cl/461775 mentions this issue: env/darwin/aws: reduce guest CPU count to 4
For reference, buildlet wait times (`get_buildlet`, seconds):
Before AWS switch (2022-10-15 through 2022-11-29):
Builder | p10 | p50 | p90 | p99 |
---|---|---|---|---|
darwin-amd64-10_14 | 0.035087506 | 57.426122709 | 5409.986448872 | 27029.738440022 |
darwin-amd64-10_15 | 0.035315248 | 51.394408839 | 6330.141066737 | 33205.196756675 |
darwin-amd64-11_0 | 0.03502191 | 17.052950941 | 1265.639361792 | 18768.733989156 |
darwin-amd64-12_0 | 0.036329005 | 113.757215006 | 27884.264082743 | 60902.939780661 |
darwin-amd64-nocgo | 0.036361308 | 152.538390709 | 33514.425559542 | 60922.292660581 |
After AWS switch (2022-11-30 through 2023-01-12):
Builder | p10 | p50 | p90 | p99 |
---|---|---|---|---|
darwin-amd64-10_14 | 0.030347397 | 129.170084819 | 16868.565147715 | 51138.059929517 |
darwin-amd64-10_15 | 0.032052233 | 165.560549426 | 21411.865571222 | 53629.770950752 |
darwin-amd64-11_0 | 0.03205239 | 221.267568342 | 26238.450715285 | 64911.760616257 |
darwin-amd64-12_0 | 0.03138795 | 175.994555331 | 25429.052805333 | 67438.227785737 |
darwin-amd64-13 | 0.038021979 | 4120.744229468 | 370565.129423636 | 429429.994737281 |
darwin-amd64-nocgo | 0.031963242 | 294.516531008 | 36972.023496533 | 66837.479161523 |
Time running tests (`make_and_test`, seconds):
Before:
Builder | p10 | p50 | p90 | p99 |
---|---|---|---|---|
darwin-amd64-10_14 | 1109.337714283 | 1338.203048197 | 1438.635656949 | 1496.908947744 |
darwin-amd64-10_15 | 1142.797376231 | 1345.846128948 | 1442.504664356 | 1515.9424281 |
darwin-amd64-11_0 | 1321.768913047 | 1532.862195206 | 1646.776306831 | 1738.837314762 |
darwin-amd64-12_0 | 1224.030957425 | 1502.473005047 | 1706.57681256 | 1789.397041412 |
darwin-amd64-nocgo | 832.869971342 | 1125.187291301 | 1314.866512538 | 1381.113436051 |
After:
Builder | p10 | p50 | p90 | p99 |
---|---|---|---|---|
darwin-amd64-10_14 | 1026.817217975 | 1941.302826298 | 2523.380160588 | 2841.733411426 |
darwin-amd64-10_15 | 1050.692744331 | 1811.730996635 | 2426.353868054 | 2724.949244972 |
darwin-amd64-11_0 | 2188.55871373 | 2517.004398094 | 2960.169886024 | 3044.606037487 |
darwin-amd64-12_0 | 1949.949176643 | 2486.049366062 | 2748.16159951 | 2832.066993053 |
darwin-amd64-13 | 2200.732157566 | 2933.495288619 | 3134.77419223 | 3351.202110408 |
darwin-amd64-nocgo | 1463.312098452 | 1966.804602269 | 2125.788129161 | 2187.764026593 |
Edit: these are for the Go repo only, not subrepos.
Should we close this issue and mark it as completed? Any additional problems we find can be addressed in more specific issues if necessary.
Closing this since it doesn't seem like there's anything in particular left.
Change https://go.dev/cl/484746 mentions this issue: Revert "env/darwin/aws: reduce guest CPU count to 4"
Currently, the darwin-arm64 buildlets do not run in a clean VM for each test run. Now that QEMU has progressed a bit, we should try running them in a VM again.