UlisesGascon opened this issue.
updated on April 19, 2024
SSH port | Node: macpro-4 | Node: macpro-5 | Node: macpro-6 |
---|---|---|---|
8822 | release-macos11-x64-1 | empty | test-macos11-x64-1 |
8823 | empty | empty | test-macos11-x64-2 |
8824 | empty | test-macos1015-x64-2 | test-macos1015-x64-1 |
8825 | empty | empty | empty |
updated on April 22, 2024
Intel Nodes
SSH port | Node: macpro-4 | Node: macpro-5 | Node: macpro-6 |
---|---|---|---|
8822 | release-macos11-x64-1 | test-macos13-x64-2 | test-macos11-x64-1 |
8823 | test-macos13-x64-1 | release-macos13-x64-1 | test-macos11-x64-2 |
8824 | empty | test-macos1015-x64-2 | test-macos1015-x64-1 |
8825 | empty | empty | empty |
ARM Nodes
We assume that ARM nodes can host only 2 VMs each, rather than the 4+ VMs that Intel nodes supported in the past, due to licensing limitations. This needs to be confirmed with support.
SSH port | Node: arm-1 | Node: arm-2 | Node: arm-3 |
---|---|---|---|
8822 | test-macos11-arm64-1 | release-macos13-arm64-1 | empty |
8823 | release-macos11-arm64-1 | test-macos13-arm64-1 | test-macos13-arm64-2 |
How are the Nearform machines "relocated"?
release-nearform-macos11.0-arm64-1 -> release-orka-macos11-arm64-1
test-nearform-macos11.0-arm64-1 -> test-orka-macos11-arm64-1
release-macos13-x64-2 release-macos13-arm64-2
I don't think it's necessary to have two identical release machines.
test-nearform-macos11.0-arm64-1
Are these typos?
Great feedback @targos! I updated the tables
I don't think it's necessary to have two identical release machines.
We have space for redundancy, but let's remove them for now.
Are these typos?
I made a better reference for the "relocated" machines
release-macos13-x64-2 release-macos13-arm64-2
I don't think it's necessary to have two identical release machines.
Actually, I think we should have one x64 and two arm64 machines, because there are two jobs that run on macos-arm64 during a release (osx11-release-pkg and osx11-arm64-release-tar).
Some questions/thoughts/suggestions:
Node.js does not support a platform version if a vendor has expired support for it. In other words, Node.js does not support running on End-of-Life (EoL) platforms. This is true regardless of entries in the table below.
And the table lists macOS 11 and later. That table may be outdated, as it seems that macOS 11 reached EOL in November 2023.
We assume that ARM nodes can host only 2 VMs each, rather than the 4+ VMs that Intel nodes supported in the past, due to licensing limitations. This needs to be confirmed with support.
https://orkadocs.macstadium.com/docs/apple-arm-based-support confirms this:
IMPORTANT
You can deploy up to 2 VMs per Apple silicon-based node.
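With the 2-VM cap confirmed, the ARM allocation table can be checked mechanically. The sketch below is illustrative only: the dictionary is transcribed by hand from the April 22, 2024 ARM table, and the helper names (`overloaded_nodes`, `free_slots`) are hypothetical, not part of any Orka tooling.

```python
# Apple silicon nodes are capped at 2 VMs each (per the Orka docs quoted above).
ARM_VM_LIMIT = 2

# ARM allocation transcribed from the April 22, 2024 table ("empty" = free slot).
ARM_LAYOUT = {
    "arm-1": {8822: "test-macos11-arm64-1", 8823: "release-macos11-arm64-1"},
    "arm-2": {8822: "release-macos13-arm64-1", 8823: "test-macos13-arm64-1"},
    "arm-3": {8822: "empty", 8823: "test-macos13-arm64-2"},
}


def overloaded_nodes(layout, limit):
    """Return the names of nodes that host more VMs than `limit` allows."""
    over = []
    for node, slots in layout.items():
        used = sum(1 for vm in slots.values() if vm != "empty")
        if used > limit:
            over.append(node)
    return over


def free_slots(layout):
    """Return (node, ssh_port) pairs that are still unassigned."""
    return [
        (node, port)
        for node, slots in layout.items()
        for port, vm in slots.items()
        if vm == "empty"
    ]


if __name__ == "__main__":
    print(overloaded_nodes(ARM_LAYOUT, ARM_VM_LIMIT))  # []
    print(free_slots(ARM_LAYOUT))  # [('arm-3', 8822)]
```

The same check applies to the Intel table by swapping in its layout and the Intel per-node limit.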
Past issues with long-lived macOS workers (for example, disks filling up): https://github.com/nodejs/build/issues/3592, https://github.com/nodejs/build/issues/3685, and others (https://github.com/nodejs/build/issues?q=is%3Aissue+macos+is%3Aclosed+disk).
My suggestion to avoid Jenkins worker decay is to lean into an ephemeral node strategy so that each build has a fresh Orka instance to run on.
We can do that with the following Jenkins plugin for Orka: https://plugins.jenkins.io/macstadium-orka/#plugin-content-ephemeral-agents
We would first need to set up a Packer build process to create our VM images, so that Orka has a baseline image from which to create VMs: https://orkadocs.macstadium.com/docs/packer
The Packer process can leverage our existing Ansible playbooks: https://developer.hashicorp.com/packer/integrations/hashicorp/ansible/latest/components/provisioner/ansible.
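Why ephemeral agents avoid worker decay can be shown with a toy model. This is a sketch under loudly stated assumptions: the baseline size and per-build leftovers are hypothetical numbers, and `Worker` is an illustrative stand-in, not the Orka plugin's API. A reused worker accumulates whatever each build leaves behind, while a fresh VM cloned from the Packer-built baseline never carries state over.

```python
from dataclasses import dataclass

BASELINE_GB = 40.0  # hypothetical disk footprint of the Packer-built baseline image


@dataclass
class Worker:
    """A macOS build VM whose disk usage grows as builds leave files behind."""
    disk_used_gb: float = BASELINE_GB

    def run_build(self, leftover_gb):
        # Each build leaves caches/artifacts on disk (hypothetical amounts).
        self.disk_used_gb += leftover_gb


def long_lived_disk(builds):
    """Disk usage after running every build on a single reused worker."""
    worker = Worker()
    for leftover in builds:
        worker.run_build(leftover)
    return worker.disk_used_gb


def ephemeral_peak_disk(builds):
    """Worst disk usage when every build gets a fresh VM from the baseline."""
    peak = BASELINE_GB
    for leftover in builds:
        worker = Worker()  # fresh VM cloned from the baseline image
        worker.run_build(leftover)
        peak = max(peak, worker.disk_used_gb)
        # The VM is discarded after the build, so nothing carries over.
    return peak


if __name__ == "__main__":
    builds = [5.0, 5.0, 5.0]
    print(long_lived_disk(builds))      # 55.0
    print(ephemeral_peak_disk(builds))  # 45.0
```

The trade-off is that every build pays the VM provisioning time, which is why a well-maintained baseline image matters.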
This strategy would require that we have an Orka 3.0 cluster. Rather than trying to upgrade the existing cluster, I propose that we ask MacStadium to allow us to provision a new cluster with the resources we need in it (enough ARM/Intel backing nodes for our macOS 11/13 testing and release), get it built/provisioned and working, and then decommission/return all the existing MacStadium/Orka machines.
I believe this would end up with us using roughly the same amount of resources, so it should be palatable for MacStadium to support this transition.
+1 from me, if MacStadium will support that.
Quick update from our last call with MacStadium:
Next week we will have a new Orka cluster (v3) that includes 2 nodes (Intel and ARM):
Mac Studio - G1MC M1M/10/32/16/64GB/2TB/10G
Mac mini G4E - i7/3.2Ghz/6C/64G/1T/SSD/10G
Dependencies
:white_check_mark: Set up Jenkins <-> Orka
orka-test for test CI
orka-release for release CI
sa-jenkins-test
sa-jenkins-release
Current status: Completed.
:white_check_mark: Create image templates
macos-13-arm-test.pkr.hcl: https://github.com/nodejs/build/pull/3882
macos-13-intel-test.pkr.hcl: https://github.com/nodejs/build/pull/3882
macos-13-arm-release.pkr.hcl: https://github.com/nodejs/build/pull/3893
macos-13-intel-release.pkr.hcl: https://github.com/nodejs/build/pull/3893
Current status: Completed.
:white_check_mark: Trigger Ephemeral VMs from Jenkins
Current status: Completed
Jobs and Agents Migration
-mmacosx-version-min (see: https://github.com/nodejs/build/issues/3876) in the test CI
-mmacosx-version-min (see: https://github.com/nodejs/build/issues/3876) in the release CI
Current status: @UlisesGascon is working on the setup.
Clean up
Other
Deadline: The idea is to complete this transition within 30 days.
Important
We don't expect any downtime during the migration: the new cluster will run in isolation while the current system stays in place, until we are ready to transfer operations to the new cluster and decommission the hardware.
Challenges
Error: admission webhook "vimage.kb.io" denied the request: cannot delete image "macos13-intel-test-latest.img". The image is being used by one or more VMs: vm-ttdzh. Remove the VMs and try again
namespace
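The admission-webhook error above means an image cannot be deleted while any VM still references it, so cleanup has to remove the dependent VMs first. A minimal sketch of that ordering, using plain Python collections as a hypothetical stand-in for the cluster state (this is not the Orka CLI or API):

```python
def vms_using_image(vms, image):
    """Return the names of VMs (name -> source image) built from `image`."""
    return sorted(name for name, src in vms.items() if src == image)


def delete_image(images, vms, image, remove_vms=False):
    """Delete `image`, optionally removing the VMs that still use it first.

    Mirrors the webhook behaviour: refuse while dependent VMs exist.
    """
    users = vms_using_image(vms, image)
    if users and not remove_vms:
        raise RuntimeError(
            f'cannot delete image "{image}": used by VMs: {", ".join(users)}'
        )
    for name in users:
        del vms[name]  # remove dependent VMs before the image
    images.discard(image)


if __name__ == "__main__":
    images = {"macos13-intel-test-latest.img"}
    vms = {"vm-ttdzh": "macos13-intel-test-latest.img"}
    try:
        delete_image(images, vms, "macos13-intel-test-latest.img")
    except RuntimeError as err:
        print(err)
    delete_image(images, vms, "macos13-intel-test-latest.img", remove_vms=True)
    print(images, vms)  # set() {}
```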
Based on support ticket SERVICE-188003 and the documentation (https://orkadocs.macstadium.com/docs/compatibility-versions#macos-and-apple-hardware), we might not be able to migrate the macOS ARM VMs to Orka due to lack of support, so we might need to keep the bare-metal machines.
This needs to be confirmed (cc: @ryanaslett).
Current status
I will be on PTO from the 19th to the 25th. I made some changes to the templates to add the missing dependencies (https://github.com/nodejs/build/pull/3906).
So, @ryanaslett, in case you want to help with this during my time off:
Check why the iojs+release-Ulises-test-orca job is not passing. The current error (10:09:52 Makefile:1030: *** No xz command, cannot continue. Stop.) (details) is related to the PATH (I think), as xz has been included on all the machines since my last PR.
Xcode 15.2, based on the discussion with @targos.
Documentation
Probably the next CI errors will be related to users; currently, we only have the admin user. Maybe we need to create a separate one, like iojs, to make the CI pipelines work.
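The xz failure is the typical symptom of a tool that is installed but not on the agent's PATH. A minimal sketch, assuming xz lives in a directory (for example a Homebrew prefix) that the agent's PATH omits, reproduces the lookup with `shutil.which`; the temporary directories are hypothetical stand-ins for real install locations.

```python
import os
import shutil
import stat
import tempfile


def find_tool(tool, path):
    """Resolve `tool` against an explicit PATH string (None if not found)."""
    return shutil.which(tool, path=path)


def missing_tools(tools, path):
    """List the tools a build needs that cannot be found on `path`."""
    return [t for t in tools if find_tool(t, path) is None]


def make_fake_tool(directory, name):
    """Create an executable placeholder file called `name` in `directory`."""
    tool = os.path.join(directory, name)
    with open(tool, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(tool, os.stat(tool).st_mode | stat.S_IXUSR)
    return tool


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as agent_bin, \
         tempfile.TemporaryDirectory() as brew_bin:
        make_fake_tool(brew_bin, "xz")
        # The agent only sees agent_bin, so xz is reported missing...
        print(missing_tools(["xz"], agent_bin))  # ['xz']
        # ...until the install directory is appended to its PATH.
        full_path = os.pathsep.join([agent_bin, brew_bin])
        print(missing_tools(["xz"], full_path))  # []
```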
Check why the iojs+release-Ulises-test-orca is not passing. The current error (10:09:52 Makefile:1030: *** No xz command, cannot continue. Stop.) (details) is related to the PATH (I think), as xz is included on all the machines already since my last PR.
Started in on this.
The PATH variable is set on the existing macOS machines via the script that launches the Jenkins agent:
This template: https://github.com/nodejs/build/blob/main/ansible/roles/jenkins-worker/templates/start.j2#L10 creates a script here: https://github.com/nodejs/build/blob/main/ansible/roles/jenkins-worker/tasks/main.yml#L179-L185
And this template: https://github.com/nodejs/build/blob/main/ansible/roles/jenkins-worker/templates/org.nodejs.osx.jenkins.plist gets put into /Library/LaunchDaemons: https://github.com/nodejs/build/blob/main/ansible/roles/jenkins-worker/vars/main.yml#L33-L37
I've added ARCH, DESTCPU, and PATH to the environment variables of the Orka cluster cloud template configurations on the ci-release machine.
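On the existing machines these variables reach the agent through a LaunchDaemon plist. A minimal sketch of such a plist, built with Python's `plistlib`, shows where the values live. `Label`, `ProgramArguments`, `EnvironmentVariables`, `RunAtLoad` and `KeepAlive` are standard launchd keys; the label, script path and variable values below are hypothetical and are not copied from the nodejs/build template.

```python
import plistlib


def jenkins_agent_plist(label, start_script, env):
    """Build a launchd daemon definition that starts the agent with `env`."""
    return {
        "Label": label,
        "ProgramArguments": ["/bin/bash", start_script],
        "EnvironmentVariables": dict(env),  # exported to the agent process
        "RunAtLoad": True,
        "KeepAlive": True,
    }


if __name__ == "__main__":
    plist = jenkins_agent_plist(
        "org.nodejs.osx.jenkins",  # assumed label, matching the template file name
        "/Users/iojs/start.sh",    # hypothetical start script path
        {
            # Hypothetical values; adjust per architecture.
            "PATH": "/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin",
            "ARCH": "arm64",
            "DESTCPU": "arm64",
        },
    )
    print(plistlib.dumps(plist).decode())
```

For ephemeral Orka agents there is no LaunchDaemon to edit, which is why the same variables have to be set in the cloud template instead.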
The osx13-x64-release-tar job worked and signed the tarball, but failed to push the release to node-www, so that needs to be adjusted next.
We need this config in the image: https://github.com/nodejs/build/blob/main/ansible/roles/release-builder/files/ssh_config
node-www also has a ufw2 firewall and will not allow connections from IP addresses that are not on the allowlist.
I've added the main Orka address (199.7.167.98) to the ufw2 firewall on node-www. I've confirmed that this is the address all ephemeral nodes will appear as to node-www.
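Because every ephemeral VM reaches node-www from the same Orka address, a single allowlist entry covers all of them, no matter how many VMs are created. A minimal sketch of the allowlist check with Python's `ipaddress` module; apart from 199.7.167.98, the entries are hypothetical placeholders taken from a documentation range, not the real node-www rules.

```python
import ipaddress

ALLOWLIST = [
    "199.7.167.98/32",  # main Orka address (from this thread)
    "192.0.2.0/24",     # hypothetical placeholder for other allowed hosts
]


def is_allowed(source_ip, allowlist):
    """Return True if `source_ip` falls inside any allowlisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(
        addr in ipaddress.ip_network(entry, strict=False) for entry in allowlist
    )


if __name__ == "__main__":
    print(is_allowed("199.7.167.98", ALLOWLIST))  # True
    print(is_allowed("199.7.167.99", ALLOWLIST))  # False
```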
I've requested the new nodes from MacStadium to fill out the rest of our capacity, and got a response today that they are aiming to have the nodes installed by Wed, Oct 30th.
Great to see the details and progress on this front.
One thought is that once everything has landed, it would be great to do a deep-dive session for other build team members who are interested in learning a bit more about how it works.
I plan to work on it during the weekend, so I can provide a good overview at the next build meeting on Tuesday.
Current tasks on MacOS infra
Blocked until ARM nodes are provided