nodejs / build

Better build and test infra for Node.

Migrate `*-joyent` machines to new Equinix system. OUTAGE required on 8th December #3108

Closed · sxa closed this 1 year ago

sxa commented 1 year ago

Spinning this out of https://github.com/nodejs/build/issues/3104 since the existing machines are currently back online temporarily.

In February 2021 some of our systems were migrated from Joyent's data centers to Equinix, using an account managed by Joyent team members that was separate from our existing Equinix account.

Recently it became apparent that some of those were hosted in Equinix data centers that were due to be shut down at the end of November. After a call today between myself, @richardlau and @bahamat we now have a good understanding of where we are and how to move forward.

To summarise where we currently are: there are two systems hosted on the account managed by Joyent. Both are SmartOS hosts with virtual images inside them. One of these hosts is called nc-backup-01 and contains only one VM - the backup server, which is SmartOS 15.4 and is in the DFW2 data center.

The second host is called nc-compute-01 and it contains all the other systems referenced in #3104 - some are KVM instances, some are SmartOS zones and one is an lx-branded zone. The details and breakdown are as follows:

[root@nc-compute-01 ~]# vmadm list
UUID                                  TYPE  RAM      STATE             ALIAS
0f85685d-0150-4f8f-e211-9ecee63e8b61  KVM   3840     running           test-joyent-ubuntu1604_arm_cross-x64-1
1cf77dcc-8a17-6c35-9132-83f55a8e058f  KVM   3840     running           test-joyent-ubuntu1804-x64-1
49f0a164-4e86-4fda-de73-abcf257587a0  KVM   3840     running           release-joyent-ubuntu1604_arm_cross-x64-1
356655a2-12e6-e1d7-ac7b-b5188ad37cb0  OS    4096     running           test-joyent-smartos20-x64-3
49089cfe-915f-c226-c697-a9faca6041f2  OS    4096     running           release-joyent-smartos20-x64-2
94f76b46-6d20-612c-84e1-92c0dc3bae69  OS    4096     running           release-joyent-smartos18-x64-2
c6e3d47a-1421-ee11-c52d-c3c80c198e95  OS    4096     running           test-joyent-smartos20-x64-4
d894f3c6-d09a-c9df-d7ae-b6f613d9b413  OS    4096     running           test-joyent-smartos18-x64-3
db3664d7-dd31-c233-cafb-df79efb9d069  OS    4096     running           test-joyent-smartos18-x64-4
d357fd3c-a929-cd9c-da35-ad53b53e2875  KVM   7936     running           release-joyent-ubuntu1804_docker-x64-1
feb21098-8101-66f6-f410-bd092952f84e  KVM   16128    running           infra-joyent-debian10-x64-1
12fa9eea-ba7a-4d55-abd9-d32c64ae1965  LX    32768    running           infra-joyent-ubuntu1604-x64-1-new
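
For reference, vmadm's built-in field selection and filtering can narrow that listing down to a single type, e.g. just the KVM guests (output columns chosen arbitrarily here):

[root@nc-compute-01 ~]# vmadm list -o uuid,type,ram,alias type=KVM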

The infrastructure team's public ssh key has now been put onto both SmartOS hosts so that those team members can access the systems. @richardlau and @sxa have also been invited to co-administer the Equinix instance hosting these two in case any recovery of the hosts is required, and, hopefully, so they can be set up to receive notifications.

We explored a few potential options:

  1. Provision a new machine for the SmartOS systems and migrate the others to our existing Equinix account
  2. Provision a new machine and reprovision all the servers from scratch
  3. Provision a new machine and migrate the existing instances across

Given that option 3 was feasible and solves the immediate problem of Equinix shutting down their data centers, we have chosen that one and intend to start migrating the systems tomorrow (evening UTC). @bahamat will handle provisioning the replacement SmartOS host on Equinix and migrating the images across. This will result in an outage on these systems while the migration takes place. The new server will need to be Intel rather than AMD to support SmartOS's KVM implementation.
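
For anyone following along, a rough sketch of what moving one of these zones between SmartOS hosts could look like using vmadm's send/receive is below - illustrative only; the exact mechanism @bahamat uses for the migration may well differ (e.g. a raw zfs send of the zone datasets):

# Stop the zone, then stream its config and datasets to the new SmartOS host
# (destination hostname is a placeholder).
vmadm stop 356655a2-12e6-e1d7-ac7b-b5188ad37cb0
vmadm send 356655a2-12e6-e1d7-ac7b-b5188ad37cb0 | ssh root@new-nc-compute vmadm receive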

We will also aim to rename these machines so they do not have "joyent" in the name, since they are now hosted at Equinix (likely using a new equinix_mnx provider name to indicate that they are hosted separately from our other Equinix systems).
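
As a sketch of that rename (assuming the inventory stays at ansible/inventory.yml and the hostnames keep their current pattern), something like the following would do the bulk of it, though each entry would still want checking by hand:

# Replace the provider segment of the machine names in the Ansible inventory.
sed -i 's/-joyent-/-equinix_mnx-/g' ansible/inventory.yml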

FYI @nodejs/build @nodejs/build-infra @vielmetti

vielmetti commented 1 year ago

Thanks @sxa. The machine roster is here: https://deploy.equinix.com/product/servers/ - you'll want to pick an Intel option.

If you're picking data centers, this capacity dashboard https://deploy.equinix.com/developers/capacity-dashboard/ is helpful.

sxa commented 1 year ago

Thank you - that question about which DC to use did come up in the call and I figured you'd potentially have some pointers on that. The capacity dashboard should be useful for @bahamat.

vielmetti commented 1 year ago

@sxa @bahamat If you avoid the "medium" or "low" capacity data centers you'll be fine.

sxa commented 1 year ago

Update: Most of the migration is complete and the new IPs of the test SmartOS machines have been added so that they can get through the firewall to the CI server, which is now running through the (fairly small) backlog. I've adjusted the comments next to the machines in the firewall configuration to say equinix_mnx instead of joyent to indicate which ones I have changed. I have not done anything for the release machines, which are currently unused. One of the infra- machines is quite large (3.2 TB) so it will take a while to transfer between the data centers.

NOTE: While adding the firewall rules I spotted some entries for what were presumably old machines that no longer exist, as they are not in our CI. I have created https://github.com/nodejs/build/issues/3109 to clean those entries up later.

richardlau commented 1 year ago

I have not done anything for the release machines, which are currently unused.

That's only true for the smartos release machines. We're using the ubuntu1804 docker container to cross-compile for armv7l -- I've updated the firewall on ci-release, but that machine is one of those yet to be migrated.

richardlau commented 1 year ago

It looks like the smartos builds are currently broken :disappointed: e.g. https://ci.nodejs.org/job/node-test-commit-smartos/nodes=smartos20-64/46914/console

10:23:41 ../deps/v8/src/base/platform/platform-posix.cc:80:16: error: conflicting declaration of C function 'int madvise(caddr_t, std::size_t, int)'
10:23:42    80 | extern "C" int madvise(caddr_t, size_t, int);
10:23:42       |                ^~~~~~~
10:23:42 In file included from ../deps/v8/src/base/platform/platform-posix.cc:18:
10:23:42 /usr/include/sys/mman.h:268:12: note: previous declaration 'int madvise(void*, std::size_t, int)'
10:23:42   268 | extern int madvise(void *, size_t, int);
10:23:42       |            ^~~~~~~
10:23:42 make[2]: *** [tools/v8_gypfiles/v8_libbase.target.mk:177: /home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos20-64/out/Release/obj.target/v8_libbase/deps/v8/src/base/platform/platform-posix.o] Error 1
10:23:42 make[2]: *** Waiting for unfinished jobs....

The smartos18 machines are also failing in a similar fashion, for the Node.js 14 builds as well as the builds for the main branch and pull requests.

sxa commented 1 year ago

Looks like this is a result of the SmartOS upgrade on the host (global zone) and the fact that all of the SmartOS local zones inherit /usr, which now has a modified /usr/include/sys/mman.h. The old one had:

extern int madvise(caddr_t, size_t, int);

and then later on:

#if !defined(__XOPEN_OR_POSIX) || defined(_XPG6) || defined(__EXTENSIONS__)
extern int posix_madvise(void *, size_t, int);
#endif

But the new one has:

#if !defined(_STRICT_POSIX) || defined(_XPG6)
extern int posix_madvise(void *, size_t, int);
#endif

and later:

#if !defined(_STRICT_POSIX)
extern int mincore(caddr_t, size_t, char *);
extern int memcntl(void *, size_t, int, void *, int, int);
extern int madvise(void *, size_t, int);
[...]

So we've lost the madvise declaration with caddr_t as the first parameter, which is almost certainly causing this compile failure. This may need a V8 patch unless we tweak the header files on the host to get around the immediate problem. I shall leave it in this state pending advice from @bahamat.
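
For anyone who wants to check a given zone quickly, grepping the inherited header (path as in the compile error above) shows which prototypes are currently in effect:

grep -n 'int madvise' /usr/include/sys/mman.h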

sxa commented 1 year ago

FYI full diff of mman.h between the old and new systems: mmem.h.diff.txt.gz

richardlau commented 1 year ago

So we've lost the madvise declaration with caddr_t as the first parameter, which is almost certainly causing this compile failure. This may need a V8 patch unless we tweak the header files on the host to get around the immediate problem. I shall leave it in this state pending advice from @bahamat.

It looks like this is a V8 issue where madvise is being declared specifically for V8_OS_SOLARIS: https://github.com/v8/v8/blob/458cda96fe5db5bded922caa80ed304ad8be2a72/src/base/platform/platform-posix.cc#L78-L84

#if defined(V8_OS_SOLARIS)
#if (defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE > 2) || defined(__EXTENSIONS__)
extern "C" int madvise(caddr_t, size_t, int);
#else
extern int madvise(caddr_t, size_t, int);
#endif
#endif

There was a PR opened on V8's GitHub mirror for this but it was closed and it looks like it was not upstreamed: https://github.com/v8/v8/pull/37

sxa commented 1 year ago

The SmartOS host has been put back to the old version in order to get the 'known good' headers in the global zone, and therefore in the inherited /usr in the local zones. Investigating whether the patch can be backported to all relevant V8 release lines can be done separately.

richardlau commented 1 year ago

I've updated the DNS entries for grafana and unencrypted in Cloudflare with the new IP addresses (unencrypted is still being migrated and is therefore currently down, but DNS now points to its new address).
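
A quick way to sanity-check the records once they propagate (the exact public hostnames here are my guess, so adjust as needed):

dig +short grafana.nodejs.org
dig +short unencrypted.nodejs.org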

bahamat commented 1 year ago

This migration is complete. I don't have access to Jenkins (or at least don't know where it is) so I can't confirm all the nodes are connected.

If someone can verify this for me, then I think we can resolve this issue.

richardlau commented 1 year ago

I believe all the Jenkins agents have reconnected. I've opened a PR to update the IP addresses in the Ansible inventory, and I've also renamed the test machines from "joyent" to "equinix_mnx". We probably want to rename the release and infra machines as well, but I'm not going to have time to do that this year (today is my last working day of 2022).
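
For whoever picks up the remaining renames: once the inventory PR lands, something like this should confirm the renamed hosts are still reachable (the host pattern is illustrative, not the exact inventory naming):

ansible -i ansible/inventory.yml 'test-equinix_mnx-*' -m ping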

Have we confirmed with Equinix whether the nc-backup-01 host is okay where it is or whether it also needs to migrate? According to the web console it's in DA - DFW2, and I thought all the three-letter data facilities were supposed to be closed at the end of November.

bahamat commented 1 year ago

I’ll see what I can find out.

bahamat commented 1 year ago

Confirmed, dfw2 is also shutting down. I'll get nc-backup-01 migrated to da11.

mhdawson commented 1 year ago

@bahamat what is the old IP of nc-backup-01? I think it would be 139.178.83.227 and I want to confirm which machine it corresponds to. If it is 139.178.83.227 then that machine, I believe, mostly pulls from other machines to do backups.

In that case we may not need to configure IPs other than updating them in the inventory, but it might affect known_hosts on the machines it connects to. We'd want to validate that it can still connect to the other machines after the move.
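
If a stale known_hosts entry does cause trouble after the move, it can be cleared on whichever machine reports the mismatch (target is a placeholder):

ssh-keygen -R <target-host-or-ip>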

richardlau commented 1 year ago

@mhdawson I believe nc-backup-01 is backup (139.178.83.227) based on earlier discussions with @bahamat and @sxa and I also believe you are correct that the machine is pulling from other machines.

In that case we may not need to configure IPs other than updating them in the inventory, but it might affect known_hosts on the machines it connects to. We'd want to validate that it can still connect to the other machines after the move.

We will also need to update the firewall on the download server (www/direct.nodejs.org) so that backup can connect to it.

bahamat commented 1 year ago

Yeah, that's the right IP.

In that case, I'll do the final sync and start up the new one. I'll let you know when it's up.

bahamat commented 1 year ago

OK, all finished. The new IPs are:

The old one is stopped but not destroyed yet. Once you can confirm that the new one is working as intended, I'll destroy the old one.

mhdawson commented 1 year ago

@richardlau I think you mentioned you were going to look at this?

richardlau commented 1 year ago

I've added 147.28.183.83 to the firewall on the www server (so the backup machine can rsync to it). However, I don't seem to be able to ssh into 147.28.183.83 to verify whether the backups are running.
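
For the record, the firewall change on the www server amounts to something like this (assuming iptables and rsync over ssh; the actual rules on that server may be managed differently, e.g. via Ansible, so treat this as a sketch):

# Allow the new backup machine to reach the download server over ssh for rsync.
iptables -A INPUT -p tcp -s 147.28.183.83 --dport 22 -j ACCEPT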

richardlau commented 1 year ago

Thanks to @bahamat for fixing the network interfaces on the new backup machine. I've been able to log into it and AFAICT the backups ran, so I think we're good with the replacement.

vielmetti commented 1 year ago

@richardlau @bahamat - following up here, is this work completed to the point where this issue can be closed?

bahamat commented 1 year ago

Yes, I think so.

vielmetti commented 1 year ago

@bahamat I see that the last old s1 storage system was "powered off" last week - if you are completely ready then you (or I) can "delete" that system and we'll really be done.

bahamat commented 1 year ago

OK, all set. f88f2ae4-52dd-4613-b123-262d47bf5d2c | nc-backup-01 has been deleted.

vielmetti commented 1 year ago

Confirmed! Thanks @bahamat - I think this issue can be closed out.