mtelvers closed this issue 1 year ago.
Deleted; there were lots of NVMe errors in the log. How's the cluster capacity at the moment? I wonder if we should just release all those old x86-bm machines and provision a new set at Scaleway with the next-gen ones. That would also allow us to get them onto the renewable-energy machines.
@mtelvers also, is 51.159.31.39 an orphaned Scaleway machine that can be cleaned up?
It looks like it might have the OCurrent deployer hooked up to it, but I can't find a DNS entry for that machine.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
64430bd6f825d114e249ae3348cfb704828a5836a5f2e01a77d3b60568ba9830 ocurrent/v3.ocaml.org-server@sha256:81dfc89e2709bb0b7c4f5c8a03c20842bb8cdd0422b788fa8352428253edfab8 "/bin/sh -c /bin/server" 5 weeks ago Up 5 weeks 8080/tcp infra_www.1.wo0upe1dn9s1005ei24rfydoz
25017fb9684ae9ffe30a87434bd39ff7ec38586096d51a8fe0f82da03a01d116 caddy:latest@sha256:607a656e512737b5c4e7f7254dd49e1688b721915fb9ad1611812a12361d7d69 "caddy run --config /etc/caddy/Caddyfile --adapter caddyfile" 5 months ago Up 5 months 80/tcp, 443/tcp, 2019/tcp infra_caddy.1.hoaougdsyr5p93t0919fjot7s
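The "does this host still have a DNS entry?" check can be scripted; here is a minimal sketch in Python, assuming an ordinary forward lookup is a good enough signal (the hostname used below is just a placeholder, not an actual cluster record):

```python
import socket

def has_dns(hostname: str) -> bool:
    """Return True if the hostname resolves to an address, False otherwise."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        # No A record (or resolution failed) -- treat as no DNS entry.
        return False

# An orphaned machine's old name would return False once its record is gone.
print(has_dns("localhost"))
```

A reverse lookup (`socket.gethostbyaddr` on the IP) would be the complementary check when starting from an address like 51.159.31.39.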
@avsm 51.159.31.39 certainly looks to be orphaned. I can't see a DNS entry for it, either. From /etc/caddy/Caddyfile, it was once staging.ocaml.org. staging.ocaml.org now points to 51.159.190.183, so I would advocate cleaning it up.
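For reference, the kind of vhost entry that would have pointed traffic at the old machine looks roughly like this in a Caddyfile (the upstream name here is hypothetical, not taken from the actual config):

```caddyfile
staging.ocaml.org {
	# Hypothetical upstream; the real Caddyfile proxies to the www container.
	reverse_proxy old-staging-backend:8080
}
```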
The cluster capacity erodes over time as the x86-bm machines fail, and the Equinix migration from m1 to c2 reduced it further. The x86-bm machines have served us well, but newer next-gen hardware and renewable energy are good incentives to swap them out.
I note that x86-bm-8.ocamllabs.io isn't part of the cluster. I don't know what it is, as I can't sign in.
Looks like there's no equivalent to the bm machines in the new Scaleway lineup. A 32-core baremetal instance is 5x the price, so that's not an option. I'm provisioning a test 20-thread/10-core one to see how it performs; stay tuned...
@mtelvers try out ubuntu@x86-bm-b1.ocamllabs.io in the pool. Would be good to know what its capacity/performance is vs the existing Scaleway x86-bm* machines, as it's a newer one but more expensive.
I have set up the machine as a worker with an initial capacity of 24 -- the same as the Scaleway machines -- and added it to the pool.
The disk configuration wasn't ideal for our purposes, as the root ext4 file system was set up as a software RAID1. As a workaround, I have ejected /dev/sdb4 from the mirror and converted it to BTRFS, allowing us to test, but it's a bit untidy.
Sounds good -- the default configuration is whatever Scaleway set up on their new Elastic Metal service, so your change brings the machine back to our requirements (no redundancy needed on storage, more space for the builds). When you get a rough idea of whether these machines are performing OK, I can provision a few more of them and we can rotate out the other x86-bm machines as their SSDs continue to fail.
24 jobs seemed well within the capability of the machine. I have increased it to 36.
That's encouraging. It does look like the VMs don't like burst CPU usage (as builds will do), so these new bare metal ones will probably be more efficient overall.
36 was too high. 30 was ok. Testing at 32.
~30 is about the right figure.
@avsm x86-bm-1 has now failed, presumably with NVMe issues. Please can you take a look and reset/delete as necessary?
Sounds like it's time to provision a bunch of the newer machines... I'll take a look this week.
@mtelvers I've gone ahead and provisioned more of the elastic metal machines, as I think it's time to decommission the remaining older x86-bm-* VMs. They're going to steadily start failing more rapidly, so might as well get ahead of the game.
I've started putting these under the .sw.ocaml.org subdomain to indicate that they are Scaleway machines. It's time to start migrating the hostnames to the ocaml.org domain, as they're seeing production use in the CI nowadays.
Once these are in the cluster and happy, feel free to start decommissioning the old VMs and I'll clean them up. If you need a few more x86 EM-B212X-SSD hosts, those can be provisioned easily.
@avsm Thank you very much for the new machines.
b2, c1 and c3 have been added to the cluster; however, c2 does not appear to have a DNS entry.
I will now remove the old machines: 3, 4, 6, 7, 9, 10 and 12.
Thanks, c2 is now also in DNS (I missed the sw suffix). How's the cluster load looking without the old VMs? I'll go ahead and delete those tomorrow if the new ones are running smoothly.
Thanks. I have added in c2. It has been an atypically quiet day on the cluster, so it's difficult to say. The headline figures are that we have removed a capacity of 192 concurrent jobs and added back a capacity of 150 concurrent jobs, but that doesn't take into account that this is much newer hardware.
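The headline figures work out as follows. The per-machine numbers come from earlier in the thread (24 jobs per old worker, ~30 per new one); the machine counts below are one reading consistent with the totals, not an official tally:

```python
# Rough capacity accounting for the hardware swap.
OLD_PER_MACHINE = 24  # initial capacity used on the old Scaleway workers
NEW_PER_MACHINE = 30  # settled-on capacity for the new Elastic Metal workers

removed = 8 * OLD_PER_MACHINE  # assuming 8 old workers rotated out
added = 5 * NEW_PER_MACHINE    # assuming 5 new workers in service

print(f"removed: {removed} concurrent jobs")  # removed: 192 concurrent jobs
print(f"added:   {added} concurrent jobs")    # added:   150 concurrent jobs
```

The raw totals drop by 42 concurrent jobs, though the newer hardware should finish individual builds faster, so throughput is harder to compare directly.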
@mtelvers I've provisioned two more of the new bare metal machines with the same spec, as they are becoming more available now on Scaleway: x86-bm-c4.sw.ocaml.org and x86-bm-c5.sw.ocaml.org. If you could rotate those into the build pool, that should help with the backlog handling in the very short term. Let me know if you think we should add a few more.
> I will now remove the old machines: 3, 4, 6, 7, 9, 10 and 12.

These are all powered off, will delete them later.
@avsm Thank you. I have brought c4 and c5 into service. This is greatly appreciated, as opam-repo-ci is keeping the linux-x86_64 pool very busy at the moment!
@mtelvers given the backlog, I've stuck some more resources into the cluster and added x86-bm-c6 and x86-bm-c7 with the same specs. Please feel free to add those into the cluster as well, and let me know in a new ticket if they're being underutilised so I can free them up in the future.
I've cleaned up the old machines now so we're exclusively on the new hardware, so this ticket is closed.
CI worker x86-bm-5.ocamllabs.io is no longer responding to SSH. Please can it be reset? I suspect another NVMe failure.