ocaml / infrastructure

Wiki to hold the information about the machine resources available to OCaml.org

Scaleway Elastic Metal machine refresh (was: x86-bm-5.ocamllabs.io) #21

Closed. mtelvers closed this issue 1 year ago.

mtelvers commented 1 year ago

CI worker x86-bm-5.ocamllabs.io is no longer responding to SSH. Please can it be reset? I suspect another NVMe failure.

avsm commented 1 year ago

Deleted, lots of NVMe errors on the log. How's the cluster capacity at the moment? Wonder if we should just release all those old x86-bm machines and provision a new set now at Scaleway with the nextgen ones. Would also allow us to get them onto the renewable energy machines as well.
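For future triage, a quick way to confirm a suspected NVMe failure before asking for a reset is to check the kernel log and the drive's SMART data. This is a generic sketch (the device name /dev/nvme0 is illustrative, and it assumes smartmontools and nvme-cli are installed on the host):

# kernel messages mentioning the NVMe subsystem
dmesg -T | grep -i nvme

# SMART health summary for the first NVMe drive (smartmontools)
smartctl -a /dev/nvme0

# or the same information via nvme-cli
nvme smart-log /dev/nvme0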

avsm commented 1 year ago

@mtelvers also, is 51.159.31.39 an orphaned Scaleway machine that can be cleaned up?

It looks like it might have the ocurrent deployer hooked up to it, but I can't find a DNS entry for that machine.

CONTAINER ID                                                       IMAGE                                                                                                  COMMAND                                                         CREATED        STATUS        PORTS                       NAMES
64430bd6f825d114e249ae3348cfb704828a5836a5f2e01a77d3b60568ba9830   ocurrent/v3.ocaml.org-server@sha256:81dfc89e2709bb0b7c4f5c8a03c20842bb8cdd0422b788fa8352428253edfab8   "/bin/sh -c /bin/server"                                        5 weeks ago    Up 5 weeks    8080/tcp                    infra_www.1.wo0upe1dn9s1005ei24rfydoz
25017fb9684ae9ffe30a87434bd39ff7ec38586096d51a8fe0f82da03a01d116   caddy:latest@sha256:607a656e512737b5c4e7f7254dd49e1688b721915fb9ad1611812a12361d7d69                   "caddy run --config /etc/caddy/Caddyfile --adapter caddyfile"   5 months ago   Up 5 months   80/tcp, 443/tcp, 2019/tcp   infra_caddy.1.hoaougdsyr5p93t0919fjot7s
mtelvers commented 1 year ago

@avsm 51.159.31.39 certainly looks to be orphaned. I can't see a DNS entry for it, either. From /etc/caddy/Caddyfile, it was once staging.ocaml.org. staging.ocaml.org now points to 51.159.190.183, so I would advocate cleaning it up.
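For completeness, the check amounts to a forward lookup of the hostname and a reverse lookup of the suspect address; something along these lines (dig is standard, and the addresses are the ones quoted above):

# what staging.ocaml.org resolves to now
dig +short staging.ocaml.org

# reverse lookup for the suspect address
dig +short -x 51.159.31.39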

The cluster capacity erodes over time as the x86-bm machines fail, and the Equinix migration from m1 to c2 reduced it further. The x86-bm machines have served us well, but newer next-gen hardware and renewable energy are good incentives to swap them out.

I note that x86-bm-8.ocamllabs.io isn't part of the cluster. I don't know what it is, as I can't sign in.

avsm commented 1 year ago

Looks like there's no equivalent to the bm machines in the new Scaleway lineup. A 32-core baremetal instance is 5x the price, so that's not an option. I'm provisioning a test 20-thread/10-core one to see how it performs; stay tuned...

avsm commented 1 year ago

@mtelvers try out ubuntu@x86-bm-b1.ocamllabs.io in the pool. Would be good to know what its capacity/performance is vs the existing Scaleway x86-bm* machines, as it's a newer one but more expensive.

mtelvers commented 1 year ago

I have set up the machine as a worker with an initial capacity of 24 -- the same as the existing Scaleway x86-bm machines -- and added it to the pool.
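For the record, registering a machine as a worker boils down to running the OCluster worker agent with the pool's submission capability and a parallelism limit. A rough sketch of the invocation (the flag names are from memory of ocurrent/ocluster, and the capability path, worker name and store path are illustrative, so treat this as indicative rather than exact):

# join the linux-x86_64 pool with 24 concurrent build slots
ocluster-worker \
  --connect ~/pool-linux-x86_64.cap \
  --name x86-bm-b1 \
  --capacity 24 \
  --state-dir /var/lib/ocluster-worker \
  --obuilder-store btrfs:/var/cache/obuilder

Raising or lowering the capacity later is then just a matter of restarting the worker with a different --capacity value.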

The disk configuration wasn't ideal for our purposes as the root ext4 file system was set up as a software RAID1. As a workaround, I have ejected /dev/sdb4 from the mirror and converted it to BTRFS, allowing us to test, but it's a bit untidy.
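The workaround was roughly the following sequence (the md device name and mount point are illustrative; /dev/sdb4 is the partition mentioned above):

# drop one half of the RAID1 mirror and forget its RAID metadata
mdadm /dev/md1 --fail /dev/sdb4 --remove /dev/sdb4
mdadm --zero-superblock /dev/sdb4

# reformat the freed partition as btrfs and mount it for the build store
mkfs.btrfs -f /dev/sdb4
mkdir -p /var/cache/obuilder
mount /dev/sdb4 /var/cache/obuilder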

avsm commented 1 year ago

Sounds good -- the default configuration is whatever Scaleway set up on their new Elastic Metal service, so your change is for the better, bringing the machine in line with our requirements (no redundancy required on storage, more space for the builds). When you get a rough idea of whether these machines are performing OK, I can provision a few more of them and we can rotate out the other x86-bm machines as their SSDs continue to fail.

mtelvers commented 1 year ago

24 jobs seemed well within the capability of the machine. I have increased it to 36.

avsm commented 1 year ago

That's encouraging. It does look like the VMs don't handle bursty CPU usage well (which builds tend to produce), so these new bare metal ones will probably be more efficient overall.

mtelvers commented 1 year ago

36 was too high. 30 was ok. Testing at 32.

mtelvers commented 1 year ago

~30 is the right figure.

mtelvers commented 1 year ago

@avsm x86-bm-1 has now failed, presumably with NVMe issues. Please can you take a look and reset/delete as necessary?

avsm commented 1 year ago

Sounds like it's time to provision a bunch of the newer machines... I'll take a look this week.

avsm commented 1 year ago

@mtelvers I've gone ahead and provisioned more of the Elastic Metal machines, as I think it's time to decommission the remaining older x86-bm-* VMs. They're only going to fail more frequently, so we might as well get ahead of the game.

I've started putting these under .sw.ocaml.org to indicate they are at Scaleway. It's time to start migrating the hostnames to the ocaml.org domain, as they're seeing production use in the CI these days.

Once these are in the cluster and happy, feel free to start decommissioning the old VMs and I'll clean them up. If you need a few more x86 EM-B212X-SSD hosts, those can be provisioned easily.

mtelvers commented 1 year ago

@avsm Thank you very much for the new machines.

b2, c1 and c3 have been added to the cluster; however, c2 does not appear to have a DNS entry.

I will now remove the old machines: 3, 4, 6, 7, 9, 10 and 12.

avsm commented 1 year ago

Thanks, c2 is now also in DNS (I missed the sw postfix). How's the cluster load looking without the old VMs? I'll go ahead and delete those tomorrow if the new ones are running smoothly.

mtelvers commented 1 year ago

Thanks. I have added c2. It has been an atypically quiet day on the cluster, so it's difficult to say. The headline figures are that we have removed a capacity of 192 concurrent jobs and added back a capacity of 150 concurrent jobs, but that doesn't take into account that the new capacity is on much newer hardware.

avsm commented 1 year ago

@mtelvers I've provisioned two more of the new bare metal machines, as they are becoming more available on Scaleway now, with the same spec: x86-bm-c4.sw.ocaml.org and x86-bm-c5.sw.ocaml.org. If you could rotate those into the build pool, that should help with the backlog in the very short term. Let me know if you think we should add a few more.

> I will now remove the old machines: 3, 4, 6, 7, 9, 10 and 12.

These are all powered off; I will delete them later.

mtelvers commented 1 year ago

@avsm Thank you. I have brought c4 and c5 into service. This is greatly appreciated as opam-repo-ci is keeping the linux-x86_64 pool very busy at the moment!
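For anyone watching the backlog, the pool queue can be inspected from the scheduler side with the OCluster admin client; something along these lines (the subcommand and capability path are from memory of ocurrent/ocluster, so treat it as a sketch rather than an exact recipe):

# show queued jobs and connected workers for the busy pool
ocluster-admin --connect ~/admin.cap show linux-x86_64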

avsm commented 1 year ago

@mtelvers given the backlog, I've put some more resources into the cluster and added x86-bm-c6 and x86-bm-c7 with the same specs. Please feel free to add those to the pool as well, and let me know in a new ticket if they're being underutilised so I can free them in the future.

I've now cleaned up the old machines, so we're exclusively on the new hardware. Closing this ticket.