nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Move 4 A100 SXM4 GPU nodes out of prod and make them available for RHELAI #595

Closed (joachimweyl closed this issue 2 weeks ago)

joachimweyl commented 1 month ago

Motivation

RHELAI needs 16 GPUs for testing for 1 month.

Completion Criteria

4 Lenovo GPU nodes usable by RHELAI team.

Description

Completion dates

Desired: 2024-05-31
Required: TBD

larsks commented 1 month ago

(see also #594)

hpdempsey commented 1 month ago

@joachimweyl this should be the highest priority task for MOC until it is finished. @tzumainn and @larsks have it as their highest priority task. We should be able to get Jeremy's folks using the GPUs this week if we're coordinated on priorities.

hpdempsey commented 1 month ago

Completion date on this ticket is not correct. Jeremy should have access by 5/31. Schedule is critical because this testing is part of releasing RHELAI for GA, and every day counts. It also is on our path to getting an open system for AI/ML set up in MOC.

joachimweyl commented 1 month ago

@jtriley what is required to remove these from production?

larsks commented 1 month ago

It looks like these are our A100 nodes:

$ k get node -l nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB
NAME      STATUS   ROLES    AGE   VERSION
wrk-101   Ready    worker   63d   v1.26.7+c7ee51f
wrk-90    Ready    worker   72d   v1.26.7+c7ee51f
wrk-91    Ready    worker   78d   v1.26.7+c7ee51f
wrk-92    Ready    worker   78d   v1.26.7+c7ee51f
wrk-93    Ready    worker   78d   v1.26.7+c7ee51f
wrk-94    Ready    worker   78d   v1.26.7+c7ee51f
wrk-95    Ready    worker   78d   v1.26.7+c7ee51f
wrk-96    Ready    worker   78d   v1.26.7+c7ee51f
wrk-97    Ready    worker   78d   v1.26.7+c7ee51f
wrk-98    Ready    worker   78d   v1.26.7+c7ee51f
wrk-99    Ready    worker   15d   v1.26.7+c7ee51f

@jtriley, I'd like to start by cordoning and draining nodes wrk-{90,91,92,93} and then deleting them from the cluster. Let me know if I should go ahead.
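
(For reference, that sequence maps to roughly the following commands, run per node; the drain flags shown here are the ones mentioned later in this thread.)

$ oc adm cordon wrk-90                                             # stop new pods from being scheduled here
$ oc adm drain wrk-90 --ignore-daemonsets --delete-emptydir-data   # evict existing pods
$ oc delete node wrk-90                                            # remove the node from the cluster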

jtriley commented 1 month ago

@larsks That's fine; we just need to be careful about user workloads that might be using emptyDir storage on those hosts, as that data will be deleted.

jtriley commented 1 month ago

@larsks I'm taking a quick peek at oc adm drain --dry-run=server for those hosts now to see what would need to be cleaned up.
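
(A sketch of that check for one of the nodes; the exact flags jtriley used aren't shown in the thread.)

$ oc adm drain wrk-90 --dry-run=server --ignore-daemonsets
# server-side dry run: reports which pods would be evicted, and any blockers, without evicting anything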

jtriley commented 1 month ago

Those hosts currently have the following user workloads using emptyDir:

rhods-notebooks/jupyter-nb-sqasim-40bu-2eedu-0
ai4cloudops-f7f10d9/url-shorten-mongodb-756c75cfb-6c6k7
ai4cloudops-f7f10d9/user-mongodb-dbc858894-hk9w7
sail-24887a/wage-gap-calculator-mongo-bb6c9999b-tz9nl
gis-data-science-big-data-projects-at-cga-b231ed/postgres-postgres-rcqx-0
scsj-86c3ca/postgres-repo-host-0
smart-village-faeeb6c/postgres-repo-host-0

In the past we've reached out to folks to confirm those pods are safe to delete. I'm looking at the other GPU nodes to see if there are any that don't have this issue.

larsks commented 1 month ago

They all have some user workloads on them. Here's a count of pods/node for the GPU nodes:

$ kubectl get pods -A -o jsonpath='{range .items[?(@.spec.nodeName)]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c | sort -rn | grep -E 'wrk-9[0-9]|wrk-101'
   1635 wrk-91
   1221 wrk-98
   1021 wrk-95
    265 wrk-92
    102 wrk-99
     60 wrk-94
     58 wrk-97
     45 wrk-93
     44 wrk-96
     44 wrk-90
     41 wrk-101

hpdempsey commented 1 month ago

The smart-village, rhods-notebooks, and sail workloads should all be able to be shut down right away. The others look like research projects whose owners should be contacted to let them know we're going to shut their pods down and that they need to move to a different machine.

jtriley commented 1 month ago

I've cordoned wrk-9[0-3] as requested. This is the final list of user pods using emptyDir after cordoning:

ai4cloudops-f7f10d9/llm-for-traces-0
rhods-notebooks/jupyter-nb-sqasim-40bu-2eedu-0
ai4cloudops-f7f10d9/url-shorten-mongodb-756c75cfb-6c6k7
ai4cloudops-f7f10d9/user-mongodb-dbc858894-hk9w7
sail-24887a/wage-gap-calculator-mongo-bb6c9999b-tz9nl
gis-data-science-big-data-projects-at-cga-b231ed/postgres-postgres-rcqx-0
scsj-86c3ca/postgres-repo-host-0
smart-village-faeeb6c/postgres-repo-host-0

@Milstein can we please reach out to those users and notify them that their pods need to be deleted from the host they're currently on? Please point out that those pods are using some data directories that will not be saved/persisted after this happens. Also see @hpdempsey's comment above about the list and who needs contacting: https://github.com/nerc-project/operations/issues/595#issuecomment-2140554360
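
(For reference, a query along these lines reproduces that list; the exact command used isn't shown in the thread, and system pods would still need to be filtered out by hand.)

$ kubectl get pods -A -o json | jq -r '.items[]
    | select(.spec.nodeName // "" | test("^wrk-9[0-3]$"))
    | select(any(.spec.volumes[]?; .emptyDir != null))
    | "\(.metadata.namespace)/\(.metadata.name)"'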

computate commented 1 month ago

@jtriley @hpdempsey it's not a problem if you interrupt the smartvillage postgres pod.

Milstein commented 1 month ago

@jtriley: if the pod is deleted, is it automatically re-created on a new wrk node?

larsks commented 1 month ago

@Milstein If a pod has a controller -- like a deployment, statefulset, etc. -- it will get re-created on another node. If it's just a raw pod with no controller, it will not.
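
(One way to check this, sketched with placeholder pod and namespace names: a pod with no ownerReferences is a bare pod and will not come back on its own.)

$ kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.ownerReferences[*].kind}'
# prints e.g. "StatefulSet" or "ReplicaSet"; empty output means the pod has no controller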

Milstein commented 1 month ago

@larsks: some of these pods have lots of postgres data (PVs and PVCs). I wonder if those will be retained along with the pod, or whether users need to back up first?

larsks commented 1 month ago

@Milstein If the postgres data is on a volume, the volume will persist. If the data is not on a volume and the pod is deleted, the data will be lost.
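
(To see which of a pod's volumes are backed by a PVC versus emptyDir, something like this works; the pod and namespace names are placeholders.)

$ kubectl get pod <pod-name> -n <namespace> \
    -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.persistentVolumeClaim.claimName}{"\n"}{end}'
# volumes that print a claim name persist independently of the pod; emptyDir volumes (blank claim name) are lost when the pod is deleted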

computate commented 1 month ago

Also ok to interrupt scsj-86c3ca/postgres-repo-host-0.

larsks commented 1 month ago

I am deleting pods on these nodes for the projects we've already identified as interruptible:

Milstein commented 1 month ago

Also, please delete these pods:

rhods-notebooks/jupyter-nb-sqasim-40bu-2eedu-0
ai4cloudops-f7f10d9/llm-for-traces-0
gis-data-science-big-data-projects-at-cga-b231ed/postgres-postgres-rcqx-0

DanNiESh commented 1 month ago

Notebook jupyter-nb-sqasim-40bu-2eedu deleted

Milstein commented 1 month ago

I have just notified the ai4cloudops team. We need to wait until we hear back from them.

larsks commented 1 month ago

It looks like we can drain all the nodes except for wrk-91, which has some ai4cloudops workloads that are using emptyDir volumes.

larsks commented 1 month ago

I have drained nodes wrk-{90,92,93}.

Milstein commented 1 month ago

Just got confirmation that ai4cloudops pods can also be restarted!

larsks commented 1 month ago

All the nodes have been drained. @jtriley I'd like to go ahead and remove these nodes from the cluster (oc delete node...).

jtriley commented 1 month ago

@larsks just ran the drain again with --ignore-daemonsets --delete-emptydir-data. I will go ahead and delete them and ping @hakasapl to flip the ports to ESI.

larsks commented 1 month ago

just ran the drain again with --ignore-daemonsets --delete-emptydir-data

@jtriley yes, that is how I ran it as well.

jtriley commented 1 month ago

All of those hosts have been drained, deleted, and powered off. @hakasapl we can move these hosts' ports to ESI.

jtriley commented 1 month ago

just ran the drain again with --ignore-daemonsets --delete-emptydir-data

@jtriley yes, that is how I ran it as well.

No worries, I just noticed that re-running the drain as a drill was still taking action and complaining 🤷

hpdempsey commented 1 month ago

Hakan plans to configure the 4 servers for ESI on 5/31 afternoon.

hpdempsey commented 1 month ago

Hakan remembered the requisite cable was already installed, so he configured the servers on the evening of 5/30. @tzumainn added the nodes to ESI, where they passed inspection and provisioning, and has leased three of the 4 nodes to RHELAI. The nodes had to boot in legacy mode instead of UEFI mode to be leasable. Mainn is experimenting with the 4th node, which is in UEFI mode, to see if he can get it to work.

hpdempsey commented 1 month ago

@larsks please verify that the 3 legacy nodes work for Jeremy's general use case, and if so, you/Mainn walk Jeremy through accessing them per https://massopencloud.slack.com/archives/C027TDE52TZ/p1717106255073829?thread_ts=1717088865.659369&cid=C027TDE52TZ . Jeremy has his NERC account, but we haven't asked him to access the ESI dashboard yet. The directions talk about an MGHPCC account, which will be confusing, since the first account he applied for is called a NERC account.

hpdempsey commented 1 month ago

@hakasapl reconfigured the servers for ESI on 5/30 (the cable was already in place). @tzumainn put them in ESI and they passed the inspection and provisioning tests. @tzumainn leased three of the 4 to RHELAI. However, they had to boot in legacy mode, not UEFI, for this to work. Mainn is still experimenting with one node in UEFI mode.

hpdempsey commented 1 month ago

@larsks please check out the 3 legacy nodes for what you know of Jeremy's general use case. Lars/Mainn walk Jeremy through access following https://massopencloud.slack.com/archives/C027TDE52TZ/p1717106255073829?thread_ts=1717088865.659369&cid=C027TDE52TZ . The directions talk about an MGHPCC account, which will be confusing, because Jeremy applied for a NERC account first. (We probably need to add to the directions when we have time, because we don't ask users to get an MGHPCC account now -- even though they could.)

larsks commented 1 month ago

We had to drop ESI because of some conflicts between ESI and UEFI booting (and between legacy boot mode and system stability). We have attached all the nodes to a public network and set up a bastion host for access to the BMCs.

The bastion host is rhelai-gw.int.massopen.cloud. Admin access is via the cloud-user account, to which I have added ssh keys for @larsks, @naved001, and @hakasapl.

Here's the email I sent to Jeremy:

Jeremy,

There are now four GPU nodes available for your use.

We have configured a bastion host at rhelai-gw.int.massopen.cloud to
provide access to the bmcs of the gpu nodes. You can log in as jeremyeder
using the private key associated with https://github.com/jeremyeder.keys.

The BMC addresses are in the /etc/hosts file:

    10.2.18.127 gpu0-bmc
    10.2.18.128 gpu1-bmc
    10.2.18.130 gpu2-bmc
    10.2.18.131 gpu3-bmc

You will find credentials for these devices in the bmc-credentials.txt
file in your home directory.

We have allocated four IP addresses for you on the 129.10.5.0/24
network:

- 129.10.5.160
- 129.10.5.161
- 129.10.5.162
- 129.10.5.163

The default gateway for this network is 129.10.5.1. You will need to
assign these addresses manually; there is no DHCP service available.
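
(Not part of the original email: a static assignment on one of the nodes would look roughly like this; the interface name is a placeholder, and these settings are not persistent across reboots.)

    ip addr add 129.10.5.160/24 dev <interface>
    ip route add default via 129.10.5.1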

You have a few options for accessing the bmcs from your local machine.
This is what I do:

1. Establish a SOCKS proxy when connecting to rhelai-gw:

    ssh -D1080 jeremyeder@rhelai-gw.int.massopen.cloud

2. Configure your browser to use the SOCKS proxy at localhost:1080 (the
   port set by -D above). I use Proxy SwitchyOmega [1] in Chrome and
   FoxyProxy [2] in Firefox.

Some of my colleagues prefer to use sshuttle [3] instead.

Once you have established a connection:

3. Log in to the bmc using the credentials from bmc-credentials.txt.

4. Select the "Remote Console" menu option from the left navigation
   bar.

5. Select the "Launch console" button at the top of the following
   screen. You will need to enable popups for this to work.

6. Once the console is open (I had to switch browsers to get this to
   work), click the "Media" button at the top of the page to access
   the virtual media functions.

7. Click "Activate" to enable virtual media features, then "Browse" to
   select an installer image, and finally "Mount all local media" to
   mount the media on the remote system.

8. At this point you should be able to power on the system using the
   "Power" button at the top of the screen and have it boot into your
   installer.

Let me know if you have any questions.

[1]: https://chromewebstore.google.com/detail/proxy-switchyomega/padekgcemlokbadohgkifijomclgjgif
[2]: https://addons.mozilla.org/en-US/firefox/addon/foxyproxy-standard/
[3]: https://github.com/sshuttle/sshuttle
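
(As an aside, a typical sshuttle invocation for this setup might look like the following; the subnet to forward is an assumption based on the BMC addresses above, not something stated in the thread.)

$ sshuttle -r jeremyeder@rhelai-gw.int.massopen.cloud 10.2.18.0/24
# tunnels traffic for the BMC subnet through the bastion, as an alternative to the SOCKS proxy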

larsks commented 1 month ago

I will close this issue once Jeremy has confirmed that he is able to access things.

joachimweyl commented 2 weeks ago

Jeremy Eder confirmed he is able to access things 06/04.

joachimweyl commented 4 days ago

@larsks Which 4 wrk nodes did we end up snagging for this? What is the ESI project name for this set of 4 nodes?