(see also #594)
@joachimweyl this should be the highest priority task for MOC until it is finished. @tzumainn and @larsks have it as their highest priority task. We should be able to get Jeremy's folks using the GPUs this week if we're coordinated on priorities.
The completion date on this ticket is not correct. Jeremy should have access by 5/31. The schedule is critical because this testing is part of releasing RHELAI for GA, and every day counts. It is also on our path to getting an open system for AI/ML set up in MOC.
@jtriley what is required to remove these from production?
It looks like these are our A100 nodes:
$ k get node -l nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB
NAME      STATUS   ROLES    AGE   VERSION
wrk-101   Ready    worker   63d   v1.26.7+c7ee51f
wrk-90    Ready    worker   72d   v1.26.7+c7ee51f
wrk-91    Ready    worker   78d   v1.26.7+c7ee51f
wrk-92    Ready    worker   78d   v1.26.7+c7ee51f
wrk-93    Ready    worker   78d   v1.26.7+c7ee51f
wrk-94    Ready    worker   78d   v1.26.7+c7ee51f
wrk-95    Ready    worker   78d   v1.26.7+c7ee51f
wrk-96    Ready    worker   78d   v1.26.7+c7ee51f
wrk-97    Ready    worker   78d   v1.26.7+c7ee51f
wrk-98    Ready    worker   78d   v1.26.7+c7ee51f
wrk-99    Ready    worker   15d   v1.26.7+c7ee51f
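As a sanity check, here's one way to see how many GPUs each of those nodes advertises (a sketch, not from the thread; it assumes the standard nvidia.com/gpu resource exposed by the GPU operator):

$ kubectl describe nodes -l nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB \
    | grep -E '^Name:|nvidia.com/gpu:'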
@jtriley, I'd like to start by cordoning and draining nodes wrk-{90,91,92,93} and then deleting them from the cluster. Let me know if I should go ahead.
@larsks That's fine; we just need to be careful about user workloads that might be using emptydir storage on those hosts, as that data will be deleted.
@larsks I'm taking a quick peek at oc adm drain --dry-run=server for those hosts now to see what would need to be cleaned up.
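For reference, that dry run looks roughly like this (a sketch; the exact invocation may have differed):

$ for node in wrk-90 wrk-91 wrk-92 wrk-93; do
    oc adm drain "$node" --dry-run=server --ignore-daemonsets
  done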
Those hosts have the following user workloads on them currently using emptydir:
rhods-notebooks/jupyter-nb-sqasim-40bu-2eedu-0
ai4cloudops-f7f10d9/url-shorten-mongodb-756c75cfb-6c6k7
ai4cloudops-f7f10d9/user-mongodb-dbc858894-hk9w7
sail-24887a/wage-gap-calculator-mongo-bb6c9999b-tz9nl
gis-data-science-big-data-projects-at-cga-b231ed/postgres-postgres-rcqx-0
scsj-86c3ca/postgres-repo-host-0
smart-village-faeeb6c/postgres-repo-host-0
In the past we've reached out to folks to confirm those pods are safe to delete. I'm looking at the other GPU nodes to see if there are any that don't have this issue.
They all have some user workloads on them. Here's a count of pods per node for the GPU nodes:
$ kubectl get pods -A -o jsonpath='{range .items[?(@.spec.nodeName)]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c | sort -rn | grep -E 'wrk-9[0-9]|wrk-101'
1635 wrk-91
1221 wrk-98
1021 wrk-95
265 wrk-92
102 wrk-99
60 wrk-94
58 wrk-97
45 wrk-93
44 wrk-96
44 wrk-90
41 wrk-101
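For completeness, one way to list the pods on a given node that mount emptyDir volumes (a sketch using jq; wrk-90 is just an example, and this isn't necessarily the query that was used):

$ kubectl get pods -A -o json --field-selector spec.nodeName=wrk-90 \
    | jq -r '.items[]
             | select(any(.spec.volumes[]?; has("emptyDir")))
             | "\(.metadata.namespace)/\(.metadata.name)"'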
The smart-village, rhods-notebooks, and sail workloads can all be shut down right away. The others look like research projects whose owners should be contacted to let them know we're going to shut their workloads down and that they need to move to a different machine.
I've cordoned wrk-9[0-3] as requested. This is the final list of user pods using emptyDir after cordoning:
ai4cloudops-f7f10d9/llm-for-traces-0
rhods-notebooks/jupyter-nb-sqasim-40bu-2eedu-0
ai4cloudops-f7f10d9/url-shorten-mongodb-756c75cfb-6c6k7
ai4cloudops-f7f10d9/user-mongodb-dbc858894-hk9w7
sail-24887a/wage-gap-calculator-mongo-bb6c9999b-tz9nl
gis-data-science-big-data-projects-at-cga-b231ed/postgres-postgres-rcqx-0
scsj-86c3ca/postgres-repo-host-0
smart-village-faeeb6c/postgres-repo-host-0
@Milstein can we please reach out to those users and notify them that their pods need to be deleted from the host they're currently on? Please point out that those pods are using some data directories that will not be saved/persisted after this happens. Also see @hpdempsey comment above about the list and who needs contacting: https://github.com/nerc-project/operations/issues/595#issuecomment-2140554360
@jtriley @hpdempsey it's not a problem if you interrupt the smartvillage postgres pod.
@jtriley: if the pod is deleted, is it automatically re-created on a new wrk node?
@Milstein If a pod has a controller -- like a deployment, statefulset, etc. -- it will get re-created on another node. If it's just a raw pod with no controller, it will not.
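One quick way to check (a sketch; the pod here is just an example from the list above) is to look at the pod's ownerReferences:

$ kubectl get pod -n ai4cloudops-f7f10d9 user-mongodb-dbc858894-hk9w7 \
    -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'

Empty output means a bare pod with no controller; a ReplicaSet, StatefulSet, etc. means the pod will be re-created elsewhere after deletion.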
@larsks: some of these pods have a lot of postgres data (PVs and PVCs). I wonder if those will be retained along with the pod, or do the users need to back up first?
@Milstein If the postgres data is on a volume, the volume will persist. If the data is not on a volume and the pod is deleted, the data will be lost.
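To see which of a pod's volumes are actually PVC-backed, something like this works (a sketch using one of the pods above):

$ kubectl get pod -n gis-data-science-big-data-projects-at-cga-b231ed postgres-postgres-rcqx-0 \
    -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.persistentVolumeClaim.claimName}{"\n"}{end}'

Volumes that print a claim name are backed by a PVC and survive pod deletion; anything without one (e.g. emptyDir) is lost with the pod.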
Also ok to interrupt scsj-86c3ca/postgres-repo-host-0.
I am deleting pods on these nodes for the projects we've already identified as interruptible:
Also, please delete these pods:
rhods-notebooks/jupyter-nb-sqasim-40bu-2eedu-0
ai4cloudops-f7f10d9/llm-for-traces-0
gis-data-science-big-data-projects-at-cga-b231ed/postgres-postgres-rcqx-0
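For the record, the corresponding deletions would look like this with plain kubectl (a sketch; the actual cleanup may have gone through other tooling):

$ kubectl delete pod -n rhods-notebooks jupyter-nb-sqasim-40bu-2eedu-0
$ kubectl delete pod -n ai4cloudops-f7f10d9 llm-for-traces-0
$ kubectl delete pod -n gis-data-science-big-data-projects-at-cga-b231ed postgres-postgres-rcqx-0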
Notebook jupyter-nb-sqasim-40bu-2eedu deleted.
I have just notified the ai4cloudops team. We need to wait until we hear back from them.
It looks like we can drain all the nodes except for wrk-91, which has some ai4cloudops workloads that are using emptydir volumes.
I have drained nodes wrk-{90,92,93}.
Just got confirmation that ai4cloudops pods can also be restarted!
All the nodes have been drained. @jtriley I'd like to go ahead and remove these nodes from the cluster (oc delete node...).
@larsks just ran the drain again with --ignore-daemonsets --delete-emptydir-data. I will go ahead and delete them and ping @hakasapl to flip the ports to ESI.
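The per-node sequence being described is roughly (a sketch; actual invocations may have differed slightly):

$ for node in wrk-90 wrk-91 wrk-92 wrk-93; do
    oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data
    oc delete node "$node"
  done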
> just ran the drain again with --ignore-daemonsets --delete-emptydir-data

@jtriley yes, that is how I ran it as well.
All of those hosts have been drained, deleted, and powered off. @hakasapl we can move these hosts' ports to ESI.
No worries, I just noticed that running the drain again as a drill was still taking action and complaining 🤷
Hakan plans to configure the 4 servers for ESI on 5/31 afternoon.
Hakan remembered that the requisite cable was already installed, so he configured the servers on the evening of 5/30. @tzumainn added the nodes to ESI, where they passed inspection and provisioning. He has leased three of the four nodes to RHELAI. The nodes had to be booted in legacy mode instead of UEFI mode in order to be leased. Mainn is experimenting with the fourth node, which is still in UEFI mode, to see if he can get it to work.
@larsks please verify that the three legacy nodes work for Jeremy's general use case, and if so, you or Mainn can walk Jeremy through accessing them per https://massopencloud.slack.com/archives/C027TDE52TZ/p1717106255073829?thread_ts=1717088865.659369&cid=C027TDE52TZ . Jeremy has his NERC account, but we haven't asked him to access the ESI dashboard yet. The directions talk about an MGHPCC account, which will be confusing, since the first account he applied for is called a NERC account.
@hakasapl reconfigured the servers for ESI on 5/30 (the cable was already in place). @tzumainn put them in ESI and they passed the inspection and provisioning tests. @tzumainn leased three of the four to RHELAI. However, they had to boot in legacy mode, not UEFI, for this to work. Mainn is still experimenting with one node in UEFI mode.
@larsks please check out the three legacy nodes for what you know of Jeremy's general use case. Lars/Mainn, walk Jeremy through access following https://massopencloud.slack.com/archives/C027TDE52TZ/p1717106255073829?thread_ts=1717088865.659369&cid=C027TDE52TZ . The directions talk about an MGHPCC account, which will be confusing, because Jeremy applied for a NERC account first. (We probably need to add to the directions when we have time, because we don't ask users to get an MGHPCC account now -- even though they could.)
We had to drop ESI because of some conflicts between ESI and UEFI booting (and between legacy boot mode and system stability). We have attached all the nodes to a public network and set up a bastion host for access to the BMCs.
The bastion host is rhelai-gw.int.massopen.cloud. Admin access is via the cloud-user account, to which I have added ssh keys for @larsks, @naved001, and @hakasapl.
Here's the email I sent to Jeremy:
Jeremy,
There are now four GPU nodes available for your use.
We have configured a bastion host at rhelai-gw.int.massopen.cloud to
provide access to the BMCs of the GPU nodes. You can log in as jeremyeder
using the private key associated with https://github.com/jeremyeder.keys.
The BMC addresses are in the /etc/hosts file:
10.2.18.127 gpu0-bmc
10.2.18.128 gpu1-bmc
10.2.18.130 gpu2-bmc
10.2.18.131 gpu3-bmc
You will find credentials for these devices in the bmc-credentials.txt
file in your home directory.
We have allocated four IP addresses for you on the 129.10.5.0/24
network:
- 129.10.5.160
- 129.10.5.161
- 129.10.5.162
- 129.10.5.163
The default gateway for this network is 129.10.5.1. You will need to
assign these addresses manually; there is no DHCP service available.
You have a few options for accessing the BMCs from your local machine.
This is what I do:
1. Establish a SOCKS proxy when connecting to rhelai-gw:
ssh -D1080 jeremyeder@rhelai-gw.int.massopen.cloud
2. Configure your browser to use the SOCKS proxy at localhost:1080. I
use Proxy SwitchyOmega [1] in Chrome and FoxyProxy [2] in Firefox.
Some of my colleagues prefer to use sshuttle [3] instead.
Once you have established a connection:
3. Log in to the BMC using the credentials from bmc-credentials.txt.
4. Select the "Remote Console" menu option from the left navigation
bar.
5. Select the "Launch console" button at the top of the following
screen. You will need to enable popups for this to work.
6. Once the console is open (I had to switch browsers to get this to
work), click the "Media" button at the top of the page to access
the virtual media functions.
7. Click "Activate" to enable virtual media features, then "Browse" to
select an installer image, and finally "Mount all local media" to
mount the media on the remote system.
8. At this point you should be able to power on the system using the
"Power" button at the top of the screen and have it boot into your
installer.
Let me know if you have any questions.
[1]: https://chromewebstore.google.com/detail/proxy-switchyomega/padekgcemlokbadohgkifijomclgjgif
[2]: https://addons.mozilla.org/en-US/firefox/addon/foxyproxy-standard/
[3]: https://github.com/sshuttle/sshuttle
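Two small sketches to supplement the email above (these are not part of the original message; the nmcli connection name is a placeholder):

# assign one of the static addresses manually on a node, e.g. with nmcli on a RHEL-style system
$ nmcli con mod "Wired connection 1" ipv4.addresses 129.10.5.160/24 ipv4.gateway 129.10.5.1 ipv4.method manual
$ nmcli con up "Wired connection 1"

# sshuttle alternative to the browser SOCKS proxy, covering the BMC subnet
$ sshuttle -r jeremyeder@rhelai-gw.int.massopen.cloud 10.2.18.0/24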
I will close this issue once Jeremy has confirmed that he is able to access things.
Jeremy Eder confirmed on 06/04 that he is able to access things.
@larsks Which 4 wrk nodes did we end up snagging for this? What is the ESI project name for this set of 4 nodes?
Motivation
RHELAI needs 16 GPUs for testing for 1 month.
Completion Criteria
4 Lenovo GPU nodes usable by RHELAI team.
Description
Completion dates
Desired - 2024-05-31
Required - TBD