nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Move AI hosts back to ESI from RHELAI #671

Closed larsks closed 1 month ago

larsks commented 2 months ago

Motivation

We allocated four hosts for RHELAI testing in https://github.com/nerc-project/operations/issues/595. We weren't able to use ESI at the time because of some issues around UEFI booting. Those issues have since been resolved, so we should move those four systems back under ESI control.

BMC addresses of the allocated nodes:

10.2.18.127
10.2.18.128
10.2.18.130
10.2.18.131

Completion Criteria

Move all 4 into ESI. Provide one back to the RHELAI team, provide one to OpenShift testing, and move the other 2 back into OpenShift Prod.

Description

Completion dates

Desired - 2024-08-16
Required - TBD

larsks commented 2 months ago

(We should make sure people aren't actively using them first; the best contact is probably Jeremy Eder or @hpdempsey.)

hpdempsey commented 2 months ago

I checked with Jeremy, and none of these nodes are currently in use this week. Please move them back to ESI. @tzumainn is there anything we need to do besides adding them back in to ESI to "recapture" them as available?

One person on the RHELAI team is going to want to use one node for pen testing next week. Ideally we can just lease him one GPU when he is ready to do this testing, right @tzumainn? (BTW, I raised that we don't want things like DoS pen testing happening on MOC, but he said it was just internal pen testing.)

tzumainn commented 2 months ago

The nodes are actually still in ESI; they're simply in maintenance mode. What I would do is:

a) delete the rhelai leases
b) take them out of maintenance mode
c) update their capabilities so they boot using UEFI
d) provision one just to be sure that everything works as expected
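Steps a) through c) could be scripted roughly as follows. This is a minimal sketch, not the commands that were actually run: it only builds the command lines and prints them as a dry run. The `openstack baremetal` subcommands are the standard Ironic client ones; the `openstack esi lease delete` subcommand is assumed from the ESI client plugin, and the node name and lease ID are hypothetical placeholders. Step d) is left out because it depends on the image and network chosen.

```python
import shlex


def recapture_commands(node, lease_ids):
    """Build (but do not run) the CLI commands for steps a) through c)."""
    commands = []
    # a) delete the existing leases on the node (assumed ESI client subcommand)
    for lease_id in lease_ids:
        commands.append(["openstack", "esi", "lease", "delete", lease_id])
    # b) take the node out of maintenance mode
    commands.append(
        ["openstack", "baremetal", "node", "maintenance", "unset", node]
    )
    # c) set the boot_mode capability so the node boots using UEFI
    commands.append([
        "openstack", "baremetal", "node", "set", node,
        "--property", "capabilities=boot_mode:uefi",
    ])
    return commands


if __name__ == "__main__":
    # Hypothetical node name and lease ID, for illustration only.
    for cmd in recapture_commands("example-gpu-node", ["example-lease-id"]):
        print(shlex.join(cmd))
```

Setting the `capabilities` property this way overwrites any existing capabilities string, so a real script would probably read the current value first and merge `boot_mode:uefi` into it.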

@hpdempsey just to be realllllly sure - I can do all this now, correct?

hpdempsey commented 2 months ago

Yes, you can reallllly do it now! ;-)

tzumainn commented 2 months ago

Okay! This is all done. I also provisioned one of the nodes with an ubuntu image, and verified that it booted in UEFI mode. After booting, I was able to login, see the GPU mode, and confirm that the node stayed on for more than a few minutes (which was the issue when booting in legacy mode).
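For anyone repeating the check on a provisioned node: one common way to confirm the boot mode from inside a Linux system is to look for the `/sys/firmware/efi` directory. The comment above does not say how the verification was done, so this is an assumed method, shown as a small sketch. The `efi_dir` parameter exists only to make the function easy to test.

```python
import os


def booted_in_uefi(efi_dir="/sys/firmware/efi"):
    """Return True if the running Linux system was booted via UEFI.

    The kernel only exposes /sys/firmware/efi when it was started by
    UEFI firmware; under legacy BIOS boot the directory is absent.
    """
    return os.path.isdir(efi_dir)


if __name__ == "__main__":
    print("UEFI" if booted_in_uefi() else "legacy BIOS")
```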

Is it possible to acquire the images that the RHELAI engineers are going to use? It might be nice to upload those images for them, and to do a provisioning test just to ensure there are no kinks.

joachimweyl commented 2 months ago

@dystewart one of these nodes is the one that Heidi was suggesting could be used for testing. What do you need done to move one of these over to OpenShift Test?

hpdempsey commented 2 months ago

The image that the RHELAI engineers were using was the RHELAI candidate image for GA. The final GA image won't be available until Sept. 4. I am not sure what image the pen testing engineer will want to use, but I know that the RHELAI group said it wouldn't be available until next week, so we can't pre-load the right image. If you want to just try loading any dev preview version just for interest's sake, talk to Taj or Chris Tate for pointers to versions they were looking at @tzumainn

hpdempsey commented 2 months ago

> @dystewart one of these nodes is the one that Heidi was suggesting could be used for testing. What do you need done to move one of these over to OpenShift Test?

I think @dystewart should be able to just access ESI and do it himself at this point.

tzumainn commented 2 months ago

> The image that the RHELAI engineers were using was the RHELAI candidate image for GA. The final GA image won't be available until Sept. 4. I am not sure what image the pen testing engineer will want to use, but I know that the RHELAI group said it wouldn't be available until next week, so we can't pre-load the right image. If you want to just try loading any dev preview version just for interest's sake, talk to Taj or Chris Tate for pointers to versions they were looking at @tzumainn

@tssala23 or @computate, do you have a pointer to one of these RHELAI images?

computate commented 2 months ago

@tzumainn the docs for installing RHELAI are not easy, and involve a separate machine for building the host bootc containers—separate from the machine that will actually run RHELAI. See the installation docs for RHELAI. Do you want to work together to install it between 2 machines in ESI?

tzumainn commented 2 months ago

I'm very new to this, but I had thought that there was a downloadable bootc image available... ?

joachimweyl commented 2 months ago

@computate it sounds like we might need more hardware allocated to the RHELAI project since currently, the only hardware they have is a single GPU node.

computate commented 2 months ago

We might need new hardware, @joachimweyl. @tzumainn, I think we need to meet with a real person from RHELAI to know for sure.

computate commented 2 months ago

@tzumainn are you the right person to set up a separate ESI machine with these requirements?

tzumainn commented 2 months ago

@computate I'm the right person to ask about getting access to the machine :) You'd want to sign up for the MOC ESI by following the steps here:

https://esi.readthedocs.io/en/latest/moc-esi/sso.html#new-user-sign-up

After that's done, I could create an OpenStack project and assign you a node. You'd have to do the setup of the node - its network and configuration - using the information here:

https://esi.readthedocs.io/en/latest/usage/new_user_guide.html

At a quick glance, we don't have a RHEL 9.4 image, so you might have to locate that yourself.

Is this machine for building the bootc image? If so, I just want to clarify - the machine that will actually run RHELAI doesn't have a direct dependency on the build machine, correct?

tssala23 commented 2 months ago

@tzumainn When you create the project, could I also get access to it?

tzumainn commented 2 months ago

> @tzumainn When you create the project could I also get access to the project

Definitely!

computate commented 2 months ago

@tzumainn Yes, I would like to work together with @tssala23 on this. The RHELAI image builder machine will be separate from the RHELAI machine with the GPUs. We will need to boot the RHELAI machine with the rhelai-dev-preview-bootc-ks.iso image that we will build on the RHELAI image builder machine at some point.

computate commented 2 months ago

@tzumainn It looks like I'm already registered for an account, ctate@redhat.com, but I get "Login failed: You are not authorized for any projects or domains." when I try to log in.

tzumainn commented 2 months ago

@computate is there an existing project this goes under, or should I create a new one? If the latter, just let me know what the name should be!

computate commented 2 months ago

@tzumainn probably a new one like rhelai or rhel-ai.

tzumainn commented 2 months ago

would research_rhelai be okay?

tssala23 commented 2 months ago

That should be good

tzumainn commented 2 months ago

Okay! I've created the project research_rhelai, added both of you to the project, and assigned the node MOC-R4PAC24U35-S1A. Let me know if you need anything else!

joachimweyl commented 2 months ago

@tzumainn 2 questions

  1. RHELAI project now has 1 of the 4 GPU nodes and now the addition of MOC-R4PAC24U35-S1A which is a FC830 or FC430?
  2. What is currently happening with the other 3 nodes?
    1. What needs to happen for the last 2 steps of this issue, which are moving 1 to OpenShift test and one to prod?

tzumainn commented 2 months ago

> @tzumainn 2 questions
>
>   1. RHELAI project now has 1 of the 4 GPU nodes and now the addition of MOC-R4PAC24U35-S1A which is a FC830 or FC430?
>   2. What is currently happening with the other 3 nodes? They are sitting in ESI waiting next steps?

@joachimweyl not quite - there is 1 GPU node leased to the ope project. I've been told the rhelai project will want a GPU node eventually, but it has not been assigned to them yet. So there is 1 GPU node leased, and 3 unleased.

Also for clarification - MOC-R4PAC24U35-S1A is leased to research_rhelai which is separate from rhelai. It's a FC430.

tssala23 commented 2 months ago

@joachimweyl From reading through the RHELAI doc, it seems we will need to build the image on a machine separate from the one where RHEL AI will actually be deployed. So for now we only have 1 FC430.

tzumainn commented 2 months ago

Oh, and in regards to moving the nodes to OpenShift test and prod - that would be handled by someone on those teams, who should have access to the leased nodes.

joachimweyl commented 2 months ago

@jtriley or @aabaris what are the next steps to move these to OpenShift test and OpenShift prod?

tzumainn commented 2 months ago

Whoops, I now understand that three of the GPU nodes should be leased out - I've created leases for the following nodes:

They've all been leased to the nerc-admins project; let me know if they should go into a different project.

joachimweyl commented 2 months ago

@dystewart was the node allocated for ope the one for the test cluster?

tzumainn commented 2 months ago

Okay, wait - looking at the node leases, MOC-R8PAC23U27 was previously leased to ope at Dylan's request, for the test cluster. I'll delete the lease to MOC-R8PAC23U31 and save that for rhelai.

joachimweyl commented 2 months ago

@jtriley & @aabaris to be clear, the two going to OpenShift Prod are MOC-R8PAC23U28 and MOC-R8PAC23U30.
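As a quick sanity check, the node destinations mentioned so far in this thread can be tallied against the completion criteria in the issue description. This is only an illustrative sketch: the mapping is read off the comments above (MOC-R8PAC23U27 for the test cluster, MOC-R8PAC23U31 saved for rhelai, the other two for OpenShift Prod), and the destination labels are informal names, not ESI project names.

```python
from collections import Counter

# Planned destinations as stated in the comments above.
PLAN = {
    "MOC-R8PAC23U27": "OpenShift test",
    "MOC-R8PAC23U28": "OpenShift Prod",
    "MOC-R8PAC23U30": "OpenShift Prod",
    "MOC-R8PAC23U31": "RHELAI",
}

# Completion criteria from the issue description.
CRITERIA = {"RHELAI": 1, "OpenShift test": 1, "OpenShift Prod": 2}


def meets_criteria(plan, criteria):
    """Check that each destination receives exactly the required number of nodes."""
    return Counter(plan.values()) == Counter(criteria)


if __name__ == "__main__":
    print(meets_criteria(PLAN, CRITERIA))
```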

joachimweyl commented 2 months ago

New ESI project for testing: nerc-test

tzumainn commented 2 months ago

I've created the nerc-test project and created a lease for node MOC-R8PAC23U27. All the members of nerc-admins and Dylan have been added to the new project.

Let me know if I got anything confused!

joachimweyl commented 2 months ago

@dystewart have you moved MOC-R8PAC23U27 to openshift test cluster yet?

tzumainn commented 2 months ago

Moved MOC-R8PAC23U28 to the OpenShiftBeta project!

joachimweyl commented 1 month ago

@dystewart please update this issue once you have moved the last node to test cluster