nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Add 2 A100s to ocp-beta-test cluster #690

Closed tssala23 closed 1 week ago

tssala23 commented 3 weeks ago

This cluster is going to be used by Michael Clifford and Sanjay Arora to run some InstructLab experiments. They have said they will need at least 8 GPUs, which means 2 A100 servers.

tzumainn commented 3 weeks ago

There are four A100s, but three of them are already leased:

  • MOC-R8PAC23U27: ope
  • MOC-R8PAC23U28: nerc-admins
  • MOC-R8PAC23U30: nerc-admins
  • MOC-R8PAC23U31

Is ope still using their A100?

hpdempsey commented 3 weeks ago

On Tue, Aug 20, 2024 at 11:27 AM Tzu-Mainn Chen @.***> wrote:

There are four A100s, but three of them are already leased:

  • MOC-R8PAC23U27: ope
  • MOC-R8PAC23U28: nerc-admins
  • MOC-R8PAC23U30: nerc-admins
  • MOC-R8PAC23U31

Is ope still using their A100?

We have more A100s. What do we need to do to get them to show up in ESI?
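
As a rough illustration of the lease check discussed here, the following Python sketch models the four nodes as a plain dictionary and lists the ones with no lessee. The `leases` dictionary and both helper functions are hypothetical; they only mirror the list quoted in this comment and do not call ESI.

```python
# Hypothetical snapshot of the lease list quoted in this thread.
# It is typed in by hand, not queried from ESI.
# A value of None means the node has no current lease.
leases = {
    "MOC-R8PAC23U27": "ope",
    "MOC-R8PAC23U28": "nerc-admins",
    "MOC-R8PAC23U30": "nerc-admins",
    "MOC-R8PAC23U31": None,
}


def free_nodes(leases):
    """Return the names of nodes that have no lessee, sorted by name."""
    return sorted(name for name, project in leases.items() if project is None)


def nodes_leased_to(leases, project):
    """Return the names of nodes leased to the given project, sorted by name."""
    return sorted(name for name, owner in leases.items() if owner == project)


print("free:", free_nodes(leases))
print("nerc-admins:", nodes_leased_to(leases, "nerc-admins"))
```

With this snapshot, only one node is free, which is why the request for two servers could not be met without moving a node out of another project.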

hpdempsey commented 3 weeks ago

There are 64 altogether. Some are currently allocated to OpenStack, so we probably don't need ESI to have those in its inventory, but the rest should be there.

tzumainn commented 3 weeks ago

@hakasapl are the additional A100s something that can be added to ESI?

tssala23 commented 3 weeks ago

@tzumainn for the time being, the one free node is enough to get started with

tzumainn commented 3 weeks ago

yep - added to OpenShiftBeta!

tssala23 commented 3 weeks ago

Thank you!

hakasapl commented 3 weeks ago

I think the A100 nodes not in ESI are in NERC OpenStack/OpenShift. Maybe we should bring it up in the weekly NERC meeting? Unless they have already removed those nodes from NERC and I can just switch them over? If that's the case, which nodes are they?

joachimweyl commented 3 weeks ago

@hpdempsey what project are these 2 Nodes for and does it have funding?

hpdempsey commented 2 weeks ago

These nodes are for the project to run InstructLab/RHELAI on OpenShift AI in the MOC. I am funding it.

hpdempsey commented 2 weeks ago

We can't use production OpenShift AI for this project, which is why we can't use the GPUs that might already be allocated to production OpenShift. Unless those are all actively in use, we should re-allocate them to this project through ESI.

tssala23 commented 2 weeks ago

@joachimweyl are we good to move the two A100s assigned to nerc-admins to this project?

tzumainn commented 2 weeks ago

The two A100s allocated to nerc-admins are in the 'available' state, meaning they are not being actively used right now. @joachimweyl is it possible to get confirmation that they are okay to move?

joachimweyl commented 2 weeks ago

@tzumainn would you please add MOC-R8PAC23U28 to the OpenShiftBeta project? Once you're done please check off the InstructLab checkbox in this issue.

tzumainn commented 2 weeks ago

Done!

larsks commented 2 weeks ago

@hpdempsey I was helping @tssala23 diagnose a problem with unexpected reboots on the GPU node, and it looks like there may be a hardware problem. The event log for the node contains multiple repetitions of:

A software NMI has occurred on system ThinkSystem SD650-N V2.   August 27, 2024 5:38:10 PM
Fault in slot All PCI Error on system ThinkSystem SD650-N V2.   August 27, 2024 5:38:02 PM
An Uncorrectable PCIe Error has Occurred at Bus 00 Device 1C Function 00. The Vendor ID for the device is 8086 and the Device ID is A190. The physical slot number is 0.    August 27, 2024 5:37:59 PM

@hakasapl thinks a virtual reseat on the chassis controller might clear up the problem, but the chassis controllers for these systems are not available at the moment. He has asked Tech Square to go in and physically reseat the blade, which will probably happen tomorrow.

@tssala23 is looking into the possibility of simply swapping in another node, rather than waiting for the reseat.
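
For anyone who wants to pull the device details out of that event log, here is a minimal Python sketch. The regular expression is written against the single error line quoted above and is an assumption about the message format; other firmware versions may word it differently. The `parse_pcie_error` helper is hypothetical, and the bus, device, and function values are assumed to be hexadecimal.

```python
import re

# Pattern written against the one event-log line quoted in this thread.
# This is an assumption about the format, not a documented log schema.
PCIE_ERROR = re.compile(
    r"Uncorrectable PCIe Error has Occurred at "
    r"Bus (?P<bus>[0-9A-Fa-f]+) Device (?P<device>[0-9A-Fa-f]+) "
    r"Function (?P<function>[0-9A-Fa-f]+)\. "
    r"The Vendor ID for the device is (?P<vendor>[0-9A-Fa-f]{4}) "
    r"and the Device ID is (?P<device_id>[0-9A-Fa-f]{4})"
)


def parse_pcie_error(line):
    """Return the PCIe address and IDs from an event-log line, or None."""
    match = PCIE_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Assume the log prints bus/device/function in hex, then format them
    # as bus:device.function in lowercase hex.
    bus = int(fields["bus"], 16)
    device = int(fields["device"], 16)
    function = int(fields["function"], 16)
    fields["address"] = f"{bus:02x}:{device:02x}.{function:x}"
    return fields


line = (
    "An Uncorrectable PCIe Error has Occurred at Bus 00 Device 1C Function 00. "
    "The Vendor ID for the device is 8086 and the Device ID is A190. "
    "The physical slot number is 0."
)
info = parse_pcie_error(line)
print(info["address"], info["vendor"], info["device_id"])
```

Vendor ID 8086 is Intel's PCI vendor ID (NVIDIA's is 10de), so the device reporting the error is an Intel part, though that alone does not say which component is at fault.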

tzumainn commented 2 weeks ago

@joachimweyl given the issues would it be possible to assign them MOC-R8PAC23U30 instead?

joachimweyl commented 2 weeks ago

approved.

tzumainn commented 2 weeks ago

Okay, done! Should I cancel the lease for MOC-R8PAC23U28 then, or leave it assigned to OpenShiftBeta?

joachimweyl commented 2 weeks ago

cancel please

tzumainn commented 2 weeks ago

Done!

hpdempsey commented 2 weeks ago

Can ESI show this as down or something so no one else tries to use it until we know it is working? Also, I assume @joachimweyl needs to remove this from billing somehow, since no one can use it while in this state.

tzumainn commented 2 weeks ago

@hakasapl the re-seat happened, correct? so it's possible this node works again?

tzumainn commented 2 weeks ago

if not, I'll put it in maintenance mode, which is what we do to show a node has issues

tssala23 commented 2 weeks ago

@tzumainn the re-seat did happen. So it is possible that it is working again.

tzumainn commented 2 weeks ago

Okay! Should this node be assigned somewhere then?

joachimweyl commented 2 weeks ago

@hpdempsey unless you know of upcoming requests from RH for an A100 node, @tzumainn I would suggest nerc-admins, so it is ready for a quick move to prod.

tssala23 commented 2 weeks ago

@joachimweyl @hpdempsey I'm not sure about the utilization of these in prod, but InstructLab on RHOAI will need another one at some point, as they want to demonstrate multi-node training. It also sounds like we should be able to test running RHEL AI, which would need one as well (though there may be one already on hold for RHEL AI purposes).

joachimweyl commented 2 weeks ago

MOC-R8PAC23U31 is in the rhelai project. MOC-R8PAC23U30 is allocated for InstructLab. Since prod does not need it yet, we can leave it as a floater in the nerc-admins ESI project, so we can move it back as soon as prod does need it. If InstructLab or some other RH project needs it first, we can change its project.

tssala23 commented 1 week ago

MOC-R8PAC23U30 & MOC-R8PAC23U25 have both been attached to the cluster