nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Ensure H100s can be moved around (most likely through ESI) #621

Open joachimweyl opened 5 days ago

joachimweyl commented 5 days ago

Motivation

We will need our H100s to be able to shift between OpenStack, OpenShift, and bare metal (BM). Currently our most promising way to do this is ESI. There are some hurdles to get past before we can do this, so we are going to track them here and discuss ways to get over them.

Completion Criteria

A plan is in place to ensure the H100s can be moved between all of our offerings.

Description

Discussion

Completion dates

Desired - 2024-07-10
Required - TBD

naved001 commented 5 days ago

While partition keys are analogous to VLAN IDs, I don't know how the rest of it works. For example, could someone just set up a subnet manager on their host and make changes to the IB fabric?

I'd really like to make sure that isolation is guaranteed on the IB networks before trying to do any of the hacky work in ESI.
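
For reference, here is a minimal sketch of the partition key matching rule that makes P_Keys behave somewhat like VLAN IDs. It assumes the standard InfiniBand layout of a 16-bit P_Key, where the top bit marks full membership and the low 15 bits identify the partition. The function names are hypothetical and only model the rule; they do not talk to any real fabric, and they say nothing about whether a rogue subnet manager could rewrite the partition tables, which is the open question above.

```python
# Sketch of InfiniBand P_Key matching (illustrative only, not tied to any library).
FULL_MEMBER_BIT = 0x8000   # top bit set: full member; clear: limited member
PKEY_BASE_MASK = 0x7FFF    # low 15 bits identify the partition


def pkey_base(pkey: int) -> int:
    """Return the 15-bit partition number of a 16-bit P_Key."""
    return pkey & PKEY_BASE_MASK


def is_full_member(pkey: int) -> bool:
    """Return True if the P_Key carries full membership."""
    return bool(pkey & FULL_MEMBER_BIT)


def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Two ports match if they share a partition and at least one is a full member."""
    same_partition = pkey_base(pkey_a) == pkey_base(pkey_b)
    return same_partition and (is_full_member(pkey_a) or is_full_member(pkey_b))


if __name__ == "__main__":
    # Full member and limited member of partition 0x0001 can talk to each other.
    print(can_communicate(0x8001, 0x0001))
    # Two limited members of the same partition cannot.
    print(can_communicate(0x0001, 0x0001))
    # Members of different partitions cannot, regardless of membership type.
    print(can_communicate(0x8001, 0x8002))
```

Unlike a VLAN tag, the membership bit means two limited members of the same partition are still isolated from each other, so a shared-services partition can be built with one full member and many limited ones.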

hakasapl commented 5 days ago

@naved001 I'm not sure; we'll need to test that once we install the borrowed switch.

We can borrow an unmanaged EDR IB switch from UMass (Mellanox SB7790). Because it is unmanaged we'll need an external subnet manager, which we can just host on an ESI node for testing purposes.

tzumainn commented 5 days ago

In the meantime, here are the parameters of what I think can be done for this in ESI, as well as some considerations:

Just some of my thoughts, I'd be interested in any discussion!