Closed dystewart closed 2 weeks ago
Thanks @dystewart for creating this. One note of clarification: there will be three services/pods that need to be on A100 nodes, and 40-50 RHOAI workbenches that won't need GPU nodes. Ideally we'd have a means of handling these disparate needs accordingly. I assume that for the workbenches we just specify in the .yaml that no accelerator is needed.
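For the workbench case, a minimal sketch of what that could look like (pod name, container name, and image are placeholders): omitting any `nvidia.com/gpu` request or limit leaves the scheduler free to place the pod on a non-GPU node.

```yaml
# Hypothetical workbench pod spec: no accelerator requested,
# so the pod can be scheduled on a non-GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: rhoai-workbench        # placeholder name
spec:
  containers:
    - name: workbench
      image: example.com/workbench:latest   # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        # note: no nvidia.com/gpu entry anywhere in the spec
```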
Your initial option above seems to make sense to me but I'll clarify the three services and how they're expected to operate:
I expect the inference service will need autoscaling enabled, since the inference is reliant on GPU nodes.
Any thoughts on this?
Here is documentation on how to select a specific GPU.
Use a nodeSelector in AI4DD workloads to land on A100 nodes. This is the simpler option, but there is no guarantee that A100 resources will be available.
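A minimal sketch of the nodeSelector approach, assuming the cluster's nodes are labeled by NVIDIA GPU Feature Discovery (the exact `nvidia.com/gpu.product` value should be verified against the cluster's A100 nodes, e.g. with `oc get nodes --show-labels`; the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai4dd-inference        # placeholder name
spec:
  nodeSelector:
    # Label applied by NVIDIA GPU Feature Discovery; confirm the exact
    # product string on this cluster before relying on it.
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  containers:
    - name: inference
      image: example.com/ai4dd/inference:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1    # request one GPU on the selected node
```

If no matching A100 node has capacity, the pod will sit in `Pending` until one frees up, which is the availability caveat noted above.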
I think that's the solution we have in the NERC documentation. It is true that if no A100s are available, the pod will not be scheduled and will stay in a Pending state; in that case it would be nice to get an estimate of how many GPUs should be available for this project.
Thanks for this. I am currently working with the project dev folks to get those estimates and will post them here as soon as I have them!
AI4DD is using A100s
Motivation
The AI4DD team is interested in using only A100 GPU nodes for their research. With V100s also in the cluster, simply requesting a GPU cannot guarantee that a workload lands on an A100 without some manual intervention. There are two ways we can attack this dilemma:
Completion Criteria
Assist the AI4DD team in implementing the desired fix.
Description
Completion dates
Desired - ASAP