Closed dystewart closed 2 weeks ago
Thanks @dystewart for creating this. One note of clarification: there will be three services/pods that need to be on A100 nodes, and 40-50 RHOAI workbenches that won't need GPU nodes. Ideally we'd have a means of handling these disparate needs accordingly. I assume that for the workbenches we just specify in the .yaml that no accelerator is needed.
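For the workbench case, a minimal sketch of what that could look like (pod name, container name, and image are placeholders): omitting any `nvidia.com/gpu` request or limit leaves the scheduler free to place the pod on a non-GPU node.

```yaml
# Hypothetical workbench pod spec: no accelerator requested,
# so the pod can be scheduled on a non-GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: rhoai-workbench        # placeholder name
spec:
  containers:
    - name: workbench
      image: example.com/workbench:latest   # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        # note: no nvidia.com/gpu entry anywhere in the spec
```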
Your initial option above seems to make sense to me but I'll clarify the three services and how they're expected to operate:
I expect the inference service will need autoscaling enabled, since the inference is reliant on GPU nodes.
Any thoughts on this?
Here is documentation on how to select a specific GPU.
Use a nodeSelector in AI4DD workloads to land on A100 nodes. This is the simpler option, but there is no guarantee that A100 resources will be available.
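A minimal sketch of the nodeSelector approach, assuming the cluster's nodes are labeled by NVIDIA GPU Feature Discovery (the exact `nvidia.com/gpu.product` value should be verified against the cluster's A100 nodes, e.g. with `oc get nodes --show-labels`; the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai4dd-inference        # placeholder name
spec:
  nodeSelector:
    # Label applied by NVIDIA GPU Feature Discovery; confirm the exact
    # product string on this cluster before relying on it.
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  containers:
    - name: inference
      image: example.com/ai4dd/inference:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1    # request one GPU on the selected node
```

If no matching A100 node has capacity, the pod will sit in `Pending` until one frees up, which is the availability caveat noted above.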
I think that's the solution we have in the NERC documentation. It is true that if no A100s are available, the pod will not be scheduled and will stay in a Pending state; in that case it would be nice to get an estimate of how many GPUs should be available for this project.
Thanks for this. I am currently working with the project dev folks to get those estimates and will post them here as soon as I have them!
AI4DD is using A100s
Motivation
The AI4DD team is interested in using only A100 GPU nodes for their research. With V100s also in the cluster, simply requesting a GPU cannot guarantee that a workload lands on an A100 without some manual intervention. There are two ways we can attack this dilemma:
Completion Criteria
Assist the AI4DD team in implementing the desired fix.
Description
Completion dates
Desired - ASAP