intel / intel-technology-enabling-for-openshift

The project focuses on Intel’s enterprise AI and cloud native foundation for Red Hat OpenShift Container Platform (RHOCP) solution enablement and innovation including Intel data center hardware features, Intel technology enhanced AI platform and the referenced AI workloads provisioning for OpenShift.
https://intel.github.io/intel-technology-enabling-for-openshift/
Apache License 2.0

Support heterogeneous (different types of) Intel GPU cards in the same OCP cluster #216

Open uMartinXu opened 8 months ago

uMartinXu commented 8 months ago

Summary

Support heterogeneous (different) Intel GPU cards in the same OCP cluster.

Detail

In this scenario, different Intel GPU cards, such as Max-1100, Flex-140, and Flex-170, are provisioned in the same cluster. Users need a mechanism to select the GPU card on which they want to run their workloads. To align with the taints/tolerations mechanism used by the Red Hat OpenShift AI accelerator profile, we will use the same taints/tolerations mechanism for this feature.

To label (taint) the nodes in the cluster automatically, we will rely on the NFD node tainting feature.

This feature therefore depends on issue https://github.com/openshift/cluster-nfd-operator/issues/356
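
To illustrate how taints and tolerations could steer a workload to one GPU type, here is a minimal Python sketch of the matching rule in simplified form. It is not the project's implementation and makes several assumptions: the taint key `gpu.example.com/product`, the taint values, and the node names are hypothetical, and only `NoSchedule`-style hard taints are modeled.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical taint key; the key NFD would actually apply is not specified in this issue.
TAINT_KEY = "gpu.example.com/product"


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # e.g. "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # an empty effect matches any taint effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration matching rule."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        # An empty key with "Exists" tolerates every taint.
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def schedulable_nodes(nodes: Dict[str, List[Taint]],
                      tolerations: List[Toleration]) -> List[str]:
    """Return the nodes on which every taint is tolerated by some toleration."""
    return [
        name
        for name, taints in nodes.items()
        if all(any(tolerates(tol, taint) for tol in tolerations) for taint in taints)
    ]


# Hypothetical cluster: one tainted node per GPU type (node names are made up).
NODES = {
    "worker-max1100": [Taint(TAINT_KEY, "Max-1100", "NoSchedule")],
    "worker-flex140": [Taint(TAINT_KEY, "Flex-140", "NoSchedule")],
    "worker-flex170": [Taint(TAINT_KEY, "Flex-170", "NoSchedule")],
}

if __name__ == "__main__":
    wants_flex140 = [Toleration(key=TAINT_KEY, value="Flex-140", effect="NoSchedule")]
    print(schedulable_nodes(NODES, wants_flex140))
```

Note that a toleration only permits scheduling onto a tainted node. It does not attract the pod to that node, so a workload would still need a GPU resource request or a node selector to land on a GPU node in the first place.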

Note

This feature covers heterogeneous (different) Intel GPU cards in the same OCP cluster. The different Intel dGPU cards in the same node are not supported.

mythi commented 8 months ago

/cc @tkatila

brgavino commented 8 months ago

The different Intel dGPU cards in the same node are not supported.

This is only because of GAS support and won't rely on NFD labelling or taints/tolerations, correct?

How does this align with future resource requests via DRA? It does seem divergent at first glance.