Open shalberd opened 1 year ago
cc @andrewballantyne
Could this be applied to models as well? Maybe we could have a set of tolerations that allow models to be served on GPU nodes which are dedicated to serving by means of taints.
This is no longer the case now that we have AcceleratorProfiles. I think RHOAI 1.33 or 2.4 has Accelerator Profiles. Tolerations for GPU usage, so that you can effectively use taints, are already covered @bdattoma
This request is for allowing more flexibility in general tolerations for Notebooks (and, I imagine, for all DS Project resources in general -- unrelated to GPUs or Accelerators)
I think this predates the UX flow. Moving to UX.
I think we need to design a way to bring the NotebookTolerations cluster settings to the project so the user can manage their resources against tolerations. This may become more feasible with the added state in the admin view of Habana part 2 & the toleration modal. https://github.com/opendatahub-io/odh-dashboard/issues/1255
Is it possible to set a custom toleration for the accelerator, in case I don't want to use the default nvidia.com/gpu one, which I think is added automatically when attaching the GPU profile?
@bdattoma Yes it is -- when you create the AcceleratorProfile (or modify the one we create on migration) you can pick whatever tolerations you want and as many as you want. Our old world was a single static toleration, so we migrate with that -- but it is modifiable.
The Admin UI is coming in 2.6 I believe, and is currently in incubation if you want to check it out. The tracker: https://github.com/opendatahub-io/odh-dashboard/issues/1255
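To illustrate the point about a profile carrying several tolerations, here is a minimal Python sketch of merging them into a pod spec. It assumes the AcceleratorProfile exposes its tolerations as a list under `spec.tolerations` in the usual Kubernetes toleration shape; the profile contents, the `dedicated=serving` toleration, and the helper name `with_profile_tolerations` are hypothetical and not taken from the dashboard code.

```python
from typing import Any, Dict, List

Toleration = Dict[str, str]


def with_profile_tolerations(pod_spec: Dict[str, Any], profile: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of pod_spec with the profile's tolerations appended (no duplicates)."""
    merged: List[Toleration] = list(pod_spec.get("tolerations", []))
    for toleration in profile.get("spec", {}).get("tolerations", []):
        if toleration not in merged:
            merged.append(toleration)
    return {**pod_spec, "tolerations": merged}


# Hypothetical profile: the migrated default toleration plus one custom toleration.
example_profile = {
    "spec": {
        "identifier": "nvidia.com/gpu",
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
            {"key": "dedicated", "operator": "Equal", "value": "serving", "effect": "NoSchedule"},
        ],
    }
}

example_pod_spec = with_profile_tolerations({"containers": []}, example_profile)
```

The helper returns a new dict rather than mutating its input, so the same base pod spec can be combined with different profiles.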
Feature description
Currently, the notebook toleration settings from odh dashboard config apply to all notebooks in all namespaces.
Assume we have a cluster with different dedicated nodes per customer:
The idea is to have namespaces per customer. It can be one namespace per user (I have grown used to that concept), but there needs to be a way to ensure that user / workbench namespaces can belong to different customers and have different scheduling placements for pods, i.e. which nodes they land on.
So, my suggestion would be to
Describe alternatives you've considered
For now, we do not have multiple customers with data science project namespaces grouped per customer, so we schedule all notebooks on nodes with a given node taint key, e.g. key: opendatahub, using the existing mechanism in OdhDashboardConfig.
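For readers less familiar with the mechanism, the following Python sketch mirrors the documented Kubernetes rules for matching a toleration against a node taint, using the `opendatahub` key from above. It is an illustration only: the `Exists` operator and `NoSchedule` effect on the notebook toleration are assumptions about what the dashboard generates, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str          # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key (with "Exists") matches every taint key
    operator: str = "Equal"   # Kubernetes defaults to "Equal"
    value: str = ""
    effect: str = ""          # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Check one toleration against one taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator in ("", "Equal"):
        return toleration.value == taint.value
    return False


def can_schedule(tolerations: List[Toleration], taints: List[Taint]) -> bool:
    """A pod fits a node only if every hard taint is tolerated; PreferNoSchedule is soft."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)


# Dedicated notebook node tainted with the key from the example above.
node_taints = [Taint(key="opendatahub", effect="NoSchedule")]
# Assumed shape of the toleration added to every notebook pod.
notebook_tolerations = [Toleration(key="opendatahub", operator="Exists", effect="NoSchedule")]
```

Note that a toleration only permits scheduling onto the tainted nodes; it does not by itself keep notebook pods off other nodes.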
But going forward, moving from cluster-wide to namespace-specific configs will become important, be it for tolerations or for things like linking all service accounts to an image pull secret, including the dynamically created ones for notebooks in data science projects.
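One way the namespace-specific behaviour could work is sketched below in Python: a per-namespace override wins, and the cluster-wide setting remains the fallback. This is a hypothetical design, not an existing dashboard feature; the function name, the override mapping, and the `customer-a` / `customer-b` namespaces and keys are all made up for illustration.

```python
from typing import Dict, List, Optional

Toleration = Dict[str, str]


def resolve_notebook_tolerations(
    namespace: str,
    cluster_default: Optional[Toleration],
    namespace_overrides: Dict[str, List[Toleration]],
) -> List[Toleration]:
    """Pick the tolerations for a notebook pod created in the given namespace."""
    if namespace in namespace_overrides:
        # Namespace-specific settings replace the cluster-wide one entirely.
        return list(namespace_overrides[namespace])
    return [cluster_default] if cluster_default else []


# Cluster-wide toleration, as configured for all notebooks today.
cluster_default = {"key": "opendatahub", "operator": "Exists", "effect": "NoSchedule"}

# Hypothetical per-customer namespaces pinned to differently tainted nodes.
namespace_overrides = {
    "customer-a-user1": [{"key": "customer-a", "operator": "Exists", "effect": "NoSchedule"}],
    "customer-b-user1": [{"key": "customer-b", "operator": "Exists", "effect": "NoSchedule"}],
}

customer_a = resolve_notebook_tolerations("customer-a-user1", cluster_default, namespace_overrides)
fallback = resolve_notebook_tolerations("some-other-project", cluster_default, namespace_overrides)
```

Replacing rather than merging keeps a customer's notebooks from also tolerating the shared `opendatahub` nodes; merging would be the alternative if shared nodes should stay available.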
Anything else?
No response