[Feature Request]: Move tolerationSettings from notebooks generally to data science projects

shalberd commented 1 year ago

Feature description

Currently, the notebook toleration settings from odh dashboard config apply to all notebooks in all namespaces.

Assume we have a cluster with different dedicated nodes per customer:

nodes A-B (worker nodes) tainted NoExecute, Equal, key: customer, value: customer1
nodes C-D (worker nodes) tainted NoExecute, Equal, key: customer, value: customer2

The idea is to have namespaces per customer. It can be one namespace per user (I have grown used to that concept), but there needs to be a way to ensure that user / workbench namespaces can belong to different customers and get different scheduling placements, that is, different nodes on which their pods land.

So, my suggestion would be to:

- move notebookTolerationSettings from being a global setting in the ODH Dashboard Config, applied to all notebooks in all namespaces, to tolerationSettings on specific Data Science Projects, that is, namespace / project-specific
- change effect from NoSchedule to NoExecute, so that pods already running on the node without a matching toleration are evicted and rescheduled onto an untainted node
- change operator from Exists to Equal. Exists is fine for evaluating node taint keys like nvidia.com/gpu, where the value does not matter, I presume. But it is not enough for tolerations where key AND value must match, as in the scenario described above: matching only key: customer would tolerate the nodes of both customers.
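To illustrate why the operator matters here, the following is a minimal sketch of the Kubernetes toleration-matching rules as I understand them, written as plain Python rather than taken from the dashboard or the scheduler. The `Taint` and `Toleration` classes and the `tolerates` function are illustrative names, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes treats a missing operator as Equal
    value: str = ""
    effect: str = ""  # an empty effect matches every taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    # The effect must match unless the toleration leaves it empty.
    if toleration.effect and toleration.effect != taint.effect:
        return False
    # An empty key is only valid with Exists and then matches every key.
    if toleration.key == "":
        return toleration.operator == "Exists"
    if toleration.key != taint.key:
        return False
    # Exists ignores the value; Equal requires the values to be identical.
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


# The two customer taints from the scenario above.
customer1_taint = Taint(key="customer", value="customer1", effect="NoExecute")
customer2_taint = Taint(key="customer", value="customer2", effect="NoExecute")

# Exists on the key alone versus Equal on key and value.
exists_tol = Toleration(key="customer", operator="Exists", effect="NoExecute")
equal_tol = Toleration(key="customer", operator="Equal",
                       value="customer1", effect="NoExecute")

print(tolerates(exists_tol, customer2_taint))  # Exists also tolerates customer2 nodes
print(tolerates(equal_tol, customer1_taint))   # Equal tolerates customer1 nodes
print(tolerates(equal_tol, customer2_taint))   # ...but not customer2 nodes
```

Note that a toleration only permits a pod to land on a tainted node; it does not by itself keep the pod off other nodes, so pinning a customer's pods to that customer's nodes would also need something like a node selector or node affinity.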

Describe alternatives you've considered

For now, we do not have multiple customers with data science project namespaces grouped per customer, so we schedule all notebooks on nodes with a given node taint key, e.g. key: opendatahub, using the existing mechanism in OdhDashboardConfig.

But going forward, moving from for-all configs to namespace-specific ones will become important, be it for tolerations or for things like linking all service accounts to an image pull secret, including the dynamically created ones for notebooks in data science projects.
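As a rough sketch of what namespace-specific settings with a cluster-wide fallback could look like, assuming a simple lookup keyed by namespace: the names `GLOBAL_TOLERATIONS`, `PROJECT_TOLERATIONS`, `tolerations_for` and `apply_to_pod_spec`, as well as the example namespaces, are hypothetical and not part of the dashboard. Only the `tolerations` list on the pod spec is a real Kubernetes field.

```python
from typing import Dict, List

TolerationDict = Dict[str, str]

# Hypothetical stand-in for today's global setting: one toleration for all notebooks.
GLOBAL_TOLERATIONS: List[TolerationDict] = [
    {"key": "opendatahub", "operator": "Exists", "effect": "NoSchedule"},
]

# Hypothetical per-project settings, keyed by Data Science Project namespace.
PROJECT_TOLERATIONS: Dict[str, List[TolerationDict]] = {
    "customer1-alice": [
        {"key": "customer", "operator": "Equal",
         "value": "customer1", "effect": "NoExecute"},
    ],
    "customer2-bob": [
        {"key": "customer", "operator": "Equal",
         "value": "customer2", "effect": "NoExecute"},
    ],
}


def tolerations_for(namespace: str) -> List[TolerationDict]:
    """Use the project's own tolerations, else fall back to the global ones."""
    return PROJECT_TOLERATIONS.get(namespace, GLOBAL_TOLERATIONS)


def apply_to_pod_spec(pod_spec: dict, namespace: str) -> dict:
    """Return a copy of pod_spec with the namespace's tolerations appended."""
    updated = dict(pod_spec)
    existing = list(pod_spec.get("tolerations", []))
    updated["tolerations"] = existing + [dict(t) for t in tolerations_for(namespace)]
    return updated


notebook_pod = apply_to_pod_spec({"containers": []}, "customer2-bob")
print(notebook_pod["tolerations"])
```

Falling back to the global list keeps the current behaviour for projects that have no settings of their own.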

Anything else?

No response

Gkrumbach07 commented 1 year ago

cc @andrewballantyne

bdattoma commented 11 months ago

Could this be applied to models as well? Maybe we could have a set of tolerations to allow models to be served on GPU nodes which are dedicated to serving by means of taints.

andrewballantyne commented 11 months ago

> Could this be applied to models as well? Maybe we could have a set of tolerations to allow models to be served on GPU nodes which are dedicated to serving by means of taints.

This is no longer the case when we talk about AcceleratorProfiles. I think 1.33 or 2.4 of RHOAI has Accelerator Profiles. Tolerations for GPU usage, so that you can effectively use taints, are already covered @bdattoma

This request is for allowing more flexibility in general tolerations for Notebooks (and, I imagine, for the whole set of DS Project resources in general, unrelated to GPUs or Accelerators).

andrewballantyne commented 11 months ago

I think this predates the UX flow. Moving to UX.

UX Context

I think we need to design a way to bring the NotebookTolerations cluster settings to the project so the user can manage their resources against tolerations. This may be more feasible with the added state in the admin view of Habana part 2 & the toleration modal. https://github.com/opendatahub-io/odh-dashboard/issues/1255

bdattoma commented 11 months ago

> This is no longer the case when we talk about AcceleratorProfiles. I think 1.33 or 2.4 of RHOAI has Accelerator Profiles. Tolerations for GPU usage, so that you can effectively use taints, are already covered @bdattoma

Is it possible to set a custom toleration for the accelerator, in case I don't want to use the default nvidia.com/gpu, which I think is automatically added when attaching the GPU profile?

andrewballantyne commented 11 months ago

> > This is no longer the case when we talk about AcceleratorProfiles. I think 1.33 or 2.4 of RHOAI has Accelerator Profiles. Tolerations for GPU usage, so that you can effectively use taints, are already covered @bdattoma
>
> Is it possible to set a custom toleration for the accelerator, in case I don't want to use the default nvidia.com/gpu, which I think is automatically added when attaching the GPU profile?

@bdattoma Yes it is -- when you create the AcceleratorProfile (or modify the one we create on migration) you can pick whatever tolerations you want and as many as you want. Our old world was a single static toleration, so we migrate with that -- but it is modifiable.

The Admin UI is coming in 2.6 I believe, and is currently in incubation if you want to check it out. The tracker: https://github.com/opendatahub-io/odh-dashboard/issues/1255
