nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Research: when are GPUs not accessible by other projects/users #639

Open schwesig opened 1 month ago

schwesig commented 1 month ago

An important thing to note is that we generate the billable invoices not by gathering metrics from actual GPU usage, but from the existence of a pod with a GPU request > 0. Would a pod in an error state that never manages to execute its code, but runs on a node with a GPU, count as utilizing the GPU? https://github.com/nerc-project/operations/issues/635#issuecomment-2214987530
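For context, "a GPU request > 0" just means a pod whose spec claims a GPU resource. A minimal sketch using the Kubernetes Python client (the namespace, image, and the nvidia.com/gpu resource name are illustrative assumptions, not taken from the cluster config):

```python
from kubernetes import client

# For extended resources such as GPUs, the request is taken from the limit,
# so this pod counts as "a pod with a GPU request > 0" for scheduling and,
# per the quote above, for billing.
gpu_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="example-gpu-pod", namespace="some-project"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="main",
                image="nvcr.io/nvidia/cuda:12.4.1-base-ubi9",  # illustrative image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)
```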

This topic captures two aspects:

To discuss/define:

  1. Thought: a GPU that is "blocked" and cannot be used by any other project cannot generate "income" from another project; therefore it could be perfectly OK to bill it to the current project.
  2. Question, to check: does a request/allocation (whether used, not used, or in an error state) block this GPU from being used by another project?
  3. Decision, to make: if a GPU is blocked from being used by others, then we have 3 cases to evaluate (one way to encode them is sketched at the end of this comment):
     3.1. used: billable
     3.2. not used/idle: inform the project (warning), but an apartment rented and not used still costs rent :-)
     3.3. error state
          3.3.1. error by project: most likely billable, like the "not used" case; inform the project (warning)
          3.3.2. error by NERC: most likely not billable, not the project's fault

Originally posted by @schwesig in https://github.com/nerc-project/operations/issues/635#issuecomment-2216831790

/CC @msdisme @knikolla @naved001 @joachimweyl @hpdempsey
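One way to make the three cases above concrete is a small classification function. This is only a sketch of the proposed decision tree; nothing here has been decided:

```python
def billing_outcome(gpu_blocked: bool, state: str, error_source: str = "") -> str:
    """Tentative mapping of the cases in the list above to a billing outcome."""
    if not gpu_blocked:
        return "not billable"                       # GPU stays usable by other projects
    if state == "used":
        return "billable"                           # case 3.1
    if state == "idle":
        return "billable, warn project"             # case 3.2: rented but unused still costs rent
    if state == "error":
        if error_source == "project":
            return "likely billable, warn project"  # case 3.3.1
        return "likely not billable"                # case 3.3.2: error caused by NERC
    return "undecided"


# Example: an error-state pod that still blocks the GPU, caused by the project itself.
print(billing_outcome(gpu_blocked=True, state="error", error_source="project"))
```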

schwesig commented 1 month ago

Note: Maybe I am overcomplicating things; if so, please hold me back. But the rules for which resources can be reclaimed, when, and how (e.g., when they are idle) do not seem trivial to me, and neither does how to inform the projects about it.

schwesig commented 1 month ago

From Wednesday's meeting: planned to be discussed in one of the next meetings, e.g. when Kim and Naved are back. But the technical part could be checked/researched already; it might give a quick direction: if we cannot free an allocated GPU, then the possible options for this billing issue are reduced.

naved001 commented 1 month ago

The Prometheus query that gathers pods using GPUs specifically looks for pods that have been assigned a node to run on. This means pods that are in the Pending state are not counted.

https://github.com/CCI-MOC/openshift-usage-scripts/blob/main/openshift_metrics/openshift_prometheus_metrics.py#L28-L30
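A minimal sketch of that kind of check against the Prometheus HTTP API. The endpoint is hypothetical and the kube_pod_resource_request metric/label names are assumptions here; the authoritative query is in the linked script:

```python
import requests

PROM_URL = "https://prometheus.example.com"  # hypothetical endpoint

# Only pods that already have a node assigned (node != "") match,
# so Pending pods with a GPU request are not counted.
QUERY = 'kube_pod_resource_request{resource="nvidia.com/gpu", node!=""} > 0'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    print(labels.get("namespace"), labels.get("pod"), labels.get("node"), result["value"][1])
```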

My understanding is that once a pod is scheduled, the resources are claimed. From the Kubernetes documentation [1]:

The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails.

[1] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-requests-are-scheduled
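To illustrate why a claimed-but-idle GPU still blocks others, here is a toy model of the capacity check described in the quote (the numbers are made up):

```python
# The scheduler only compares requests against allocatable capacity;
# real utilization plays no role in this check.
def fits(node_allocatable_gpus: int, scheduled_gpu_requests: list[int], new_request: int) -> bool:
    return sum(scheduled_gpu_requests) + new_request <= node_allocatable_gpus

# A 4-GPU node where one pod already requested 3 GPUs:
print(fits(4, [3], 2))  # False: refused even if those 3 GPUs are sitting idle
print(fits(4, [3], 1))  # True
```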

joachimweyl commented 1 month ago

I would say that as soon as a GPU is not usable by others, it should be charged for. That being said, if it is not clear to a user that they have a GPU claimed, we should make sure there is a way to alert them to this fact.
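One possible way to drive such an alert would be to compare GPU claims against measured utilization. A rough sketch, assuming a DCGM exporter is running and that the metric/label names below (DCGM_FI_DEV_GPU_UTIL, exported_pod, kube_pod_resource_request) match the cluster's setup:

```python
import requests

PROM_URL = "https://prometheus.example.com"  # hypothetical endpoint

def instant_query(query: str):
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Pods currently holding a GPU claim (scheduled, request > 0).
claims = instant_query('kube_pod_resource_request{resource="nvidia.com/gpu", node!=""} > 0')
# Per-pod average GPU utilization over the last day, as reported by the DCGM exporter.
util = instant_query('avg by (exported_pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1d]))')

busy_pods = {r["metric"].get("exported_pod") for r in util if float(r["value"][1]) > 1}
for r in claims:
    namespace, pod = r["metric"].get("namespace"), r["metric"].get("pod")
    if pod not in busy_pods:
        print(f"claimed but idle GPU: {namespace}/{pod}")  # candidate for a warning to the project
```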