nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

To be decided on: How to calculate GPUs when sliced #643

Open schwesig opened 4 months ago

schwesig commented 4 months ago

This story still needs more details, feedback, and research; this issue is a first approach to keep it in mind. Please comment / give feedback.

This issue should create awareness of:

  1. Usage of GPUs needs to be measured/judged in the context of the slicing
     - 1.1. e.g. 50% usage of a full A100 GPU (7g.40gb) means something different than 50% of a 1/4-MIG-sliced GPU (2g.10gb)
  2. The cost overhead
     - 2.1. What cost structure is the basis (do we lease by board, by GPU, or by sliced GPU?)
     - 2.2. Because: when we MIG-slice a GPU, we create some unusable overhead
       - 2.2.1. e.g. a full A100-40GB has 7g.40gb
       - 2.2.2. a 2× MIG-sliced A100-40GB has 2× 3g.20gb (6g & 40gb → −1g vs. full)
       - 2.2.3. a 4× MIG-sliced A100-40GB has only 3× 2g.10gb (6g & 30gb → −1g & −10gb vs. full)
       - 2.2.4. an 8× MIG-sliced A100-40GB has 7× 1g.5gb (7g & 35gb → −5gb vs. full)

this makes a 2g.10gb more expensive than 1/4 of a 7g.40gb


/CC @hpdempsey @msdisme @joachimweyl

schwesig commented 3 months ago

When we talked about it and gathered some more ideas, we realized this may need to be split into two issues later: one for the metrics idea, one for the cost allocation.

msdisme commented 3 months ago

Grooming discussion July 17:

schwesig commented 2 months ago

https://stackoverflow.com/questions/78653544/why-use-mps-time-slicing-or-mig-if-nvidias-defaults-have-better-performance

schwesig commented 2 months ago

https://raw.githubusercontent.com/nebuly-ai/nos/main/docs/en/docs/dynamic-gpu-partitioning/partitioning-modes-comparison.md

| Partitioning mode | Supported by nos | Workload isolation level | Pros | Cons |
|---|---|---|---|---|
| Multi-instance GPU (MIG) | | Best | • Processes are executed in parallel<br>• Full isolation (dedicated memory and compute resources) | • Supported by fewer GPU models (only Ampere or more recent architectures)<br>• Coarse-grained control over memory and compute resources |
| Multi-process server (MPS) | | Medium | • Processes are executed in parallel<br>• Fine-grained control over memory and compute resource allocation | • No error isolation and memory protection |
| Time-slicing | | None | • Processes are executed concurrently<br>• Supported by older GPU architectures (Pascal or newer) | • No resource limits<br>• No memory isolation<br>• Lower performance due to context-switching overhead |

"nos is the open-source module to efficiently run AI workloads on Kubernetes, increasing GPU utilization, cutting down infrastructure costs and improving workloads performance."

schwesig commented 2 months ago

https://www.infracloud.io/blogs/gpu-sharing-techniques-guide-vgpu-mig-time-slicing/

schwesig commented 2 months ago

https://github.com/nebuly-ai/nos/tree/main/demos/gpu-sharing-comparison

hpdempsey commented 2 months ago

We tested MIG slicing as a capability, but it is not offered yet as a service. It is not clear to me at all that any of our existing or forecast projects want anything less than a GPU dedicated to them. All the Red Hat requests so far are for multiple full GPUs. I don't know what is on the horizon for requests coming from BU or other academic users. Can @msdisme provide some kind of forecast for this? This will help us decide the priority of the work.

I suspect the work has to be done for each type of GPU that we are going to support. Based on the rough info in this issue so far, the observability and charging work could be quite significant. As @schwesig indicated, let's break this up: first the work to reflect GPU usage by project in observability, and the billing effort later, because we need the former even if we don't pursue "sliced" billing models later. Having a forecast of how many projects we believe will be satisfied with sliced GPUs, and how this will affect the MOC's charges (reducing them significantly from the current preemption-based, per-24-hour GPU billing policy), is necessary to pursue the second batch of work efficiently.

Can we link this issue to the issue for adding GPU usage/billing for the dedicated bare-metal GPU case (allocated through ESI), which seems to be in highest demand currently? There is no MIG option for the bare-metal case, so ultimately observability and billing will need to cover both cases. (If there isn't currently an issue for the bare-metal GPU case, please create one.)

schwesig commented 1 month ago

I found something today, maybe for some ideas.