project-codeflare / multi-cluster-app-dispatcher

Holistic job manager on Kubernetes
Apache License 2.0

[Feature request] Expose cluster available capacity for deployment sizing #285

Open yuanchi2807 opened 1 year ago

yuanchi2807 commented 1 year ago

Application developers using scale-out runtimes such as Ray struggle to size and configure deployments to fit the currently available capacity. For example, a total of 16 GPUs and 320 GB of memory is desired, with the flexibility to run 16 pods of 1 GPU/20 GB, 8 pods of 2 GPUs/40 GB, or 2 pods of 8 GPUs/160 GB each.
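
To make the arithmetic concrete, here is a minimal sketch that enumerates the pod shapes adding up to a fixed total. The function name `pod_shapes` and the `max_gpus_per_pod` limit are made up for illustration and are not part of MCAD or Ray.

```python
def pod_shapes(total_gpus, total_mem_gb, max_gpus_per_pod=8):
    """Return (pod_count, gpus_per_pod, mem_gb_per_pod) tuples whose
    totals add up exactly to the requested GPUs and memory."""
    shapes = []
    for gpus_per_pod in range(1, max_gpus_per_pod + 1):
        # Skip sizes that do not divide the GPU total evenly.
        if total_gpus % gpus_per_pod:
            continue
        pods = total_gpus // gpus_per_pod
        # Memory must also split evenly across the pods.
        if total_mem_gb % pods:
            continue
        shapes.append((pods, gpus_per_pod, total_mem_gb // pods))
    return shapes


shapes = pod_shapes(16, 320)
for pods, gpus, mem in shapes:
    print(f"{pods} pods of {gpus} GPU / {mem} GB")
```

For 16 GPUs and 320 GB this yields the three shapes listed above plus 4 pods of 4 GPUs/80 GB, which is equally valid.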

Currently, at job submission, there is no availability information to determine which combination(s) can launch immediately. This feature request asks MCAD to expose an API for querying available capacity so that the application can attempt a feasible deployment configuration. Further, in high-contention workload scenarios it is highly desirable to reserve resources once a feasible configuration is identified.
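
As a rough sketch of what the application would do with such information: given per-node free capacity (the `free_nodes` data below is invented, and no such MCAD API exists today), it could check which shapes fit right now. Because all pods in a shape are identical, counting how many fit on each node is enough.

```python
def fits_now(shape, free_nodes):
    """Return True if every pod of the shape can be placed on the
    currently free capacity. shape is (pods, gpus_per_pod, mem_gb_per_pod)."""
    pods, gpus, mem_gb = shape
    # A node can host as many pods as its scarcer resource allows.
    slots = sum(min(n["gpus"] // gpus, n["mem_gb"] // mem_gb) for n in free_nodes)
    return slots >= pods


# Hypothetical snapshot of free capacity per node (assumed data).
free_nodes = [
    {"gpus": 8, "mem_gb": 200},
    {"gpus": 4, "mem_gb": 100},
    {"gpus": 4, "mem_gb": 100},
]

# The three shapes from the example above.
candidates = [(16, 1, 20), (8, 2, 40), (2, 8, 160)]
feasible = [s for s in candidates if fits_now(s, free_nodes)]
print(feasible)
```

With this snapshot the 16x1 and 8x2 shapes fit, while 2x8 does not because only one node has 8 free GPUs. The snapshot can go stale between the query and the submission, which is why a reservation step matters under contention.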

CC @starpit @asm582 @klwuibm

accorvin commented 1 year ago

We discussed this in our MCAD developer sync today. Some takeaways:

  1. We don't think it makes sense to put this capability directly into MCAD itself. Kubernetes should be the source of truth for this information. We expect to provide information on available cluster capacity in the Open Data Hub dashboard alongside MCAD queue information to make it easy for an MCAD user to get this information.

  2. We think we should update InstaScale to set a status field in an MCAD AppWrapper based on cluster scale up events/decisions. This would allow you to see that, for a given AppWrapper, the decision has been made to scale up the cluster and then get insight into the status of this scale up. We will plan to implement this, but we need to do some further design work to figure out exactly how that should work.
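
For reference, the capacity view mentioned in item 1 can be derived from Kubernetes data as node allocatable resources minus the requests of pods that are scheduled and not yet finished. Below is a minimal sketch of that calculation over plain dictionaries. The field names (`gpus`, `mem_gb`, `node`, `phase`) are simplified stand-ins chosen for illustration; a real version would read nodes and pods from the Kubernetes API and parse quantity strings such as `20Gi`.

```python
from collections import defaultdict


def free_capacity(nodes, pods):
    """Return free GPUs and memory per node: allocatable minus the
    requests of pods that are bound to the node and not terminated."""
    used = defaultdict(lambda: {"gpus": 0, "mem_gb": 0})
    for pod in pods:
        # Finished pods and unscheduled pods hold no node resources.
        if pod["phase"] in ("Succeeded", "Failed") or not pod.get("node"):
            continue
        used[pod["node"]]["gpus"] += pod["gpus"]
        used[pod["node"]]["mem_gb"] += pod["mem_gb"]
    return {
        name: {
            "gpus": alloc["gpus"] - used[name]["gpus"],
            "mem_gb": alloc["mem_gb"] - used[name]["mem_gb"],
        }
        for name, alloc in nodes.items()
    }


# Assumed example data, not taken from a real cluster.
nodes = {
    "node-a": {"gpus": 8, "mem_gb": 256},
    "node-b": {"gpus": 8, "mem_gb": 256},
}
pods = [
    {"node": "node-a", "phase": "Running", "gpus": 2, "mem_gb": 40},
    {"node": "node-a", "phase": "Succeeded", "gpus": 4, "mem_gb": 80},
    {"node": None, "phase": "Pending", "gpus": 1, "mem_gb": 20},
]
free = free_capacity(nodes, pods)
print(free)
```

Only the running pod counts against `node-a` here; the completed pod and the unscheduled pending pod are ignored.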

@yuanchi2807 if we had these two things, do you think we'd solve your problem here?

yuanchi2807 commented 1 year ago

Yes. Letting the job submitter check available capacity and resize the deployment configuration accordingly would remove the guesswork.

CC @starpit @klwuibm

accorvin commented 1 year ago

Great. We'll keep you updated as we work on implementing these things.

asm582 commented 1 year ago

FYI @dimakis @anishasthana another metrics requirement