volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0
4.13k stars, 953 forks

Handle coscheduling with cluster-autoscaler #691

Closed groszewn closed 2 years ago

groszewn commented 4 years ago

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:

As referenced in the docs, the scheduler doesn't currently work well when the cluster-autoscaler is used. The exception thrown is:

Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.0.0.1/api/v1/namespaces/testnamespace/pods. Message: admission webhook "validatepod.volcano.sh" denied the request: failed to create pod <testnamespace/test-job-driver> as the podgroup phase is Pending. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=admission webhook "validatepod.volcano.sh" denied the request: failed to create pod <testnamespace/test-job-driver> as the podgroup phase is Pending, metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).

As can be seen in the exception, the cluster has not yet scaled to meet the resource requests to move the podgroup past the Pending phase.

What you expected to happen:

Ideally, the scheduler would be aware that the cluster is autoscaling and hold off on throwing an exception.

How to reproduce it (as minimally and precisely as possible):

Since this exception occurs while the cluster is autoscaling, whether a submission fails depends on how many resources are currently available in your cluster and whether it needs to scale up to handle the requested resources. Any manifest that uses the Volcano batch scheduler and requires a cluster-autoscaling event will likely throw the above error.
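Until the scheduler itself holds off, one client-side option is to retry the submission while the webhook keeps reporting a Pending podgroup. The following is a minimal sketch under stated assumptions, not something proposed in this thread: `submit` is a hypothetical callable that creates the pod and raises on failure, and the denial is recognized only by matching the message text from the exception above. Retrying helps only if the podgroup eventually leaves Pending, which depends on the autoscaler actually adding capacity.

```python
import time

# Substring taken from the webhook denial message quoted in this issue.
PENDING_MARKER = "podgroup phase is Pending"


def is_podgroup_pending_denial(message: str) -> bool:
    """Return True if the error text looks like the Pending-podgroup denial."""
    return "denied the request" in message and PENDING_MARKER in message


def submit_with_retry(submit, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call submit() and retry with exponential backoff on the Pending denial.

    `submit` is a hypothetical zero-argument callable (an assumption of this
    sketch). Any other error, or the last failed attempt, is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return submit()
        except Exception as exc:
            is_last = attempt == attempts - 1
            if is_last or not is_podgroup_pending_denial(str(exc)):
                raise
            # Wait 1x, 2x, 4x, ... base_delay before trying again.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable only so the backoff can be exercised without real waiting.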

k82cn commented 4 years ago

/cc @thandayuthapani

k82cn commented 4 years ago

/assign

groszewn commented 4 years ago

Hey @k82cn, just wanted to circle back and see if there has been any shift in prioritization on this?

stale[bot] commented 4 years ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] commented 3 years ago

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

SkinyMonkey commented 3 years ago

Hi, any update on this?

This would determine whether we use Volcano or not.

k82cn commented 3 years ago

> Hi, any update on this?
>
> This would determine whether we use Volcano or not.

What's your scenario?

SkinyMonkey commented 3 years ago

I would like to be able to use Volcano but allow the autoscaler to kick in when too many jobs are waiting.

I'm not sure whether this feature exists yet or is even possible.

As this ticket was closed without a PR, I assume the original problem wasn't examined?

Also, what would happen if I had Volcano in place plus the autoscaler, with the node count set to 0?

We'd like to be able to shut down our machines when they're not in use, on the weekend for example. The autoscaler can provide that, but Azure itself does not allow scheduling timed shutdowns on k8s clusters.

So if we were to use Volcano but could not use the autoscaler, that might be a blocker.

bowenli86 commented 3 years ago

Hi @k82cn @kevin-wangzefeng , do you have any update on this?

bowenli86 commented 3 years ago

I wonder, since Volcano doesn't support autoscaling, how does Huawei handle autoscaling requirements? Is there any workaround you can share?

k82cn commented 3 years ago

> Also what would happen if I had volcano in place + autoscaler and nodes number set to 0?

@SkinyMonkey, Volcano can work with autoscaling. If the node count is set to 0, no pod will be scheduled :) This issue is about gang-scheduling with autoscaling, which is one feature of Volcano.

> since Volcano doesn't support autoscaling

Volcano can work with autoscaling. This issue is about how gang-scheduling/co-scheduling works with autoscaling :)
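To make the gang-scheduling side of that interaction concrete, here is a toy model, not Volcano's actual algorithm. It assumes CPU is the only resource, measured in integer millicores, and that every pod in the gang is identical; the function names are hypothetical. It shows the all-or-nothing check (the gang fits only if the minimum number of members can all be placed) and how many extra nodes would have to appear before that check passes.

```python
import math


def gang_fits(min_member, per_pod_cpu, free_cpu_per_node):
    """Return True if at least min_member identical pods fit on the nodes.

    Toy model: CPU only, integer millicores, no affinity or other constraints.
    """
    slots = sum(free // per_pod_cpu for free in free_cpu_per_node)
    return slots >= min_member


def nodes_to_add(min_member, per_pod_cpu, free_cpu_per_node, new_node_cpu):
    """Return how many empty nodes of size new_node_cpu the gang still needs."""
    pods_per_new_node = new_node_cpu // per_pod_cpu
    if pods_per_new_node == 0:
        raise ValueError("a new node cannot hold even one pod")
    slots = sum(free // per_pod_cpu for free in free_cpu_per_node)
    missing = max(0, min_member - slots)
    return math.ceil(missing / pods_per_new_node)


if __name__ == "__main__":
    free = [2000, 1000]  # two nodes with 2 and 1 free cores
    print(gang_fits(4, 1000, free))           # gang of 4 one-core pods
    print(nodes_to_add(4, 1000, free, 4000))  # new nodes have 4 cores each
```

In this model a gang of four one-core pods does not fit on three free cores, so nothing in the gang runs until at least one node is added.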

brickyard commented 3 years ago

@k82cn Seems like a pretty common use-case to use cluster-autoscaler in cloud environments to minimize cloud cost waste and also take advantage of gang-scheduling/co-scheduling. I am also running into this error in the same way as the OP using Spark.

Do you have any update or workaround on this? Much appreciated.

k82cn commented 3 years ago

@brickyard, we're investigating this requirement; it's interesting and important for us.

aleclerc-sonrai commented 3 years ago

I had the idea to create ‘over-provisioned’ pods with a very low priorityclass at the same time as submitting my Spark job. My thought was that Volcano would be smart enough to pre-empt those pods and schedule the job, as it would technically have enough resources to fulfil the job, but that doesn’t seem to be the case. Is it accurate that Volcano wouldn’t consider this when deciding whether or not it can schedule a job? Has anyone gone down this road yet?
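For readers unfamiliar with the over-provisioning idea described above, the sketch below builds the two manifests it usually involves: a low-value PriorityClass and a Deployment of placeholder pods that reserve capacity. This is only an illustration under assumptions: the object name, the priority value of -1, and the pause image tag are choices made for this example, not values from this thread. Whether Volcano pre-empts such pods when admitting a podgroup is exactly the open question in the comment above; this code does not answer it.

```python
import json


def overprovisioning_manifests(replicas, cpu, memory,
                               name="overprovisioning", priority=-1):
    """Build a PriorityClass and a placeholder Deployment as plain dicts.

    All names and defaults are assumptions made for this sketch.
    """
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": priority,  # lower than the default priority of 0
        "globalDefault": False,
        "description": "Placeholder pods that real workloads may preempt.",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "priorityClassName": name,
                    "containers": [{
                        "name": "pause",
                        # Image tag is an assumption; any idle container works.
                        "image": "registry.k8s.io/pause:3.9",
                        "resources": {
                            "requests": {"cpu": cpu, "memory": memory},
                        },
                    }],
                },
            },
        },
    }
    return [priority_class, deployment]


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    for manifest in overprovisioning_manifests(2, "1", "1Gi"):
        print(json.dumps(manifest, indent=2))
```

Each placeholder pod requests the given CPU and memory, so the reserved headroom is roughly `replicas` times the per-pod request.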

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] commented 3 years ago

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

Thor-wl commented 2 years ago

/cc @qiankunli

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

d4l3k commented 2 years ago

edit: this is incorrect

Is there any update on this? It would be nice if the overcommit plugin could be extended to handle this case. I'm running into this issue trying to launch TPU-based jobs on GKE.

https://cloud.google.com/tpu/docs/kubernetes-engine-setup

spec:
  containers:
  - name: example-container
    resources:
      limits:
        cloud-tpus.google.com/v2: 8

Volcano is refusing to schedule these jobs since there are no hosts with TPUs available, but there are no such hosts because no pods are scheduled. There doesn't seem to be any way to force GKE to allocate TPU hosts.

d4l3k commented 2 years ago

Just spent some more time here and this is actually incorrect. Volcano can schedule TPU jobs without issue; I was just missing an annotation.

Without the annotation, the job just gets stuck without any errors:

  annotations:
     tf-version.cloud-tpus.google.com: "2.6.0"
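Since the missing annotation fails silently, a small pre-submission check can catch it. The sketch below is a hypothetical helper, not part of Volcano or GKE tooling. It assumes the pod manifest has already been loaded into a Python dict, and it uses only the resource prefix and annotation key shown in the comments above.

```python
# Both strings come from the manifests quoted in this thread.
TPU_RESOURCE_PREFIX = "cloud-tpus.google.com/"
TF_VERSION_ANNOTATION = "tf-version.cloud-tpus.google.com"


def requests_tpu(pod):
    """Return True if any container has a cloud-tpus.google.com/* limit."""
    containers = pod.get("spec", {}).get("containers", [])
    for container in containers:
        limits = container.get("resources", {}).get("limits", {})
        if any(key.startswith(TPU_RESOURCE_PREFIX) for key in limits):
            return True
    return False


def missing_tpu_annotation(pod):
    """Return True if the pod asks for TPUs but lacks the tf-version annotation."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    return requests_tpu(pod) and TF_VERSION_ANNOTATION not in annotations


if __name__ == "__main__":
    pod = {
        "metadata": {},
        "spec": {"containers": [{
            "name": "example-container",
            "resources": {"limits": {"cloud-tpus.google.com/v2": 8}},
        }]},
    }
    if missing_tpu_annotation(pod):
        print("TPU pod is missing the", TF_VERSION_ANNOTATION, "annotation")
```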

stale[bot] commented 2 years ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] commented 2 years ago

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

d4l3k commented 1 year ago

@Thor-wl are there any plans to improve this?

anovv commented 1 year ago

@Thor-wl @k82cn can you please give an update on this? Does gang-scheduling work with cluster-autoscaler? We are exploring if Volcano is a good solution for us

DmitriGekhtman commented 4 months ago

FYI, I think it might have been fixed here: https://github.com/volcano-sh/volcano/pull/2602