kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Starting multiple Spark jobs in close succession can lead to a deadlock due to resource exhaustion #883

Open jgoeres opened 4 years ago

jgoeres commented 4 years ago

This is probably mostly the same as https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/594, which has since been marked as closed; however, the solution described there doesn't seem to work for me.

To quickly describe the issue: assume you are launching multiple Spark applications in rapid succession that in total will not "fit" into your cluster, because the combined CPU or memory requests of all driver and executor pods exceed the available resources. In a perfect world, a few of these applications would be scheduled completely (i.e., driver AND executors), eventually finish, and thus free the resources for the remaining Spark applications (a sort of self-organizing queuing).

Consider this minimal example: assume you have two spark jobs, each with a driver requesting one CPU core, and with an executor requesting two cores. Assume you have a little bit over 3 cores available on your single worker node (just think "minikube" ;-)).

If both Spark applications are submitted at almost the same time, both drivers will be scheduled and eventually reach running state, leaving roughly one available core in our one-node example cluster. Both drivers will then try to start their single executor, and both executors will end up in "Pending" state. Without external intervention, the cluster will not resolve this deadlock on its own.
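To make this concrete, a minimal sketch of one of the two applications could look roughly like this (the image, jar, and names are just placeholders, not my actual workload):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: job-a                  # the second job ("job-b") would look the same
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v2.4.5
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar
  sparkVersion: "2.4.5"
  driver:
    cores: 1                   # one core per driver
    memory: 512m
    serviceAccount: spark
  executor:
    instances: 1
    cores: 2                   # two cores for the single executor
    memory: 512m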

What I very naively hoped was that using the Volcano integration would add the driver and executor pods to the same "PodGroup", so that they are considered as one unit for scheduling. In my example above, the first Spark job would then start completely (i.e., driver and executor), and only after it completes would the second one start, circumventing the deadlock. Alas, this is not what I am observing. While the annotation required to assign driver and executor to the same PodGroup does seem to be set, and while Volcano appears to be working (I can see the corresponding pod events), the deadlock still occurs.

From what I understand, the problem is that when the drivers are launched and added to the PodGroup, Volcano cannot know that there is "more to come" for that pod group (because only after the driver is scheduled and running will it start the executor pod(s)).

Maybe I am using Volcano incorrectly (I followed the instructions on this page https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md and from what I can tell it is working). I also tried the "queue" batch scheduler option, with the same result. Any hints appreciated.
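For reference, the only Volcano-specific part of my spec is roughly this (the queue name is just an example):

spec:
  batchScheduler: volcano          # let Volcano schedule the driver and executor pods
  batchSchedulerOptions:
    queue: default                 # the Volcano queue the generated PodGroup is placed in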

BTW: I am using spark-operator 1.1.0 with Spark 2.4.5

k82cn commented 4 years ago

There's a feature named "min resources" in Volcano, which helps to reserve resources for a Spark job. As Volcano does not know how many resources should be reserved, it's better to let the operator or the user set that value :)
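For example, a PodGroup that reserves the full footprint of one of the jobs above up front might look roughly like this (just a sketch with a made-up name and memory value; normally the operator generates this object):

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-job-a-pg          # hypothetical name
spec:
  minMember: 2                  # driver + one executor must be schedulable together
  minResources:
    cpu: "3"                    # 1 core (driver) + 2 cores (executor)
    memory: "1Gi"               # example value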

sairamankumar2 commented 4 years ago

Hi, I am looking to become a contributor to this project and I am interested in working on this issue. If someone could guide me on how to become a contributor and make improvements on this issue, that would be really helpful.

praveen-ag commented 4 years ago

@k82cn can you help me with an example of the min resources feature? I've installed Volcano on my cluster and followed the instructions to run a SparkApplication on a given Volcano queue as mentioned here, but I'm not sure where to specify the min resources option. I've also looked into the Volcano documentation, and I'm still not sure how this works.

Would greatly appreciate any help with this.

jgoeres commented 4 years ago

@k82cn @praveen-ag Yes, a bit of information on where and how to set the min resources would be needed here. I had a quick look through the SparkApplication CRD but found nothing that would allow setting the resources for the whole Spark application. This is presumably what Volcano needs to know: since the driver and the executors are not launched by the same instance or at the same time, it cannot determine the total just by looking at the sums of the driver and executor requests and limits.
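The closest thing I could imagine is a per-application resource hint next to the queue setting, something like the sketch below, assuming a batchSchedulerOptions.resources field that gets mapped onto the PodGroup's minResources (I did not find such a field in the 1.1.0 CRD, so treat the field name as speculative):

spec:
  batchScheduler: volcano
  batchSchedulerOptions:
    queue: default
    resources:                   # speculative: total for the driver + all executors of this application
      cpu: "3"
      memory: "1Gi"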

That aside, we have meanwhile "solved" (= worked around) this problem in a different way: we are running this on EKS, so we simply created a second node group with node autoscaling via the cluster autoscaler and assign all Spark pods to that node group. Works great. Still, we might eventually need to run this on K8s clusters without node autoscaling, so actual deadlock prevention (which is what the Volcano support seems to promise) would be great.
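For completeness, pinning the Spark pods to that node group is nothing special, roughly this, with example labels (not our actual EKS labels):

spec:
  driver:
    nodeSelector:
      workload: spark            # example label carried by the autoscaled node group
  executor:
    nodeSelector:
      workload: spark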

praveen-ag commented 4 years ago

It seems that minResources for the PodGroup generated by the SparkOperator is being picked as the sum of driver and executor resource requests that we make in the SparkApplication spec.

@k82cn I need your help in understanding this.

kubectl describe pg <pg_name> doesn't show the value for minResources

Name:         spark-volcano
Namespace:    dev
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"scheduling.volcano.sh/v1beta1","kind":"PodGroup","metadata":{"annotations":{},"name":"podgroup-cbfa8b7e-4094-4c1d-a89f-fa9f88214c35","namespace":"d...
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2020-06-18T16:43:16Z
  Generation:          89
  Resource Version:    86732802
  Self Link:           /apis/scheduling.volcano.sh/v1beta1/namespaces/dev/podgroups/podgroup-cbfa8b7e-4094-4c1d-a89f-fa9f88214c35
  UID:                 2bcf3b60-5a94-4952-9a28-5b63a60e589f
Spec:
  Min Member:  1
Status:
  Failed:  1
  Phase:   Inqueue
Events:
  Type     Reason         Age                 From     Message
  ----     ------         ----                ----     -------
  Warning  Unschedulable  86s (x3 over 88s)   volcano  0/1 tasks in gang unschedulable: pod group is not ready, 1 Running, 3 minAvailable.

When I tried to create a PG using the following manifest:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-volcano
spec:
  minMember: 1
  minResources: 3

I got the following error: E0618 12:39:24.163194 1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to list *v1beta1.PodGroup: v1beta1.PodGroupList.Items: []v1beta1.PodGroup: v1beta1.PodGroup.v1beta1.PodGroup.Spec: v1beta1.PodGroupSpec.MinResources: ReadMapCB: expect { or n, but found 3, error found in #10 byte of ...|sources":3}},{"apiVe|..., bigger context ...|7779d717f"},"spec":{"minMember":3,"minResources":3}},{"apiVersion":"scheduling.volcano.sh/v1beta1","|...
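Judging from the "expect { or n" part, minResources apparently has to be a map of resource name to quantity rather than a plain number, so the manifest would presumably have to look more like this (still assuming the installed CRD actually defines the field):

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-volcano
spec:
  minMember: 1
  minResources:
    cpu: "3"                     # map of resource name -> quantity, not a scalar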

I had a look at the CRD definition for the PodGroup in the Volcano repo and found that the CRD doesn't seem to define minResources. That installer file is what we used for setting up Volcano in our k8s cluster. Perhaps that is why we are unable to create a PG with minResources, and why the PodGroup generated by the operator doesn't show it either.

VlIsHere commented 1 year ago

I'm stuck with almost the same problem as in this issue. I don't understand the logic and found only this discussion as an explanation: https://groups.google.com/g/kubernetes-sig-scheduling/c/OAriEmhw448. Can anyone tell me how this works? Why doesn't gang scheduling work in Volcano, and how can I avoid driver & executor creation when there are no resources in my node group?

github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.