palantir / k8s-spark-scheduler

A Kubernetes Scheduler Extender to provide gang scheduling support for Spark on Kubernetes
Apache License 2.0

Pods scheduled stuck in Pending state #150

Open PeteW opened 4 years ago

PeteW commented 4 years ago

I'm attempting to run spark-thriftserver using this scheduler extender. If you're not familiar, spark-thriftserver runs in client mode (local driver, remote executors). The Thrift server exposes a JDBC endpoint that accepts queries and turns them into Spark jobs.

The command to run this looks like:

/opt/spark/sbin/start-thriftserver.sh \
  --conf spark.master=k8s://https://my-EKS-server:443 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=my-image \
  --conf spark.kubernetes.file.upload.path=file:///tmp \
  --conf spark.app.name=sparkthriftserver \
  --conf spark.kubernetes.executor.podTemplateFile=/path/to/executor.template \
  --verbose

spark-defaults.conf looks like:

spark.sql.catalogImplementation hive
spark.kubernetes.allocation.batch.size 5
spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 30s
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 50

So far, I've applied the extender.yaml file as-is, without any modifications. This instantiates two new pods under the spark namespace, both in Running state, with names starting with "spark-scheduler-". Running kubectl describe pod on them yields some troubling information:

Events:
  Type     Reason     Age                From                                             Message
  ----     ------     ----               ----                                             -------
  Normal   Scheduled  15m                fargate-scheduler                                Successfully assigned spark/spark-scheduler-7bbb5bb979-fhktn to fargate-ip-XXX-XXX-XXX-XXX.ec2.internal
  Normal   Pulling    15m                kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Pulling image "gcr.io/google_containers/hyperkube:v1.13.1"
  Normal   Pulled     15m                kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Successfully pulled image "gcr.io/google_containers/hyperkube:v1.13.1"
  Normal   Created    14m                kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Created container kube-scheduler
  Normal   Started    14m                kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Started container kube-scheduler
  Normal   Pulling    14m                kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Pulling image "palantirtechnologies/spark-scheduler:latest"
  Normal   Pulled     14m                kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Successfully pulled image "palantirtechnologies/spark-scheduler:latest"
  Warning  Unhealthy  14m (x3 over 14m)  kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Liveness probe failed: Get https://XXX.XXX.XXX.XXX:8484/spark-scheduler/status/liveness: dial tcp XXX.XXX.XXX.XXX:8484: connect: connection refused
  Normal   Killing    14m                kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Container spark-scheduler-extender failed liveness probe, will be restarted
  Normal   Created    14m (x2 over 14m)  kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Created container spark-scheduler-extender
  Normal   Pulled     14m                kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Container image "palantirtechnologies/spark-scheduler:latest" already present on machine
  Normal   Started    14m (x2 over 14m)  kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Started container spark-scheduler-extender
  Warning  Unhealthy  14m (x4 over 14m)  kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Readiness probe failed: Get https://XXX.XXX.XXX.XXX:8484/spark-scheduler/status/readiness: dial tcp XXX.XXX.XXX.XXX:8484: connect: connection refused
  Warning  Unhealthy  14m                kubelet, fargate-ip-XXX-XXX-XXX-XXX.ec2.internal  Liveness probe failed: HTTP probe failed with statuscode: 503
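
In case it's useful, this is how I've been trying to pull logs from the extender container between its restarts, though I'm not sure yet what to look for in them (pod and container names taken from the events above):

kubectl -n spark logs spark-scheduler-7bbb5bb979-fhktn -c spark-scheduler-extender
# the failing liveness probe keeps restarting the container, so the previous instance's logs may be the interesting ones
kubectl -n spark logs spark-scheduler-7bbb5bb979-fhktn -c spark-scheduler-extender --previous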

When I run the driver above (which launches properly), the driver immediately requests a single executor pod at startup because spark.dynamicAllocation.minExecutors is set to 1. That pod remains in Pending state indefinitely, and kubectl describe pod on it suggests that no nodes satisfy its scheduling criteria:

Events:
  Type     Reason            Age                   From             Message
  ----     ------            ----                  ----             -------
  Warning  FailedScheduling  3m49s (x37 over 14m)  spark-scheduler  0/4 nodes are available: 4 Insufficient pods.

What I'm having trouble figuring out is:

  1. What exactly are the criteria that cause every node to be considered insufficient? I'm not using any instance-group labels or custom labels, and all the nodes accept the spark namespace. Sorry to ask, but I'm struggling to find the right steps to narrow down the issue.
  2. Are the liveness/readiness errors above a sign that the problem is an unhealthy scheduler? I can open a shell into the two scheduler pods if needed, but I'm not sure which logs to look at once I'm in.

If it helps, this cluster uses AWS Fargate as the compute resource behind Kubernetes, but based on what I know so far that shouldn't be an issue.

onursatici commented 4 years ago

Unfortunately, the spark scheduler extender currently doesn't support launching client-mode applications on Kubernetes. It assumes that a driver will be launched in the cluster, which then proceeds to request executors.
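
For applications that can run in cluster mode, a rough sketch of handing the driver pod to this scheduler is a driver pod template along the lines below; the schedulerName value is an assumption based on the defaults in extender.yml, so verify it against whatever you actually deployed:

# driver-template.yaml (hypothetical file; schedulerName assumed from the extender.yml defaults)
apiVersion: v1
kind: Pod
spec:
  schedulerName: spark-scheduler

You would then pass that with --conf spark.kubernetes.driver.podTemplateFile=/path/to/driver-template.yaml on a cluster-mode submission, similar to how you already pass the executor template.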

That being said, I think your executor pods are failing to be scheduled before the extender is even consulted, as the message says 4 Insufficient pods, which is kube-scheduler's way of telling you that all 4 nodes in your cluster are at their pod count limit.
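
You can sanity-check that with plain kubectl by comparing each node's allocatable pod count against the pods already running on it:

kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_PODS:.status.allocatable.pods
kubectl describe nodes | grep -E 'Name:|Non-terminated Pods'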

Once you have fixed that by increasing your pod limit or killing existing pods, I would expect your pods to still be stuck in Pending, but with a message telling you something like failed to get resource reservations, as the extender will be looking for the space that the driver reserved, which doesn't happen in client mode.
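
The reservations themselves are stored as a custom resource, so once you have a cluster-mode driver running you can inspect what the extender has reserved. I'm writing the resource name from memory, so confirm it against the CRD list first:

kubectl get crds | grep -i reservation
kubectl -n spark get resourcereservations
kubectl -n spark describe resourcereservations <reservation-name>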

As for your second question, I think it has to do with a network problem between the health probe and your container, because the message for the stuck pod indicates that kube-scheduler did consider that pod, and hence is operational.

PeteW commented 4 years ago

Unfortunately, the spark scheduler extender currently doesn't support launching client-mode applications on Kubernetes. It assumes that a driver will be launched in the cluster, which then proceeds to request executors.

That nuance wasn't clear to me, but now that it is I think I can work with this. Good to know, thanks.

That being said, I think your executor pods are failing to be scheduled before the extender is even consulted, as the message says 4 Insufficient pods, which is kube-scheduler's way of telling you that all 4 nodes in your cluster are at their pod count limit. Once you have fixed that by increasing your pod limit or killing existing pods, I would expect your pods to still be stuck in Pending, but with a message telling you something like failed to get resource reservations, as the extender will be looking for the space that the driver reserved, which doesn't happen in client mode.

This is actually how AWS Fargate works as a resource negotiator: hardware is allocated on demand, always one node per pod. For example, say Spark requests resources for a new executor. That becomes a request to Kubernetes for an executor pod, and in the case of Fargate it in turn triggers the allocation of a new VM just-in-time for the lifetime of the executor, billed by the second. In 60-90 seconds (usually) Fargate returns a new VM with Kubernetes tooling pre-installed and configured, sized to the request plus some extra RAM for the kubelet.

When running kubectl get nodes I can see the new node for the requested pod provisioned as expected, but there's something about this new node/VM that the scheduler extender rejects. I can go into more detail, even a step-by-step demonstration, if that helps. The key point I want to make is that there may be something about this cloud-based node-allocation behavior that doesn't jibe with the scheduler extender, at least not without customization.
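
For reference, this is roughly how I've been inspecting the just-provisioned Fargate node to compare it against what the extender might expect (node name copied from the events above); the second command prints the node's allocatable pod count, which is what the Insufficient pods message is checked against:

kubectl describe node fargate-ip-XXX-XXX-XXX-XXX.ec2.internal | grep -A 5 -E 'Labels:|Taints:'
kubectl get node fargate-ip-XXX-XXX-XXX-XXX.ec2.internal -o jsonpath='{.status.allocatable.pods}'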

As for your second question, I think it has to do with a network problem between the health probe and your container, because the message for the stuck pod indicates that kube-scheduler did consider that pod, and hence is operational.

I don't have a good response to this point. Within the VLAN containing the nodes there are currently no restrictions on cross-node communication. It does seem the "connection refused" errors come from requests where the client and server are the same IP, so this might be an oversight I can find by looking closer.
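
One more check I plan to run is hitting the probe endpoint from inside the extender container itself, to separate a real network problem from the server simply not listening yet; this assumes the image ships with curl, which I haven't confirmed:

# -k skips TLS verification, assuming the extender serves a self-signed certificate
kubectl -n spark exec -it spark-scheduler-7bbb5bb979-fhktn -c spark-scheduler-extender -- \
  curl -vk https://localhost:8484/spark-scheduler/status/liveness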