onzo-vlamp opened this issue 6 years ago
We are running jobserver on k8s. I don't think JobServer has any limitations regarding which scheduler Spark uses.
Noorul, do you want to contribute a doc on running job server on k8s? thanks,
Thanks for letting us know @noorul!
I suspected so, but the jobserver docs explicitly mention YARN, standalone, etc.
If you don't have time for a proper doc, even a few notes would be great 🙂
Do you mean I should be able to submit Spark jobs to a Kubernetes cluster?
Would that mean I simply have to modify local.conf and change the master URL?
master = k8s://xxxx
@dennischin It is not that straightforward. Your Spark master and workers should be running on the k8s cluster.
I might be completely off-base here. Please forgive me if I am totally wrong. Any comments/info would be very helpful.
I was reviewing https://spark.apache.org/docs/latest/running-on-kubernetes.html#how-it-works
Could not spark jobserver act as the "client" that performs the spark-submit to the kubernetes cluster?
In my mind I would not be running spark master and workers on k8s cluster but using the newly built k8s scheduler that will create driver and executors as pods on the k8s cluster itself.
I am not familiar with the k8s scheduler for Spark, but we are running Spark standalone on k8s. If you run Spark outside of k8s and the client inside k8s, there will be network issues that prevent driver and worker communication from working properly.
@noorul,
You said that running Spark outside of k8s with the client inside k8s causes network issues that break driver and worker communication.
Is this also true with --deploy-mode = 'cluster'? In that case the driver is deployed onto the worker nodes, so everything should work?
In that case, the JobManager will be running on the Spark worker nodes, and the JobManager has to communicate back to the Spark jobserver, which is running on k8s.
Hello, I'm new to both Spark and Spark JobServer. Here's my attempt at following the cluster doc and making this work on Kubernetes:
- Built the image with sbt docker.
- Set master = "spark://spark-master:7077" in docker.conf; spark-master is the service name of the Spark master.
- Set submit.deployMode = "cluster" in docker.conf.
- Set spark.jobserver.context-per-jvm = true in docker.conf.
- Pointed spark.jobserver.sqldao to an in-cluster Postgres service.
- Set REMOTE_JOBSERVER_DIR=file://jobserver in settings.sh. My rationale is that each worker will see this setting and look for the spark-jobserver files in this local folder.
- Copied spark-job-server.jar and the config files to the /jobserver dir in each worker pod.
- Set akka.remote.netty.tcp.hostname=0.0.0.0. Clearly I don't know what this is for and why it needs to be set.

I see the Spark jobserver pod running fine, but when I try to create a new context it gives:
{
"status": "CONTEXT INIT ERROR",
"result": {
"message": "Context failed to connect back within initialization time",
...
Please let me know what might have gone wrong. Thanks!
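For reference, and only as a sketch pieced together from the list above (the service names are the ones mentioned there, and the sqldao details are omitted), the docker.conf changes amount to something like:

spark {
  master = "spark://spark-master:7077"    # spark-master is the k8s service name of the Spark master
  submit.deployMode = "cluster"

  jobserver {
    context-per-jvm = true
    # sqldao settings pointing at the in-cluster Postgres service would go here (omitted)
  }
}

# The setting I am unsure about:
akka.remote.netty.tcp.hostname = "0.0.0.0"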
Context failed to connect back within initialization time
This error means that the driver process was launched but failed to connect back to the jobserver master within a specific timeout. Please check the driver logs to figure out the problem.
By driver, do you mean the jobserver itself? It initially appeared that jobserver wanted to connect to the master REST endpoint, so I changed master to spark-master:6066. However, I get the same "Context failed to connect back within initialization time" error, and this time the log doesn't show anything of note:
Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/12/23 13:18:14 INFO RestSubmissionClient: Submitting a request to launch an application in spark://spark-master:6066.
19/12/23 13:18:14 INFO RestSubmissionClient: Submission successfully created as driver-20191223131814-0003. Polling submission state...
19/12/23 13:18:14 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20191223131814-0003 in spark://spark-master:6066.
19/12/23 13:18:14 INFO RestSubmissionClient: State of driver driver-20191223131814-0003 is now SUBMITTED.
19/12/23 13:18:14 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20191223131814-0003",
"serverSparkVersion" : "2.3.2",
"submissionId" : "driver-20191223131814-0003",
"success" : true
}
19/12/23 13:18:14 INFO ShutdownHookManager: Shutdown hook called
19/12/23 13:18:14 INFO ShutdownHookManager: Deleting directory /tmp/spark-d8b8f377-3caf-4380-9c18-b9d0ec1505c3
Actually no. Jobserver is a separate process which executes spark-submit to launch "drivers". Drivers are totally different JVMs and are controlled by Spark.
The logs tell me that jobserver requested Spark to launch a driver. Spark accepted the request to launch driver "driver-20191223131814-0003", but the driver went into SUBMITTED state instead of RUNNING.
Since Spark never ran the driver, jobserver throws the timeout exception. You need to check the Spark logs/Spark UI to figure out why the driver was not launched. Generally this happens if the Spark cluster does not have enough resources (CPU/RAM).
Thanks for the pointer. The worker looked for the jobserver jar in the /app dir but could not find it, so I put the jar there and it can now run the driver. However, it seems the driver can't connect to the jobmanager/jobserver? I see this at the end of the worker logs:
2019-12-23 14:20:56 INFO DriverRunner:54 - Launch Command: "/usr/lib/jvm/java-8-openjdk-amd64/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx1024M" "-Dspark.jars=file:/app/spark-job-server.jar" "-Dspark.driver.supervise=false" "-Dspark.submit.deployMode=cluster" "-Dspark.master=spark://spark-master-f784b584c-sljvj:7077" "-Dspark.app.name=sql-context-1" "-Dspark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC
-verbose:gc -XX:+PrintGCTimeStamps
-XX:MaxPermSize=512m
-XX:+CMSClassUnloadingEnabled -Xloggc:gc.out -XX:MaxDirectMemorySize=512M
-XX:+HeapDumpOnOutOfMemoryError -Djava.net.preferIPv4Stack=true -Dlog4j.configuration=file:/app/log4j-server.properties " "-Dspark.driver.memory=1g" "-Dspark.executor.extraJavaOptions=-Dlog4j.configuration=file:/app/log4j-server.properties" "-XX:+UseConcMarkSweepGC" "-verbose:gc" "-XX:+PrintGCTimeStamps" "-XX:MaxPermSize=512m" "-XX:+CMSClassUnloadingEnabled" "-Xloggc:gc.out" "-XX:MaxDirectMemorySize=512M" "-XX:+HeapDumpOnOutOfMemoryError" "-Djava.net.preferIPv4Stack=true" "-Dlog4j.configuration=file:/app/log4j-server.properties" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@10.44.226.209:36545" "/spark/work/driver-20191223142056-0007/spark-job-server.jar" "spark.jobserver.JobManager" "akka.tcp://JobServer@0.0.0.0:7080" "jobManager-c9d1809a-8d97-4f14-92fe-60a457c00c62" "file:/app/docker.conf"
I see this: "spark.jobserver.JobManager" "akka.tcp://JobServer@0.0.0.0:7080"
. Pretty sure that was because I set akka.remote.netty.tcp.hostname
to 0.0.0.0
. This is quite a bind because while it's fine for jobserver akka to bind to 0.0.0.0
, the worker should instead try to reach the jobserver at spark-jobserver
because that's the service name.
Is there any way to set this hostname value differently for worker and jobserver?
In jobserver, you have a few options that were implemented a while ago. You can set the property spark.jobserver.network-address-resolver to either of the supported strategies; the config also has a description that you can read to decide. The default is akka, but then you need to not set anything for hostname.
Here is the original commit with more information in the commit message https://github.com/spark-jobserver/spark-jobserver/commit/1fd6ae5bc1a8d2da789843d0eeb25fc5e5094868
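As a sketch of the default strategy described above (not a tested setup; see the linked commit for the full list of accepted values):

# Let jobserver/Akka resolve the host address itself (the default strategy)
spark.jobserver.network-address-resolver = "akka"
# Important: leave akka.remote.netty.tcp.hostname unset so the resolved address is used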
Thanks! If Akka figures out which IP to bind to via the /etc/hosts file, then that makes things simple: I just need to add entries to the hostAliases section of my pods. Another problem is that the driver binds to a random port (e.g. 35619). How do I tell the driver which port to bind to?
I see these lines in jobserver logs:
[2019-12-30 13:25:41,753] INFO .Cluster(akka://JobServer) [] [akka.cluster.Cluster(akka://JobServer)] - Cluster Node [akka.tcp://JobServer@spark-jobserver:7080] - Node [akka.tcp://JobServer@10.44.234.54:35619] is JOINING, roles [manager]
...
[2019-12-30 13:26:41,978] INFO AkkaClusterSupervisorActor [] [] - Failed to send initialize message to context Actor[akka.tcp://JobServer@10.44.234.54:35619/user/jobManager-665fe58b-4a34-48ef-b5f0-b9d0b609cbf3#-994358033]
Is there any way to tell the cluster node which hostname and which port to bind to?
Also, I see this in the driver's stdout:
[2019-12-30 13:25:42,761] WARN neAppClient$ClientEndpoint [] [akka://JobServer/user/jobManager-665fe58b-4a34-48ef-b5f0-b9d0b609cbf3] - Failed to connect to master spark-master-f784b584c-gwz7k:7077
Is there any way to tell the driver to connect to "spark-master" (the service name) instead of "spark-master-f784b584c-gwz7k" (the container hostname)?
Is there any way to tell the driver to connect to "spark-master" (the service name) instead of "spark-master-f784b584c-gwz7k" (the container hostname)?
This error seems to be specific to the K8s Spark deployment. I can give you an answer for the spark-standalone resource manager and you can find something similar for k8s. In standalone, we either set the property spark.master
to the right IP (in your case the service name) or, while spinning up the Spark worker JVM, pass the spark.master IPs like
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP1:7077,IP2:7077
Is there any way to tell the cluster node which hostname and which port to bind to?
By cluster node, you mean the driver?
If you are using the akka network strategy, then both jobserver and drivers will use InetAddress.getLocalHost.getHostAddress as the hostname.
The port is a bit trickier. For jobserver, it is better to hardcode the port (especially if you are using supervise mode) here. For drivers (in cluster mode) the port is always random, see this.
A bit more info: generally we share the config file between jobserver and drivers, which means that the same port would be assigned to both. The problem arises when you spin up multiple drivers on the same VM, because of port conflicts, so the jobserver code sets the driver port to random (setting the port to 0 means random in Akka terms).
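As a rough illustration of that shared-config behaviour (the port value is just the one visible in the logs above):

# The jobserver process reads a fixed Akka port from the shared config ...
akka.remote.netty.tcp.port = 7080
# ... while for each driver the launcher overrides the port to 0, which Akka
# interprets as "pick a random free port", avoiding conflicts between drivers on the same VM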
How do I tell the driver which port to bind to?
Not a good idea, since you always want more than one driver to be running in the Spark cluster. You would then need to tell each driver which port to bind to. So we go with a random port.
Btw, what is the k8s-specific problem you are having that makes you want to bind drivers to fixed ports?
[2019-12-30 13:26:41,978] INFO AkkaClusterSupervisorActor [] [] - Failed to send initialize message to context Actor[akka.tcp://JobServer@10.44.234.54:35619/user/jobManager-665fe58b-4a34-48ef-b5f0-b9d0b609cbf3#-994358033]
The above log line gives me a clue that there is a networking issue in the setup.
K8s does not support random port binding. Ideally each driver should run in its own pod, not sharing the same pod as the Spark worker. That way one could identify each driver via its hostname (service name) instead of a port. But then Spark/jobserver would need to talk to k8s to do that, which sounds like something only the k8s scheduler will be able to provide. I have been trying to run Spark standalone on k8s, but it looks like it won't work out. Will try things out with the k8s scheduler next.
Actually, if the jobserver code sets the driver port to random, then even the k8s scheduler might not be able to help.
We can always adjust the code, because sooner or later we need to support k8s.
Btw, are you sure that random ports won't work? This sounds like a common scenario to me. Also, do we even have a Spark worker in k8s? I thought we just launch drivers with the k8s resource manager.
See this: https://github.com/kubernetes/kubernetes/issues/49792
Kubernetes only runs containers, and containers are ephemeral things; best practice dictates that each should run only one process. So they are more like processes and less like VMs.
And from my observation there hasn't been a real push for a random-port feature, not for the purpose of process identification anyhow. K8s already has features such as StatefulSets for that.
I use the images & k8s manifests from big-data-europe to spin up master and worker. With this setup it seems that there is a set of always running worker pods that will run any application submitted to the master. The jobserver driver seems to launch just fine within worker pod.
OK, so if I understand you correctly, you are not using the k8s resource manager but rather trying to spin up the standalone resource manager in k8s.
You already mentioned this above; at the time I didn't fully get it, but now I understand.
Yes, I think it would not work. Jobserver submits the request to spark-master, which commands the workers to spin up the driver. The workers would need to know how to spin up the driver by somehow communicating with the k8s API, which they don't (I believe). So, for now, your only hope is the k8s scheduler or multiple drivers in the same pod.
Yes. I think neither will work with jobserver for now.
We use a standalone Spark cluster running in k8s, along with Spark jobserver deployed on k8s. We are using an old version of SJS; I am not sure whether everything will work as-is with the new one.
How did you overcome the random port binding issue? Did you run jobserver and the worker/driver in the same container?
I believe the random port is coming from the job server's Akka worker processes, is that right? There is a way to configure a static port in Akka that works with K8s. Not easy, but it could be done.
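To make that concrete, a static-port setup for the jobserver pod could look roughly like the sketch below. This is only a sketch, not a tested setup; it assumes an Akka version that supports the separate bind-* settings, and the service name/port are placeholders:

akka.remote.netty.tcp {
  hostname = "spark-jobserver"   # address advertised to drivers, e.g. the k8s service name
  port = 7080                    # fixed, so it can be exposed as a service/container port
  bind-hostname = "0.0.0.0"      # what the socket actually binds to inside the pod
  bind-port = 7080
}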
@velvia yes, the port is coming from Akka.
Hi, I am struggling a bit here as well, in trying to deploy job-server to k8s.
If @noorul, @bsikander, or someone else can add a comment or some notes on how this can be made possible, that would be great!
@drametoid Where is your Spark scheduler running? For SJS to work on k8s, Spark should be running on k8s. Otherwise, you will face lots of networking-related issues.
May I know whether jobserver on k8s is officially supported or not? If not, what is the official way to communicate with Spark on K8s? Livy?
I am using a Kubernetes cluster instead of a YARN cluster, and my Spark Job Server is outside the Kubernetes cluster. I want to find out whether Spark Job Server has support for launching/managing a Spark context inside the Kubernetes cluster. I suppose this would require SJS to be able to communicate with the Kubernetes API server. The following diagram explains my use case.
Hello, is anyone available to answer my question? I have posted some details, but if they are insufficient, please let me know and I can elaborate further.
@noorul, @velvia, @bsikander - I want to work on this feature request and contribute back. How should I proceed? Can I have a mentor who can provide me some guidance with this?
@sj123050037 sorry for the delayed response. Yes, we can help you. How familiar are you with the jobserver codebase and k8s?
Thanks for your reply, @bsikander! Below is the architecture I am aiming for. I am a beginner with Kubernetes and not very familiar with the SJS code base, but I have experience using SJS in YARN cluster mode.
Hi, a couple of comments from my side; maybe they will be helpful for @sj123050037 or someone else who is interested in this change (SJS on K8S using Spark on K8S):
You can start testing the code by using the usual local Jobserver setup and just trying to publish a context in Kubernetes.
JobManager is the entry point for the context. JobManager checks the provided parameters (Akka connect string, context name, configuration file) and creates a JobManagerActor. JobManagerActor is the Akka actor which runs inside the driver, communicates with the main Jobserver, and runs a job if Jobserver gives such a command.
I think that the K8S spark-submit command will look similar to this one (it depends on where your jar is, how you provide the configuration, and so on):
./spark-submit --master k8s://https://localhost:6443 \
--deploy-mode cluster \
--name sjs-driver \
--class spark.jobserver.JobManager \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=${DOCKER_HUB_USER}/spark:latest \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=http://${IP}:8080/log4j-server.properties" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=http://${IP}:8080/log4j-server.properties" \
http://${IP}:8080/spark-job-server.jar "akka.tcp://JobServer@${IP}:2552" contextname "/opt/spark/sjs/kubernetes.conf"
In my case I opened a local HTTP server with Python to share the jar file and log4j configuration with the pod. To allow TCP connections on a certain port for Akka, you need to tweak the settings in application.conf; this is also mentioned in the comments above.
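For reference, that ad-hoc file sharing can be as simple as the following (the port is chosen to match the ${IP}:8080 URLs in the command above):

# Serve spark-job-server.jar and log4j-server.properties to the driver pod over plain HTTP
cd /directory/with/artifacts
python3 -m http.server 8080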
K8S is similar to any other resource manager, so there should not be any significant changes; mainly networking and file sharing should be checked:
- The configuration file (MANAGER_CONF_FILE) for JobManager can be read only from HDFS or from the local filesystem. For my tests I packaged it into the Spark container in the end, because the --files parameter for spark-submit is somehow not working: the files are copied inside the container, but they seem to be misplaced (https://issues.apache.org/jira/browse/SPARK-31726).
- The H2 database (default DAO) can be started in server mode, e.g. java -cp /path/to/h2/bin/h2.jar org.h2.tools.Server -ifNotExists.
- Forceful kill of a context should be checked (DELETE /contexts/name?force=true) because it relies on Spark UI features, so most likely it should be documented that this feature doesn't work.
- The Metrics API may also be affected by this change and should be checked as well.
There is a working Docker image (check open PRs, maybe there are some fixes, because it often gets broken :)), which can be compiled by running sbt docker. Use export SCALA_VERSION=2.11.x (with your local version of Scala) to compile the Docker image with Scala 2.11 instead of the default 2.12. Update config/docker.conf to change the configuration.
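For example (the exact Scala patch version here is just a placeholder):

# Build the jobserver Docker image against Scala 2.11 instead of the default 2.12
export SCALA_VERSION=2.11.12
sbt docker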
Hi,
Spark 2.3 shipped with experimental support for Kubernetes. The reality is that Spark on Kubernetes has been battle-tested in a few companies; the "experimental" part sounds to me more like a caveat about configuration changes.
Are there plans to make spark-jobserver run on Kubernetes?
Thanks!