spark-jobserver / spark-jobserver

REST job server for Apache Spark

Kubernetes support? #1038

Open onzo-vlamp opened 6 years ago

onzo-vlamp commented 6 years ago

Hi,

Spark 2.3 shipped with experimental support for Kubernetes. In reality, Spark on Kubernetes has already been battle-tested at a few companies; the "experimental" label sounds to me more like a caveat about possible configuration changes.

Are there plans to make spark-jobserver run on Kubernetes?

Thanks!

noorul commented 6 years ago

We are running jobserver on k8s. I don't think JobServer has any limitations regarding which scheduler Spark is using.

velvia commented 6 years ago

Noorul, do you want to contribute a doc on running job server on k8s? Thanks!


onzo-vlamp commented 6 years ago

Thanks for letting us know @noorul!

I suspected so, but the jobserver docs explicitly mention YARN, standalone, etc.

If you don't have time for a proper doc, even a few notes would be great 🙂

dennischin commented 6 years ago

Do you mean I should be able to submit Spark jobs to a Kubernetes cluster?

Would that mean I simply have to modify local.conf and change the master URL?

master = k8s://xxxx

noorul commented 6 years ago

@dennischin It is not that straightforward. Your Spark master and workers should be running on the k8s cluster.

dennischin commented 6 years ago

I might be completely off-base here. Please forgive me if I am totally wrong. Any comments/info would be very helpful.

I was reviewing https://spark.apache.org/docs/latest/running-on-kubernetes.html#how-it-works

Couldn't spark jobserver act as the "client" that performs the spark-submit to the Kubernetes cluster?

In my mind I would not be running the Spark master and workers on the k8s cluster, but instead using the newly built k8s scheduler that creates the driver and executors as pods on the k8s cluster itself.

noorul commented 6 years ago

I am not familiar with the k8s scheduler for Spark, but we are running Spark standalone on k8s. If you run Spark outside of k8s and have the client inside k8s, there will be network issues that prevent driver and worker communication from working properly.

ejthsiao commented 6 years ago

@noorul,

When you said that running Spark outside of k8s with the client inside k8s causes network issues that break driver and worker communication:

Is this true with --deploy-mode=cluster as well? If so, the driver is deployed onto the worker nodes and everything should work?

noorul commented 6 years ago

In that case, the JobManager will be running on the Spark worker nodes, and the JobManager has to communicate back to the spark jobserver which is running on k8s.

pckhoi commented 4 years ago

Hello, I'm new to both Spark and Spark JobServer. Here's my attempt at following the cluster doc and making this work on Kubernetes:

I see the Spark jobserver pod running fine, but when I try to create a new context it gives:

{
  "status": "CONTEXT INIT ERROR",
  "result": {
    "message": "Context failed to connect back within initialization time",
    ...

Please let me know what might have gone wrong. Thanks!

bsikander commented 4 years ago
Context failed to connect back within initialization time

This error means that the driver process was launched but failed to connect back to the jobserver master within a specific timeout. Please check the driver logs to figure out the problem.
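If the driver simply needs longer to come up, the timeout itself can also be raised. A minimal sketch of such a tweak, assuming the relevant setting is spark.jobserver.context-creation-timeout (verify the exact key and default against the application.conf shipped with your jobserver version):

# in your jobserver config (e.g. local.conf / docker.conf); value is illustrative
spark.jobserver.context-creation-timeout = 60 s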

pckhoi commented 4 years ago

By driver you mean the jobserver itself? It initially appeared that the jobserver wanted to connect to the master REST endpoint, so I changed master to spark-master:6066. However, I get the same Context failed to connect back within initialization time error, and this time the log doesn't show anything of note:

Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/12/23 13:18:14 INFO RestSubmissionClient: Submitting a request to launch an application in spark://spark-master:6066.
19/12/23 13:18:14 INFO RestSubmissionClient: Submission successfully created as driver-20191223131814-0003. Polling submission state...
19/12/23 13:18:14 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20191223131814-0003 in spark://spark-master:6066.
19/12/23 13:18:14 INFO RestSubmissionClient: State of driver driver-20191223131814-0003 is now SUBMITTED.
19/12/23 13:18:14 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20191223131814-0003",
  "serverSparkVersion" : "2.3.2",
  "submissionId" : "driver-20191223131814-0003",
  "success" : true
}
19/12/23 13:18:14 INFO ShutdownHookManager: Shutdown hook called
19/12/23 13:18:14 INFO ShutdownHookManager: Deleting directory /tmp/spark-d8b8f377-3caf-4380-9c18-b9d0ec1505c3
bsikander commented 4 years ago

Actually no. Jobserver is a separate process which executes spark-submit to launch "drivers". Drivers are totally different JVMs and are controlled by Spark.

The logs are telling me that jobserver requested Spark to launch a driver. Spark accepted the request to launch driver "driver-20191223131814-0003", but the driver went into the SUBMITTED state instead of RUNNING.

Since Spark never ran the driver, jobserver throws the timeout exception. You need to check the Spark logs/SparkUI to figure out why the driver was not launched. Generally this happens if the Spark cluster does not have enough resources (CPU/RAM).

pckhoi commented 4 years ago

Thanks for the pointer. The worker looked for the jobserver jar in the /app dir but couldn't find it, so I put the jar there and it can now run the driver. However, it seems the driver can't connect to the jobmanager/jobserver? I see this at the end of the worker logs:

2019-12-23 14:20:56 INFO  DriverRunner:54 - Launch Command: "/usr/lib/jvm/java-8-openjdk-amd64/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx1024M" "-Dspark.jars=file:/app/spark-job-server.jar" "-Dspark.driver.supervise=false" "-Dspark.submit.deployMode=cluster" "-Dspark.master=spark://spark-master-f784b584c-sljvj:7077" "-Dspark.app.name=sql-context-1" "-Dspark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC
         -verbose:gc -XX:+PrintGCTimeStamps
         -XX:MaxPermSize=512m
         -XX:+CMSClassUnloadingEnabled  -Xloggc:gc.out -XX:MaxDirectMemorySize=512M
         -XX:+HeapDumpOnOutOfMemoryError -Djava.net.preferIPv4Stack=true -Dlog4j.configuration=file:/app/log4j-server.properties  " "-Dspark.driver.memory=1g" "-Dspark.executor.extraJavaOptions=-Dlog4j.configuration=file:/app/log4j-server.properties" "-XX:+UseConcMarkSweepGC" "-verbose:gc" "-XX:+PrintGCTimeStamps" "-XX:MaxPermSize=512m" "-XX:+CMSClassUnloadingEnabled" "-Xloggc:gc.out" "-XX:MaxDirectMemorySize=512M" "-XX:+HeapDumpOnOutOfMemoryError" "-Djava.net.preferIPv4Stack=true" "-Dlog4j.configuration=file:/app/log4j-server.properties" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@10.44.226.209:36545" "/spark/work/driver-20191223142056-0007/spark-job-server.jar" "spark.jobserver.JobManager" "akka.tcp://JobServer@0.0.0.0:7080" "jobManager-c9d1809a-8d97-4f14-92fe-60a457c00c62" "file:/app/docker.conf"

I see this: "spark.jobserver.JobManager" "akka.tcp://JobServer@0.0.0.0:7080". Pretty sure that was because I set akka.remote.netty.tcp.hostname to 0.0.0.0. This is quite a bind because while it's fine for jobserver akka to bind to 0.0.0.0, the worker should instead try to reach the jobserver at spark-jobserver because that's the service name.

Is there any way to set this hostname value differently for worker and jobserver?

bsikander commented 4 years ago

In jobserver, you have a few options that were implemented a while ago. You can set the property spark.jobserver.network-address-resolver to choose how the hostname is resolved.

The config also has a description of each option that you can read to decide. The default is akka, but in that case you must not set anything for the hostname.

Here is the original commit with more information in the commit message https://github.com/spark-jobserver/spark-jobserver/commit/1fd6ae5bc1a8d2da789843d0eeb25fc5e5094868
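For the bind-vs-advertise problem specifically, here is a minimal application.conf sketch using the standard Akka classic remoting settings (these are plain Akka options rather than anything jobserver-specific, so treat the values as an illustration and adapt them to your setup):

# advertise the k8s service name to peers, but bind to all interfaces inside the pod
akka.remote.netty.tcp.hostname = "spark-jobserver"   # address other nodes use to reach this JVM
akka.remote.netty.tcp.bind-hostname = "0.0.0.0"      # address the socket actually binds to
akka.remote.netty.tcp.port = 7080                    # advertised port
akka.remote.netty.tcp.bind-port = 7080               # locally bound port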

pckhoi commented 4 years ago

Thanks! If Akka figures out which IP to bind to via the /etc/hosts file, then that makes things simple; I just need to add an entry to the hostAliases section of my pods. Another problem is that the driver binds to a random port (e.g. 35619). How do I tell the driver which port to bind to?

pckhoi commented 4 years ago

I see these lines in the jobserver logs:

[2019-12-30 13:25:41,753] INFO  .Cluster(akka://JobServer) [] [akka.cluster.Cluster(akka://JobServer)] - Cluster Node [akka.tcp://JobServer@spark-jobserver:7080] - Node [akka.tcp://JobServer@10.44.234.54:35619] is JOINING, roles [manager]
...
[2019-12-30 13:26:41,978] INFO  AkkaClusterSupervisorActor [] [] - Failed to send initialize message to context Actor[akka.tcp://JobServer@10.44.234.54:35619/user/jobManager-665fe58b-4a34-48ef-b5f0-b9d0b609cbf3#-994358033]

Is there any way to tell the cluster node which hostname and which port to bind to?

pckhoi commented 4 years ago

Also, I see this in the driver's stdout:

[2019-12-30 13:25:42,761] WARN  neAppClient$ClientEndpoint [] [akka://JobServer/user/jobManager-665fe58b-4a34-48ef-b5f0-b9d0b609cbf3] - Failed to connect to master spark-master-f784b584c-gwz7k:7077

Is there any way to tell the driver to connect to "spark-master" (service name) instead of "spark-master-f784b584c-gwz7k" (container hostname)?

bsikander commented 4 years ago

Is there any way to tell the driver to connect to "spark-master" (service name) instead of "spark-master-f784b584c-gwz7k" (container hostname)?

This error seems to be specific to the K8s Spark deployment. I can give you an answer for the spark-standalone resource manager and you can find something similar for k8s. In standalone, we either set the property spark.master to the right address (in your case the service name) or, while spinning up the Spark worker JVM, pass the master URLs like

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP1:7077,IP2:7077

Is there any way to tell the cluster node which hostname and which port to bind to?

By cluster node, you mean the driver? If you are using the akka network strategy then both jobserver and the drivers will use InetAddress.getLocalHost.getHostAddress as the hostname. The port is a bit more tricky. For jobserver, it is better to hardcode the port (especially if you are using supervise mode) here. For drivers the port is always random (in cluster mode), see this.

A bit more info: generally we share the config file between jobserver and drivers, which means that the same port would be assigned to both jobserver and drivers. The problem arises when you spin up multiple drivers on the same VM, because the ports conflict, so the jobserver code sets the port to random for drivers (setting the port to 0 means random in Akka terms).

How do I tell the driver which port to bind to?

Not a good idea, since you always want more than one driver running in the Spark cluster. You would then need to tell each driver which port to bind to, so we go with a random port.
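To illustrate the trade-off in config terms (a sketch of the Akka setting involved, not a recommendation):

# jobserver process: a fixed, well-known port works fine
akka.remote.netty.tcp.port = 7080
# drivers: jobserver effectively overrides this to 0, which tells Akka to pick a random
# free port, so several drivers can run on the same host without port conflicts
akka.remote.netty.tcp.port = 0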

Btw, what is the k8s-specific problem you are having that makes you want to bind drivers to fixed ports?

bsikander commented 4 years ago
[2019-12-30 13:26:41,978] INFO  AkkaClusterSupervisorActor [] [] - Failed to send initialize message to context Actor[akka.tcp://JobServer@10.44.234.54:35619/user/jobManager-665fe58b-4a34-48ef-b5f0-b9d0b609cbf3#-994358033]

The above log line gives me a clue that there is a networking issue in the setup.

pckhoi commented 4 years ago

K8s does not support random port binding. Ideally each driver should run in its own pod, not sharing the same pod as the Spark worker. That way one could identify each driver via its hostname (service name) instead of its port. But then Spark/jobserver would need to talk to k8s to do that, which sounds like something only the k8s scheduler will be able to provide. I have been trying to run Spark standalone on k8s, but it looks like it won't work out. I will try things out with the k8s scheduler next.

pckhoi commented 4 years ago

Actually, if the jobserver code sets the driver port to random, then even the k8s scheduler might not be able to help.

bsikander commented 4 years ago

We can always adjust the code, because sooner or later we will need k8s support.

bsikander commented 4 years ago

Btw, are you sure that random ports won't work? This sounds like a common scenario to me. Also, do we even have a Spark worker in k8s? I thought we just launch drivers with the k8s resource manager.

pckhoi commented 4 years ago

See this: https://github.com/kubernetes/kubernetes/issues/49792

Kubernetes only runs containers, and containers are ephemeral; best practice dictates that each should run only one process. So they are more like processes and less like VMs.

And from my observation there hasn't been a real push for a random port feature, not for the purpose of process identification anyhow. K8s already has features such as StatefulSet for that.

I use the images & k8s manifests from big-data-europe to spin up the master and worker. With this setup there is a set of always-running worker pods that will run any application submitted to the master. The jobserver driver seems to launch just fine within the worker pod.

bsikander commented 4 years ago

OK, so if I understand you correctly, you are not using the k8s resource manager but rather trying to spin up the standalone resource manager in k8s.

You already mentioned this above. At that time, I didn't get that fully. Now, I understand.

Yes, I think it would not work. Jobserver submits the request to the spark-master, which commands the workers to spin up the driver. The workers would need to know how to spin up the driver by somehow communicating with the k8s API, which they don't (I believe). So, for now, your only hope is the k8s scheduler or multiple drivers in the same pod.

pckhoi commented 4 years ago

Yes. I think neither will work with jobserver for now.


noorul commented 4 years ago

We use a standalone Spark cluster running in k8s, and alongside it spark jobserver is also deployed on k8s. What we are using is an old version of SJS; I am not sure whether everything will work as-is with the new one.


pckhoi commented 4 years ago

How did you overcome the random port binding issue? Did you run jobserver and the worker/driver in the same container?


velvia commented 4 years ago

I believe the random port is coming from the job server Akka worker processes, is that right? There is a way to configure a static port in Akka that works with K8s. Not easy, but it could be done.


bsikander commented 4 years ago

@velvia yes, the port is coming from Akka.

drametoid commented 4 years ago

Hi, I am struggling a bit here as well, trying to deploy job-server to k8s.

If @noorul, @bsikander, or someone else can add a comment or some notes on how this can be done, that would be great!

noorul commented 4 years ago

@drametoid Where is your Spark scheduler running? For SJS to work on k8s, Spark should also be running on k8s. Otherwise, you will face lots of networking-related issues.

huangwei0413 commented 4 years ago

May I know whether jobserver on k8s is officially supported or not? If not, what is the official way to communicate with Spark on K8s, Livy?

sj123050037 commented 4 years ago

I am using a Kubernetes cluster instead of a YARN cluster, and my Spark Job Server is outside the Kubernetes cluster. I want to find out whether the Spark Job Server has support for launching/managing a Spark context inside the Kubernetes cluster. I suppose this would require SJS to be able to communicate with the Kubernetes API server. The following diagram explains my use case.

[use-case diagram]

sj123050037 commented 4 years ago

Hello, is anyone available to answer my question? I have posted some details, but if they are insufficient please let me know and I can elaborate further.

sj123050037 commented 3 years ago

@noorul, @velvia, @bsikander - I want to work on this feature request and contribute back. How should I proceed? Can I have a mentor who can provide me some guidance?

bsikander commented 3 years ago

@sj123050037 sorry for the delayed response. Yes, we can help you. How familiar are you with the jobserver codebase and k8s?

sj123050037 commented 3 years ago

Thanks for your reply, @bsikander! Below is the architecture I am aiming for. [architecture diagram HLD1] I am a beginner with Kubernetes and not very familiar with the SJS code base. I have experience using SJS in YARN cluster mode.

vglagoleva commented 3 years ago

Hi, a couple of comments from my side; maybe they will be helpful for @sj123050037 or someone else who is also interested in this change (SJS on K8S using Spark on K8S):

  1. You can start testing the code by using the usual local Jobserver setup and just trying to publish a context in Kubernetes. JobManager is the entry point for the context: it checks the provided parameters (Akka connect string, context name, configuration file) and creates a JobManagerActor. JobManagerActor is the Akka actor which runs inside the driver, communicates with the main Jobserver, and runs a job if the Jobserver gives such a command. I think the K8S spark-submit command will look similar to this one (depending on where your jar is, how you provide configuration, and so on):

    ./spark-submit --master k8s://https://localhost:6443 \
    --deploy-mode cluster \
    --name sjs-driver \
    --class spark.jobserver.JobManager \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=${DOCKER_HUB_USER}/spark:latest \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=http://${IP}:8080/log4j-server.properties" \
    --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=http://${IP}:8080/log4j-server.properties" \
    http://${IP}:8080/spark-job-server.jar "akka.tcp://JobServer@${IP}:2552" contextname "/opt/spark/sjs/kubernetes.conf"

    In my case I opened a local HTTP server with Python to share the jar file and log4j configuration with the pod. To allow TCP connections from a certain port for Akka, you need to tweak the settings in application.conf; this is also mentioned in the comments above.

  2. K8S is similar to any other resource manager, so there should not be any significant changes; mainly networking and file sharing need to be checked:

    • The configuration file for JobManager (the last parameter in my spark-submit command, or MANAGER_CONF_FILE) can be read only from HDFS or from the local filesystem; a sketch of what such a file might contain follows after this list. For my tests I ended up packaging it into the Spark container, because the --files parameter for spark-submit is somehow not working. The files are copied inside the container, but they seem to be misplaced: https://issues.apache.org/jira/browse/SPARK-31726.
    • The driver/pod should also have access to the Jobserver DB; e.g. an H2 DB can be started like this: java -cp /path/to/h2/bin/h2.jar org.h2.tools.Server -ifNotExists.
  3. Forceful kill of the context (DELETE /contexts/name?force=true) should be checked, because it relies on Spark UI features; most likely it should be documented that this feature doesn't work.

  4. The Metrics API may also be affected by this change and should be checked as well.

  5. There is a working Docker image (check open PRs, maybe there are some fixes, because it often gets broken :)), which can be built by running sbt docker. Use export SCALA_VERSION=2.11.x (with your local version of Scala) to build the Docker image with Scala 2.11 instead of the default 2.12. Update config/docker.conf to change the configuration.
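As a starting point for the configuration file mentioned in point 1 (/opt/spark/sjs/kubernetes.conf), here is a minimal sketch of what it might contain. The keys shown are illustrative assumptions and should be checked against the config reference of your jobserver version:

# illustrative sketch only, not a tested configuration
spark {
  master = "k8s://https://localhost:6443"
  submit.deployMode = "cluster"
  jobserver {
    context-per-jvm = true    # one JVM (here: one pod) per context
  }
}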