pregazzoni opened this issue 8 years ago
+1 for being able to execute spark jobs remotely.
Rohan pointed me at Livy a few days back. Although it looks really cool, I'm not sure I'd want the ability to run Spark jobs remotely to have a dependency on users installing Livy.
Another option is to push engineering to fix the OozieWorkflowSpark example. It's not good that there isn't an out-of-the-box option for remotely running Spark scripts on BigInsights.
At the moment, I see the development workflow as:
Does that approach make sense?
I've had some feedback that 4.2 will fix the issues with WebHCat and Oozie.
@pregazzoni are there any other options for executing Spark jobs remotely without having to install something on the cluster?
Needs some research; I have not found one yet.
I can do remote execution like this, to give you the idea:
vagrant@vagrant-ubuntu-trusty-64:~$ sudo -u ambari-qa \
HADOOP_CONF_DIR=/home/vagrant/hadoop-conf-hdtest102 \
spark-1.5.1-bin-hadoop2.6/bin/spark-submit \
--master yarn-cluster --num-executors 2 --executor-cores 1 --executor-memory 1G \
--class "org.apache.spark.examples.WordCount" \
/tmp/scala-examples.jar /tmp/LICENSE
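For this to work, the local HADOOP_CONF_DIR (hadoop-conf-hdtest102 above) needs to hold the cluster's client configs (core-site.xml, yarn-site.xml, etc.). As a rough sketch, assuming SSH access to the cluster's master node (the hostname hdtest102 below is my guess based on the directory name), you could pull them over like:

# Copy the cluster's client-side Hadoop configs to the submitting machine
mkdir -p /home/vagrant/hadoop-conf-hdtest102
scp root@hdtest102:/etc/hadoop/conf/*-site.xml /home/vagrant/hadoop-conf-hdtest102/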
Now, to do the same against a Kerberized cluster, we would need to set up the Kerberos client/config (REALM) and get a valid keytab. That might not be trivial for BIoC. Right now you could potentially use an "edge" node configured as part of BIoC, but that's basically the same as SSHing into the master/manager node.
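As a sketch of what that would look like, assuming a valid keytab is in hand (the keytab path and principal below are illustrative, they depend on how the cluster was Kerberized), you'd authenticate before submitting:

# Obtain a Kerberos ticket from the cluster's KDC before submitting
kinit -kt /etc/security/keytabs/ambari-qa.headless.keytab ambari-qa@EXAMPLE.COM
spark-1.5.1-bin-hadoop2.6/bin/spark-submit --master yarn-cluster ...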
Also need to make sure to set -Diop.version=actual-version, or update the conf/ files accordingly, as you would see a "Bad substitution" error otherwise.
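For example, the property can be passed on the spark-submit command line (a sketch; the 4.1.0.0 version string is an assumption and should match the cluster's actual IOP version):

# Pass iop.version to both the driver and the YARN AM so that
# ${iop.version} placeholders in the cluster configs resolve
spark-submit --master yarn-cluster \
  --conf "spark.driver.extraJavaOptions=-Diop.version=4.1.0.0" \
  --conf "spark.yarn.am.extraJavaOptions=-Diop.version=4.1.0.0" \
  ...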
Right now we have to copy the code over to the cluster (e.g. via scp) and run it "locally" there.
Need to look at the alternative: https://developer.ibm.com/hadoop/blog/2016/04/26/submit-spark-jobs-via-rest-iop-4-1-using-livy/
See also https://issues.apache.org/jira/browse/KNOX-516
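For reference, submitting the same job through Livy's batch REST API would look roughly like this (a sketch: livy-host is a made-up hostname, 8998 is Livy's default port, and the jar/args mirror the spark-submit example above):

# POST a batch job to the Livy server; Livy submits it to YARN on our behalf
curl -X POST -H "Content-Type: application/json" \
  -d '{
        "file": "/tmp/scala-examples.jar",
        "className": "org.apache.spark.examples.WordCount",
        "args": ["/tmp/LICENSE"]
      }' \
  http://livy-host:8998/batches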