pregazzoni opened this issue 8 years ago
+1 for being able to execute spark jobs remotely.
Rohan pointed me at Livy a few days back. Although it looks really cool, I'm not sure I'd want the ability to run Spark jobs remotely to have a dependency on users installing Livy.
Another option is to push engineering to fix the OozieWorkflowSpark example. It's not good that there isn't an out-of-the-box option for remotely running Spark scripts on BigInsights.
At the moment, I see the development workflow as:
Does that approach make sense?
I've had some feedback that 4.2 will fix the issues with WebHCat and Oozie.
@pregazzoni are there any other options for executing Spark jobs remotely without having to install something on the cluster?
Needs some research; I have not found one yet.
I can do remote execution like this, to give you the idea:
vagrant@vagrant-ubuntu-trusty-64:~$ sudo -u ambari-qa \
HADOOP_CONF_DIR=/home/vagrant/hadoop-conf-hdtest102 \
spark-1.5.1-bin-hadoop2.6/bin/spark-submit \
--master yarn-cluster --num-executors 2 --executor-cores 1 --executor-memory 1G \
--class "org.apache.spark.examples.WordCount" \
/tmp/scala-examples.jar /tmp/LICENSE
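For this to work, the local HADOOP_CONF_DIR (hadoop-conf-hdtest102 above) needs to hold the cluster's client configs (core-site.xml, yarn-site.xml, etc.). As a rough sketch, assuming SSH access to the cluster's master node (the hostname hdtest102 below is my guess based on the directory name), you could pull them over like:

# Copy the cluster's client-side Hadoop configs to the submitting machine
mkdir -p /home/vagrant/hadoop-conf-hdtest102
scp root@hdtest102:/etc/hadoop/conf/*-site.xml /home/vagrant/hadoop-conf-hdtest102/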
Now, to do the same against a Kerberized cluster, we would need to set up the Kerberos client/config (REALM) and get a valid keytab. That might not be trivial for BIoC. Right now you could potentially use an "edge" node configured as part of BIoC, but that's basically the same as SSHing into the master/manager node.
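As a sketch of what that would look like, assuming a valid keytab is in hand (the keytab path and principal below are illustrative, they depend on how the cluster was Kerberized), you'd authenticate before submitting:

# Obtain a Kerberos ticket from the cluster's KDC before submitting
kinit -kt /etc/security/keytabs/ambari-qa.headless.keytab ambari-qa@EXAMPLE.COM
spark-1.5.1-bin-hadoop2.6/bin/spark-submit --master yarn-cluster ...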
Also need to make sure to set -Diop.version=actual-version, or update the conf/ files accordingly, as you would see a "Bad substitution" error otherwise.
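For example, the property can be passed on the spark-submit command line (a sketch; the 4.1.0.0 version string is an assumption and should match the cluster's actual IOP version):

# Pass iop.version to both the driver and the YARN AM so that
# ${iop.version} placeholders in the cluster configs resolve
spark-submit --master yarn-cluster \
  --conf "spark.driver.extraJavaOptions=-Diop.version=4.1.0.0" \
  --conf "spark.yarn.am.extraJavaOptions=-Diop.version=4.1.0.0" \
  ...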
Right now we have to copy the code over to the cluster (e.g. via scp) and run it "locally" there.
Need to look at the alternative: https://developer.ibm.com/hadoop/blog/2016/04/26/submit-spark-jobs-via-rest-iop-4-1-using-livy/
See also https://issues.apache.org/jira/browse/KNOX-516
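For reference, submitting the same job through Livy's batch REST API would look roughly like this (a sketch: livy-host is a made-up hostname, 8998 is Livy's default port, and the jar/args mirror the spark-submit example above):

# POST a batch job to the Livy server; Livy submits it to YARN on our behalf
curl -X POST -H "Content-Type: application/json" \
  -d '{
        "file": "/tmp/scala-examples.jar",
        "className": "org.apache.spark.examples.WordCount",
        "args": ["/tmp/LICENSE"]
      }' \
  http://livy-host:8998/batches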