Open · cancan101 opened this issue 9 years ago
As noted on SPARK-5637, I don't think it makes sense to somehow turn spark-ec2 into a plugin for StarCluster.
That said, if this community wants to independently develop a Spark plugin, or somehow lay out clear steps for how spark-ec2 and StarCluster integration would look, that would be helpful.
Also, FWIW, the latest docs on spark-ec2 are here: https://spark.apache.org/docs/latest/ec2-scripts.html
That's all there is to it:
"""
Starts Spark daemons in standalone mode on starcluster.
Prerequisites: spark, spark's "sbin" folder in PATH in .bashrc
"""
from starcluster.clustersetup import DefaultClusterSetup
from starcluster.logger import log
from starcluster.utils import print_timing
class SetupSparkStandalone(DefaultClusterSetup):
def run(self, nodes, master, user, user_shell, volumes):
master.ssh.execute('source ~/.bashrc && start-master.sh')
for node in nodes:
if not node.is_master():
node.ssh.execute('source ~/.bashrc && start-slave.sh master:7077')
def on_add_node(self, node, nodes, master, user, user_shell, volumes):
node.ssh.execute('source ~/.bashrc && start-slave.sh master:7077')
def on_remove_node(self, node, nodes, master, user, user_shell, volumes):
node.ssh.execute('source ~/.bashrc && stop-slave.sh')
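To actually use it, the plugin then just needs to be registered in the StarCluster config. A minimal sketch, assuming the module above is saved as spark_standalone.py in ~/.starcluster/plugins/ (the plugin and cluster template names here are placeholders):

# ~/.starcluster/config (excerpt)
[plugin spark]
SETUP_CLASS = spark_standalone.SetupSparkStandalone

[cluster smallcluster]
# ... existing cluster settings ...
PLUGINS = spark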
One could also speed up the worker part (only relevant for really large clusters) by using the worker pool that DefaultClusterSetup already provides:

    ...
    cmd = 'source ~/.bashrc && start-slave.sh spark://master:7077'
    for node in nodes:
        if not node.is_master():
            self.pool.simple_job(node.ssh.execute, (cmd,), jobid=node.alias)
    self.pool.wait(len(nodes) - 1)
    ...
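Putting the two together, a parallelized run() might look something like this (just a sketch, assuming the same ~/.bashrc prerequisites and the pool provided by DefaultClusterSetup):

    def run(self, nodes, master, user, user_shell, volumes):
        # Start the Spark master first, then launch all workers in parallel
        # via the plugin's worker pool.
        master.ssh.execute('source ~/.bashrc && start-master.sh')
        cmd = 'source ~/.bashrc && start-slave.sh spark://master:7077'
        for node in nodes:
            if not node.is_master():
                self.pool.simple_job(node.ssh.execute, (cmd,), jobid=node.alias)
        # Wait for every worker (all nodes except the master) to be started.
        self.pool.wait(len(nodes) - 1)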
Currently Spark has its own script for launching EC2 instances: https://spark.apache.org/docs/0.8.1/ec2-scripts.html (and source).
I filed a ticket for Spark here: https://issues.apache.org/jira/browse/SPARK-5637.