jtriley / StarCluster

StarCluster is an open source cluster-computing toolkit for Amazon's Elastic Compute Cloud (EC2).
http://star.mit.edu/cluster
GNU Lesser General Public License v3.0

Spark plugin #496

Open cancan101 opened 9 years ago

cancan101 commented 9 years ago

Currently Spark has its own script for launching EC2 instances: https://spark.apache.org/docs/0.8.1/ec2-scripts.html (and its source).
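
For reference, launching a cluster with that script looks roughly like this (taken from the linked 0.8.1 docs; the key pair, identity file, and cluster name are placeholders):

./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>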

I filed a ticket for Spark here: https://issues.apache.org/jira/browse/SPARK-5637.

nchammas commented 9 years ago

As noted on SPARK-5637, I don't think it makes sense to somehow turn spark-ec2 into a plugin for StarCluster.

That said, if this community wants to independently develop a Spark plugin, or somehow lay out clear steps for how spark-ec2 and StarCluster integration would look, that would be helpful.

Also, FWIW, the latest docs on spark-ec2 are here: https://spark.apache.org/docs/latest/ec2-scripts.html

ypopkov commented 8 years ago

That's all there is to it:

"""                                                                                                                                        
Starts Spark daemons in standalone mode on starcluster.                                                                                    
Prerequisites: spark, spark's "sbin" folder in PATH in .bashrc                                                                               
"""
from starcluster.clustersetup import DefaultClusterSetup
from starcluster.logger import log
from starcluster.utils import print_timing

class SetupSparkStandalone(DefaultClusterSetup):

    def run(self, nodes, master, user, user_shell, volumes):
        # Start the Spark master daemon on the master node first.
        master.ssh.execute('source ~/.bashrc && start-master.sh')
        # Then start a worker on every non-master node, pointing it at the master.
        for node in nodes:
            if not node.is_master():
                node.ssh.execute('source ~/.bashrc && start-slave.sh spark://master:7077')

    def on_add_node(self, node, nodes, master, user, user_shell, volumes):
        # A newly added node just needs a worker daemon registered with the master.
        node.ssh.execute('source ~/.bashrc && start-slave.sh spark://master:7077')

    def on_remove_node(self, node, nodes, master, user, user_shell, volumes):
        # Shut the worker down cleanly before the node leaves the cluster.
        node.ssh.execute('source ~/.bashrc && stop-slave.sh')
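
To use it, the class would be dropped into StarCluster's plugin directory and referenced from the config file; a minimal sketch, assuming the code above is saved as ~/.starcluster/plugins/spark_standalone.py (the plugin and cluster names below are placeholders):

[plugin sparkstandalone]
setup_class = spark_standalone.SetupSparkStandalone

[cluster smallcluster]
plugins = sparkstandalone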

ypopkov commented 8 years ago

One could also speed up the worker part (only relevant for really large clusters) by submitting the start commands to the plugin's pool and waiting once after the loop:

...
# Submit one start-slave job per worker node, then wait for all of them once.
cmd = 'source ~/.bashrc && start-slave.sh spark://master:7077'
for node in nodes:
    if not node.is_master():
        self.pool.simple_job(node.ssh.execute, (cmd,), jobid=node.alias)
self.pool.wait(len(nodes) - 1)
...
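
Put together, the run method from the earlier comment would then look roughly like this (a sketch based on the two snippets above, using self.pool the same way the snippet does):

    def run(self, nodes, master, user, user_shell, volumes):
        # Start the master daemon first so workers have something to register with.
        master.ssh.execute('source ~/.bashrc && start-master.sh')
        # Submit one start-slave job per worker node to the pool, then wait once.
        cmd = 'source ~/.bashrc && start-slave.sh spark://master:7077'
        for node in nodes:
            if not node.is_master():
                self.pool.simple_job(node.ssh.execute, (cmd,), jobid=node.alias)
        self.pool.wait(len(nodes) - 1)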