scality / spark

Apache License 2.0

Clustering Deployment based on Docker

The spark tool provides a distributed cleanup of application-level orphans (SOFS and S3 connectors) and can also find and remove ring orphans.

Requirements

Pull the docker spark-worker image on the servers you want to act as a spark node.

[root@node01 ~]# docker pull patrickdos/spark-worker

Pull the docker spark-master image on one server (this can also be one of the spark nodes).

[root@node01 ~]# docker pull patrickdos/spark-master

Starting the spark cluster

Warning: If you choose SOFS, TACO is mandatory; otherwise the job will fail when it tries to create the path for the output results.

Starting the first worker of a 6 node cluster:

docker run --rm -dit  --net=host --name spark-worker \
           --hostname spark-worker  \
           --add-host spark-master:178.33.63.238 \
           --add-host spark-worker:178.33.63.238  \
           --add-host=node01:178.33.63.238  \
           --add-host=node02:178.33.63.219 \
           --add-host=node03:178.33.63.192 \
           --add-host=node04:178.33.63.213 \
           --add-host=node05:178.33.63.77 \
           --add-host=node06:178.33.63.220 \
           -v /ring/fs/spark/:/fs/spark \
           -v /var/tmp:/tmp \
            patrickdos/spark-worker

Starting the master of the 6 node cluster:

docker run --rm -dit --net=host --name spark-master \
           --hostname spark-master \
           --add-host spark-master:178.33.63.238 \
           --add-host=node01:178.33.63.238  \
           --add-host=node02:178.33.63.219 \
           --add-host=node03:178.33.63.192 \
           --add-host=node04:178.33.63.213 \
           --add-host=node05:178.33.63.77 \
           --add-host=node06:178.33.63.220 \
           patrickdos/spark-master
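
The list of --add-host flags is the same for every container, so it can be generated from a single node list rather than typed by hand. The following is a minimal sketch, not part of the repository: the add_host_flags helper and the NODES variable are hypothetical names, and the command is only printed for review, not executed.

```shell
# Hypothetical helper: print one --add-host flag per "name:ip" argument.
add_host_flags() {
    for pair in "$@"; do
        printf '%s\n' "--add-host=$pair"
    done
}

# Node names and addresses copied from the example cluster above.
NODES="node01:178.33.63.238 node02:178.33.63.219 node03:178.33.63.192
node04:178.33.63.213 node05:178.33.63.77 node06:178.33.63.220"

# Dry run: print the master command for review instead of executing it.
# $NODES and the command substitution are left unquoted on purpose so
# that each entry becomes its own argument.
echo docker run --rm -dit --net=host --name spark-master \
    --hostname spark-master \
    --add-host=spark-master:178.33.63.238 \
    $(add_host_flags $NODES) \
    patrickdos/spark-master
```

Once the printed command looks right, removing the leading echo runs it. The same helper can be reused for the worker command by swapping in the worker-specific options.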

Configuration

Edit scripts/config/config.yaml and fill out the master field accordingly.

master: "spark://178.33.63.238:7077"
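
Before submitting a job it can help to confirm which master address the configuration actually points to. The sketch below assumes the master line is written exactly as shown above (a top-level key with a double-quoted value); read_master and master_endpoint are hypothetical helper names, not part of the spark scripts.

```shell
# Hypothetical helper: print the value of the "master" key in a config file.
read_master() {
    sed -n 's/^master: *"\(.*\)" *$/\1/p' "$1"
}

# Hypothetical helper: split a spark://host:port URL into "host port".
master_endpoint() {
    hostport=${1#spark://}
    echo "${hostport%:*} ${hostport##*:}"
}

# Example with the value shown above.
master_endpoint "spark://178.33.63.238:7077"
```

On a host with the repository checked out, master_endpoint "$(read_master scripts/config/config.yaml)" yields the host and port, which can then be passed to a connectivity check of your choice.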

How to submit a job to the cluster

As you'll notice, the python virtualenv should not be needed to submit the jobs, since all the magic happens inside the docker container.

[root@node01 ~]# cd /root/spark/scripts/
[root@node01 scripts]# python submit.py -s SOFS_FSCK/check_volume.py -r META

Warning: Submit the jobs exactly as shown above. Changes such as adding a ./ to submit.py (e.g. python ./submit.py) or using variables in the script name (e.g. S3_FSCK/s3fsck${step}.py) can cause loading errors!

Single local spark Deployment

Requirements

We recommend running the local instance on the supervisor and adjusting the configuration settings accordingly.

The more memory and cores you have, the faster the MapReduce processing, but the following settings should be safe. Adjust them as needed in the config/config.yml file.

spark.executor.cores: 2
spark.executor.instances: 2
spark.executor.memory: "6g"
spark.driver.memory: "6g"
spark.memory.offHeap.enabled: True
spark.memory.offHeap.size: "4g"
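
To judge whether these settings fit on the supervisor, a rough total can be computed from them. This is only a sketch under stated assumptions: it treats the off-heap size as applying to each executor, counts only the heap for the driver, and ignores JVM overhead, so the result is a lower bound and not a measured figure. The variable names are illustrative.

```shell
# Values taken from the settings above (sizes in GB).
executor_cores=2
executor_instances=2
executor_memory=6
driver_memory=6
offheap_size=4

# Assumption: each executor uses its heap plus its off-heap allowance,
# and the driver uses only its heap. JVM overhead is not counted.
total_cores=$(( executor_cores * executor_instances ))
total_memory=$(( executor_instances * (executor_memory + offheap_size) + driver_memory ))

echo "cores used by executors: $total_cores"
echo "memory lower bound: ${total_memory}g"
```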

For 261622847 keys it takes about 22 GB of disk space, assuming roughly 90 bytes per key:

261622847 * 90 = 23546056230 bytes
23546056230 / 1024 = 22994195 KB
23546056230 / 1024 / 1024 / 1024 ~ 21.9 GB
[root@node01 spark]# du /fs/spark/listkeys-IT.csv/
22388738    /fs/spark/listkeys-IT.csv/
[root@node01 spark]# du -sh /fs/spark/listkeys-IT.csv/ 
22G /fs/spark/listkeys-IT.csv/
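
The same estimate can be repeated for a different key count before starting a run, to check that enough space is available. A minimal sketch follows. The 90 bytes per key comes from the calculation above; the function name is hypothetical, and the arithmetic assumes a shell with 64-bit integers.

```shell
# Hypothetical helper: estimate the key listing size in bytes.
# $1 = number of keys, $2 = bytes per key (defaults to 90).
estimate_listkeys_bytes() {
    echo $(( $1 * ${2:-90} ))
}

bytes=$(estimate_listkeys_bytes 261622847)
echo "$bytes bytes"
# Integer division rounds down, so this prints the whole number of GB.
echo "$(( bytes / 1024 / 1024 / 1024 )) GB (rounded down)"
```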

CentOS 7 installation: Deploy the Spark virtual env

Download the spark_env archive

http://packages.scality.com/extras/centos/7Server/x86_64/scality/spark_env.tgz

Untar it into any directory

[root@node01 tmp]# cd /root/
[root@node01 ~]# tar xzf spark_env.tgz

CentOS 6 installation: Deploy the Spark virtual env

Download the python2.7 and CentOS 6 spark_env archives

http://sreport.scality.com/video/python-2.7-centos6.tgz

http://sreport.scality.com/video/spark_env-centos-6.tgz

Untar the env

[root@node01 tmp]# cd /root/
[root@node01 ~]# tar xvzf spark_env-centos-6.tgz

Untar the python2.7 libs

[root@node01 ~]# cd /
[root@node01 /]# tar xvzf /root/python-2.7-centos6.tgz

Create the following file

[root@node01 ~]# cat /etc/ld.so.conf.d/python27.conf
/usr/local/lib
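
The transcript above only displays the file; one way to create it is sketched below. The write_ld_conf helper is a hypothetical name, and the demonstration writes to a scratch file so that it can be tried without root access.

```shell
# Hypothetical helper: write the library path to the given ld.so config file.
write_ld_conf() {
    printf '%s\n' /usr/local/lib > "$1"
}

# Demonstration against a scratch file; on the target host, run as root:
#   write_ld_conf /etc/ld.so.conf.d/python27.conf
scratch=$(mktemp)
write_ld_conf "$scratch"
cat "$scratch"
rm -f "$scratch"
```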

Load the library

[root@node01 ~]# ldconfig

Update Java to version 1.8

[root@node01 ~]# yum -y install java-1.8.0-openjdk
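
After the installation it is worth confirming which Java version is actually picked up, since another version may already be the default on the host. A minimal sketch, assuming the usual java -version banner that contains version "1.8. for Java 8; is_java8 is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the given `java -version` output reports 1.8.
is_java8() {
    case "$1" in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# java -version writes to stderr, so redirect it into the substitution.
if command -v java >/dev/null 2>&1; then
    if is_java8 "$(java -version 2>&1)"; then
        echo "Java 1.8 is active"
    else
        echo "a different Java version is active"
    fi
else
    echo "java not found in PATH"
fi
```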

Enable the virtual env and download the spark scripts

Activate the virtual env

[root@node01 ~]# source spark_env/bin/activate 

Clone the spark script repository

[root@node01 ~]# git clone git@github.com:scality/spark.git

Or download the latest tarball

https://bitbucket.org/scality/spark/downloads/

Links to spark scripts Documentation

Some Scripts

Check Orphan/Removal

SOFS file-system consistency check