The spark tool provides a distributed cleanup of application-level orphans (SOFS and S3 connectors) and can also find and remove ring orphans.
scripts/SOFS_FSCK/README.md
scripts/S3_FSCK/README.md
scripts/orphan/README.md
Pull the docker spark-worker image on each server that will act as a spark node.
[root@node01 ~]# docker pull patrickdos/spark-worker
Pull the docker spark-master image on one server (it can also be a spark node).
[root@node01 ~]# docker pull patrickdos/spark-master
Warning: If you choose SOFS, TACO is mandatory. Without it, the job fails when it tries to create the path used to output the results.
docker run --rm -dit --net=host --name spark-worker \
--hostname spark-worker \
--add-host spark-master:178.33.63.238 \
--add-host spark-worker:178.33.63.238 \
--add-host=node01:178.33.63.238 \
--add-host=node02:178.33.63.219 \
--add-host=node03:178.33.63.192 \
--add-host=node04:178.33.63.213 \
--add-host=node05:178.33.63.77 \
--add-host=node06:178.33.63.220 \
-v /ring/fs/spark/:/fs/spark \
-v /var/tmp:/tmp \
patrickdos/spark-worker
docker run --rm -dit --net=host --name spark-master \
--hostname spark-master \
--add-host spark-master:178.33.63.238 \
--add-host=node01:178.33.63.238 \
--add-host=node02:178.33.63.219 \
--add-host=node03:178.33.63.192 \
--add-host=node04:178.33.63.213 \
--add-host=node05:178.33.63.77 \
--add-host=node06:178.33.63.220 \
patrickdos/spark-master
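The `--add-host` lists in the two `docker run` commands above are the same apart from the worker alias, so they can be generated from a single node table instead of being typed twice. The following is a minimal sketch under a few assumptions: the node names and addresses are the ones from the example, the master is assumed to run on node01, and `build_docker_run` is a hypothetical helper that is not part of the spark scripts. It only prints the command so that it can be reviewed before use.

```python
import shlex

# Node table copied from the docker run examples above; replace with your own.
NODES = {
    "node01": "178.33.63.238",
    "node02": "178.33.63.219",
    "node03": "178.33.63.192",
    "node04": "178.33.63.213",
    "node05": "178.33.63.77",
    "node06": "178.33.63.220",
}
MASTER_IP = NODES["node01"]  # assumption: the master runs on node01


def build_docker_run(role, worker_ip=None, volumes=()):
    """Return the docker run argument list for role 'master' or 'worker'."""
    if role not in ("master", "worker"):
        raise ValueError("role must be 'master' or 'worker'")
    name = "spark-" + role
    argv = ["docker", "run", "--rm", "-dit", "--net=host",
            "--name", name, "--hostname", name,
            "--add-host", "spark-master:" + MASTER_IP]
    if role == "worker":
        # Assumption: the worker alias points at the server running the container.
        argv += ["--add-host", "spark-worker:" + (worker_ip or MASTER_IP)]
    for host, ip in NODES.items():
        argv.append("--add-host={}:{}".format(host, ip))
    for volume in volumes:
        argv += ["-v", volume]
    argv.append("patrickdos/" + name)
    return argv


if __name__ == "__main__":
    worker = build_docker_run(
        "worker",
        worker_ip=NODES["node01"],
        volumes=["/ring/fs/spark/:/fs/spark", "/var/tmp:/tmp"],
    )
    # Print only; review the command before running it.
    print(" ".join(shlex.quote(arg) for arg in worker))
```

Keeping the node table in one place means a new node only has to be added once for both the master and the worker commands.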
Edit scripts/config/config.yaml and fill out the master field accordingly.
master: "spark://178.33.63.238:7077"
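A typo in the master field is easy to miss until a job fails to connect, so it can help to check the value before submitting. The sketch below is an illustration only: `parse_master` and `is_reachable` are hypothetical helpers, not part of the spark scripts, and the URL is the one from the example above.

```python
import socket
from urllib.parse import urlsplit


def parse_master(url):
    """Split a spark://host:port URL into (host, port); raise ValueError if malformed."""
    parts = urlsplit(url)
    if parts.scheme != "spark":
        raise ValueError("master URL must start with spark://")
    if not parts.hostname or parts.port is None:
        raise ValueError("master URL must contain a host and a port")
    return parts.hostname, parts.port


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host, port = parse_master("spark://178.33.63.238:7077")
    print(host, port, "reachable" if is_reachable(host, port) else "unreachable")
```

The reachability check only confirms that something is listening on the port, not that it is a spark master.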
Note that the python virtualenv is not needed to submit the jobs, since everything runs inside the docker container.
[root@node01 ~]# cd /root/spark/scripts/
[root@node01 scripts]# python submit.py -s SOFS_FSCK/check_volume.py -r META
:warning: Submit the jobs exactly as shown above. Changes such as adding a ./ to submit.py (i.e. python ./submit.py) or using variables in the script name (i.e. S3_FSCK/s3fsck${step}.py) can cause loading errors!
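The two patterns named in the warning can be caught before a command is run, for example when the submit line is kept in a runbook or generated by a wrapper. This is a minimal sketch: `submit_command_problems` is a hypothetical helper, not part of the spark scripts, and it flags only the two patterns the warning mentions. It inspects the command text as written, before any shell expansion.

```python
import shlex


def submit_command_problems(command):
    """Return a list of problems found in a submit command line given as text."""
    problems = []
    # shlex.split tokenizes like a shell but does not expand variables.
    for arg in shlex.split(command):
        if arg.startswith("./"):
            problems.append("leading ./ in argument: " + arg)
        if "$" in arg:
            problems.append("shell variable in argument: " + arg)
    return problems


if __name__ == "__main__":
    for line in (
        "python submit.py -s SOFS_FSCK/check_volume.py -r META",
        "python ./submit.py -s SOFS_FSCK/check_volume.py -r META",
        "python submit.py -s S3_FSCK/s3fsck${step}.py -r META",
    ):
        print(line, "->", submit_command_problems(line) or "ok")
```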
We recommend running the local instance on the supervisor and adjusting the configuration settings accordingly.
More memory and cores make the MapReduce processing faster, but the following values should be safe. Adjust them as needed in the config/config.yml file.
spark.executor.cores: 2
spark.executor.instances: 2
spark.executor.memory: "6g"
spark.driver.memory: "6g"
spark.memory.offHeap.enabled: True
spark.memory.offHeap.size: "4g"
e.g.:
ring> supervisor dsoStorage IT
Storage stats:
Disks: 46
Objects: 261622847
At roughly 90 bytes per key, the key listing for 261622847 keys takes:
261622847 * 90 = 23546056230 bytes; 23546056230 / 1024 ~ 22994195 KiB; 23546056230 / 1024 / 1024 / 1024 ~ 21.9 GiB
[root@node01 spark]# du /fs/spark/listkeys-IT.csv/
22388738 /fs/spark/listkeys-IT.csv/
[root@node01 spark]# du -sh /fs/spark/listkeys-IT.csv/
22G /fs/spark/listkeys-IT.csv/
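The estimate above can be reproduced for any ring from the object count reported by the supervisor. The sketch below assumes the figure of 90 bytes per key used in the example, which is an approximation rather than an exact record size; `estimate_listkeys_bytes`, `to_gib`, and `has_room` are hypothetical helper names.

```python
import shutil

BYTES_PER_KEY = 90  # per-key figure from the example above; an approximation


def estimate_listkeys_bytes(num_keys, bytes_per_key=BYTES_PER_KEY):
    """Estimate the size in bytes of the key listing for num_keys objects."""
    return num_keys * bytes_per_key


def to_gib(num_bytes):
    """Convert a byte count to GiB (1024**3 bytes)."""
    return num_bytes / 1024 ** 3


def has_room(path, needed_bytes):
    """Return True if the filesystem holding path has needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes


if __name__ == "__main__":
    needed = estimate_listkeys_bytes(261622847)
    print("%d bytes (%.1f GiB)" % (needed, to_gib(needed)))
```

Calling `has_room("/fs/spark", needed)` before listing the keys gives an early warning when the output directory is too small, though the real listing can differ from the estimate, as the `du` output above shows.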
http://packages.scality.com/extras/centos/7Server/x86_64/scality/spark_env.tgz
[root@node01 tmp]# cd /root/
[root@node01 ~]# tar xzf spark_env.tgz
http://sreport.scality.com/video/python-2.7-centos6.tgz
http://sreport.scality.com/video/spark_env-centos-6.tgz
[root@node01 tmp]# cd /root/
[root@node01 ~]# tar xvzf spark_env-centos-6.tgz
[root@node01 ~]# cd /
[root@node01 /]# tar xvzf /root/python-2.7-centos6.tgz
[root@node01 ~]# cat /etc/ld.so.conf.d/python27.conf
/usr/local/lib
[root@node01 ~]# ldconfig
[root@node01 ~]# yum -y install java-1.8.0-openjdk
[root@node01 ~]# source spark_env/bin/activate
[root@node01 ~]# git clone git@github.com:scality/spark.git
https://bitbucket.org/scality/spark/downloads/