
Note: this project is no longer supported.

Gentoo Hadoop

An up-to-date deployment process for the Hadoop ecosystem on Gentoo Linux. The ebuilds were collected from various repositories and updated to match the latest software versions and the deployment modes described below.

Motivation

The objective of this project is to ease the installation and deployment of Hadoop components on Gentoo Linux. It supports two deployment modes:

  1. Standard
  2. Sandbox, in a single- or multi-node cluster with minimal resource consumption (able to run on small VMs with 1 core/2 GB RAM each)

Installation Prerequisites

Installation Rules
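
The rules are not spelled out here, but the configuration section and the Cassandra note suggest that node roles and deployment modes are selected through keyword aliases in /etc/hosts (namenode, secondarynode, resourcemanager, historyserver, sandbox, cassandraseed). A minimal sketch for a two-node sandbox cluster, assuming that mechanism; the IPs and hostnames are illustrative:

# /etc/hosts (assumed keyword-alias mechanism; adjust to your network)
192.168.56.10   node1   namenode secondarynode resourcemanager historyserver sandbox
192.168.56.11   node2   sandbox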

Components

Apache Hadoop Common (2.7.1)

Preparation

Installation

emerge sys-cluster/apache-hadoop-bin
su - hdfs -c 'hdfs namenode -format'   # format the namenode
rc-service hadoop-namenode start       # start the namenode
rc-service hadoop-datanode start       # start the datanode
su - hdfs -c 'hadoop fs -mkdir -p /tmp/hadoop-yarn ; hadoop fs -chmod 777 /tmp/hadoop-yarn' # create TMP dir
rc-service hadoop-xxxx start           # start module xxxx (e.g. resourcemanager, nodemanager, historyserver)

This package will create the Unix users hdfs:hadoop, yarn:hadoop and mapred:hadoop (if they do not exist).
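
A quick sanity check that the accounts and their common group exist:

getent passwd hdfs yarn mapred   # one line per service account
getent group hadoop              # shared primary group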

Configuration

Basically everything is configured automatically. The environment files hadoop-env.sh, yarn-env.sh and mapred-env.sh are updated with the proper $JAVA_HOME, and with a minimal Java heap size in sandbox mode. The property files are updated as follows:

core-site.xml
  fs.defaultFS          # hdfs://<hostname of "namenode">
hdfs-site.xml
  dfs.namenode.name.dir # file:/var/lib/hdfs/name
  dfs.datanode.data.dir # file:/var/lib/hdfs/data
  dfs.namenode.secondary.http-address # <hostname of "secondarynode">:50090
  dfs.replication       # number of data nodes if <3 otherwise 3
  dfs.blocksize         # 10M if sandbox otherwise default
  dfs.permissions.superusergroup # set to 'hadoop'
yarn-site.xml
  yarn.nodemanager.aux-services # mapreduce_shuffle
  yarn.resourcemanager.hostname # hostname of "resourcemanager"
  yarn.nodemanager.resource.memory-mb  # set to minimal value if sandbox
  yarn.nodemanager.resource.cpu-vcores # set to 1 if sandbox
  yarn.scheduler.maximum-allocation-mb # set to memory-mb/3 if sandbox
  yarn.nodemanager.vmem-pmem-ratio     # set to 1 if sandbox
mapred-site.xml
  mapreduce.framework.name          # yarn
  mapreduce.jobhistory.address      # <hostname of "historyserver">:10020
  yarn.app.mapreduce.am.resource.mb # set to minimal value if sandbox
  mapreduce.map.memory.mb           # set to minimal value if sandbox
  mapreduce.reduce.memory.mb        # set to minimal value if sandbox
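
To check the values that were actually generated, the effective configuration can be queried with the stock hdfs getconf tool, using the property names listed above:

su - hdfs -c 'hdfs getconf -confKey dfs.replication'   # e.g. 3
su - hdfs -c 'hdfs getconf -confKey fs.defaultFS'      # e.g. hdfs://<namenode host>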

Verifications
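
A minimal smoke test, assuming the MapReduce examples jar bundled with Hadoop 2.7.1 (the install path under Gentoo may differ):

# list the HDFS root, then run the bundled pi estimator as a YARN job (jar location is an assumption)
su - hdfs -c 'hadoop fs -ls /'
su - hdfs -c 'hadoop jar /usr/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 10'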

Apache Pig (0.15.0)

Installation

emerge dev-lang/apache-pig-bin

Verifications
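
A quick check that Pig works, using local mode so a running cluster is not required:

# one-line Pig script reading a local file
cat > /tmp/test.pig <<'EOF'
lines = LOAD '/etc/hosts' USING TextLoader() AS (line:chararray);
DUMP lines;
EOF
pig -x local /tmp/test.pig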

Apache Hive (1.2.1)

Installation
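
No package name is given here; by analogy with the other components it would presumably be something like the line below (hypothetical name), followed by a standard Hive CLI smoke test:

emerge dev-db/apache-hive-bin   # hypothetical package name, by analogy with the other ebuilds
hive -e 'show databases;'       # standard CLI check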

Apache HBase (1.0.2)

Installation

emerge dev-db/apache-hbase-bin

Verifications (in progress)
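
In the meantime, a minimal check through the standard HBase shell, assuming the HBase services are running:

echo "status" | hbase shell   # reports the number of live/dead servers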

Apache Sqoop (1.4.6)

Installation

emerge sys-cluster/apache-sqoop-bin

Verifications
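
A basic check that the Sqoop client is installed and finds the Hadoop libraries:

sqoop version   # prints the installed Sqoop version
sqoop help      # lists the available tools (import, export, ...)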

Spark (1.5.0, pre-built for Hadoop)

Preparation

Spark configuration can be found in /etc/spark

Verifications
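
A minimal check with the standard Spark tooling; run-example ships in the Spark distribution's bin directory (whether the ebuild puts it on the PATH is an assumption):

run-example SparkPi 10             # submits the bundled pi example
spark-shell --master yarn-client   # interactive shell against the YARN cluster (Spark 1.x syntax)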

Solr (5.3.1)

Installation

emerge dev-db/apache-solr-bin
rc-service solr start            # start the Solr server
rc-update add solr               # start Solr automatically at boot

Verifications
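
Solr listens on port 8983 by default; a quick liveness check against its admin API:

# a status reply means Solr is up
curl 'http://localhost:8983/solr/admin/cores?action=STATUS'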

Cassandra (2.2.1, latest)

Note: Cassandra has no dependency on the Hadoop Common packages and can be installed separately.

Preparation

To install Cassandra in cluster mode, just add the keyword cassandraseed to /etc/hosts for the seed node(s). The keyword sandbox can be added too, to reduce the memory settings to a minimum.

Installation

emerge dev-db/apache-cassandra-bin
rc-service cassandra start       # start the DB (to be done on all cluster nodes)
su - cassandra -c 'nodetool status'   # check cluster status

Verifications
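
A minimal check with cqlsh, which ships with Cassandra:

cqlsh -e 'DESCRIBE KEYSPACES'   # lists the system keyspaces if the node is up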

To Do