An up-to-date deployment process for the Hadoop ecosystem on Gentoo Linux. The ebuilds were collected from different repositories and updated to align with the latest software versions and with the deployment modes described below.
The objective of this project is to ease the installation and deployment of Hadoop components on Gentoo Linux. It supports two deployment modes: a regular cluster deployment and a sandbox deployment with minimal settings.
Copy the ebuilds into your local portage tree, e.g. /usr/local/portage/
Accept the testing keywords, for instance in /etc/portage/package.accept_keywords with the included file
Generate the digests: find /usr/local/portage/ -name '*.ebuild' -exec ebuild {} digest \;
(this is a temporary solution until a Gentoo overlay is used)
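For illustration only (the file shipped with the project is the reference), the keyword entries would look something like this, using the package atoms that appear later in this document and assuming an amd64 system:
# /etc/portage/package.accept_keywords (assumed content)
sys-cluster/apache-hadoop-bin ~amd64
dev-lang/apache-pig-bin ~amd64
dev-db/apache-hive-bin ~amd64
dev-db/apache-hbase-bin ~amd64
sys-cluster/apache-sqoop-bin ~amd64
sys-cluster/apache-spark-bin ~amd64
dev-db/apache-solr-bin ~amd64
dev-db/apache-cassandra-bin ~amd64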
Hadoop (hdfs, yarn, etc.)
Preparation
Update /etc/hosts by adding, in a comment at the end of each host's line, the server role(s) supported by that host. Also add the keyword sandbox to each line if you want a sandbox deployment with minimal settings.
Example:
192.168.56.11 hadoop1.mydomain.com hadoop1 # sandbox namenode datanode nodemanager resourcemanager
192.168.56.12 hadoop2.mydomain.com hadoop2 # sandbox secondarynamenode datanode nodemanager historyserver
If this is not done, the installation will assume a single-node cluster.
Installation
emerge sys-cluster/apache-hadoop-bin
su - hdfs -c 'hdfs namenode -format' # format the namenode
rc-service hadoop-namenode start # start the namenode
rc-service hadoop-datanode start # start the datanode
su - hdfs -c 'hadoop fs -mkdir -p /tmp/hadoop-yarn ; hadoop fs -chmod 777 /tmp/hadoop-yarn' # create TMP dir
rc-service hadoop-xxxx start # start module xxxx
This package will create the Unix users hdfs:hadoop, yarn:hadoop and mapred:hadoop (if they do not exist).
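If the services should also start at boot, the standard OpenRC commands can be used; the service names follow the hadoop-xxxx pattern above, and the choice of the default runlevel below is an assumption:
rc-update add hadoop-namenode default        # on the namenode host
rc-update add hadoop-datanode default        # on each datanode host
rc-update add hadoop-nodemanager default     # on each nodemanager host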
Configuration
Basically everything is configured automatically. The environment files hadoop-env.sh, yarn-env.sh and mapred-env.sh are updated with the proper $JAVA_HOME and a minimal Java heap size in case of sandbox.
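As a rough sketch of the kind of lines this produces (the exact path and values are assumptions, not necessarily what the ebuild writes):
# hypothetical excerpt of hadoop-env.sh after installation
export JAVA_HOME=/etc/java-config-2/current-system-vm    # Gentoo system VM symlink (assumed)
export HADOOP_HEAPSIZE=256                               # small heap when the sandbox keyword is set (value assumed)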
The properties files are updated as below:
core-site.xml
fs.defaultFS # hdfs://<hostname of "namenode">
hdfs-site.xml
dfs.namenode.name.dir # file:/var/lib/hdfs/name
dfs.datanode.data.dir # file:/var/lib/hdfs/data
dfs.namenode.secondary.http-address # <hostname of "secondarynamenode">:50090
dfs.replication # number of data nodes if <3 otherwise 3
dfs.blocksize # 10M if sandbox otherwise default
dfs.permissions.superusergroup # set to 'hadoop'
yarn-site.xml
yarn.nodemanager.aux-services # mapreduce_shuffle
yarn.resourcemanager.hostname # hostname of "resourcemanager"
yarn.nodemanager.resource.memory-mb # set to minimal value if sandbox
yarn.nodemanager.resource.cpu-vcores # set to 1 if sandbox
yarn.scheduler.maximum-allocation-mb # set to memory-mb/3 if sandbox
yarn.nodemanager.vmem-pmem-ratio # set to 1 if sandbox
mapred-site.xml
mapreduce.framework.name # yarn
mapreduce.jobhistory.address # <hostname of "historyserver">:10020
yarn.app.mapreduce.am.resource.mb # set to minimal value if sandbox
mapreduce.map.memory.mb # set to minimal value if sandbox
mapreduce.reduce.memory.mb # set to minimal value if sandbox
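To double-check what ended up in the generated files, the configuration can be queried back with hdfs getconf; the two keys below are just examples from the list above:
su - hdfs -c 'hdfs getconf -confKey fs.defaultFS'       # should print hdfs://<hostname of "namenode">
su - hdfs -c 'hdfs getconf -confKey dfs.replication'    # should print the computed replication factor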
Verifications
With a user belonging to the group hadoop (e.g. the Unix user hadoop used in later examples), check basic HDFS operations:
hadoop fs -mkdir -p /user/guest
hadoop fs -put /usr/portage/distfiles/hadoop-2.7.1.tar.gz /user/guest
hadoop fs -rm /user/guest/hadoop-2.7.1.tar.gz
Pig
Installation
emerge dev-lang/apache-pig-bin
Verifications
Download https://cwiki.apache.org/confluence/download/attachments/27822259/pigtutorial.tar.gz, extract from the archive the file excite.log.bz2 and unzip it, then copy it into HDFS:
hadoop fs -put excite.log
(with sandbox settings the file is split into 4 blocks)
Run pig, then enter:
a = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query:chararray);
b = FILTER a BY (query MATCHES '.*queen.*');
STORE b into 'verif_pig';
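To inspect what the STORE statement wrote (the directory name verif_pig comes from the script above), something like the following should work:
hadoop fs -ls verif_pig                 # one part-* file per reducer plus a _SUCCESS marker
hadoop fs -cat 'verif_pig/part-*' | head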
Issues
pig -x tez: not yet supported
pig -useHCatalog: add datanucleus-*.jar and jdbc-mysql.jar to the CLASSPATH
Hive
Installation
As root:
emerge dev-db/apache-hive-bin
This package will create the Unix user hive:hadoop
Verifications
Run hive, then enter the following HQL lines:
CREATE TABLE sample (userid STRING,time INT,query STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA INPATH 'excite.log' OVERWRITE INTO TABLE sample;
SELECT COUNT(*) FROM sample;
-- this will run a mapreduce job that should return 944954
DROP TABLE sample;
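To see where Hive materialized the table (do this before the DROP above), list the warehouse directory; the path below is Hive's default and is an assumption for this setup:
su - hdfs -c 'hadoop fs -ls /user/hive/warehouse/sample'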
HBase
Installation
emerge dev-db/apache-hbase-bin
Verifications
In progress.
Sqoop
Installation
emerge sys-cluster/apache-sqoop-bin
Verifications
In MySQL, create a test table and load it with the excite.log data:
USE test; CREATE TABLE sample (userid varchar(100), time INT,query varchar(100));
LOAD DATA INFILE '/home/hadoop/excite.log' INTO TABLE sample FIELDS TERMINATED BY '\t';
Then import the table with Sqoop:
/opt/sqoop/bin/sqoop import --connect jdbc:mysql://localhost/test --username root --password *** --table sample -m 1
Check in HDFS that the import created a sample directory.
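For instance, assuming the import ran as the same Unix user (the files land under that user's HDFS home directory):
hadoop fs -ls sample                    # should contain part-m-00000
hadoop fs -cat 'sample/part-m-*' | head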
Spark
Preparation
Add the keyword sparkmaster in /etc/hosts to the line of the host that will run the Spark master. Optionally add the keyword sandbox for a deployment with minimal settings.
Example:
192.168.56.11 hadoop1.mydomain.com hadoop1 # sandbox sparkmaster
Installation
emerge sys-cluster/apache-spark-bin
rc-service spark-master start
rc-service spark-worker start # to be done on each cluster node
This package will create the Unix user spark:hadoop (if it does not exist).
Configuration
Spark configuration can be found in /etc/spark
Verifications
Run pyspark and enter:
sc.textFile("SAMPLE.txt").flatMap(lambda s: s.split(" ")).map(lambda s: (s, 1)).reduceByKey(lambda a, b: a + b).collect()
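SAMPLE.txt has to exist beforehand; assuming Spark is wired to HDFS as configured above, any text file will do, e.g. (source file chosen arbitrarily):
hadoop fs -put /etc/hosts SAMPLE.txt    # stage some text under the current user's HDFS home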
Solr
Installation
emerge dev-db/apache-solr-bin
rc-service solr start
rc-update add solr
Verifications
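No check is spelled out here; a minimal one, assuming Solr listens on its default port 8983 on the local host, could be:
curl 'http://localhost:8983/solr/admin/cores?action=STATUS'    # should return a core status document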
Cassandra
Note: cassandra has no dependency on the Hadoop Common packages and can be installed separately.
Preparation
To install cassandra in cluster mode, just add the keyword cassandraseed in /etc/hosts for the seed(s). The keyword sandbox can be added too to reduce the memory settings to a minimum.
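For illustration, reusing the example hosts from above, a seed line could look like this (host name and address are just the earlier example values):
192.168.56.11 hadoop1.mydomain.com hadoop1 # sandbox cassandraseed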
Installation
emerge dev-db/apache-cassandra-bin
rc-service cassandra start # start the DB (to be done on all cluster nodes)
su - cassandra -c 'nodetool status' # cluster status
Verifications